I created 100 fetchlists from a 50-million-URL db, and
when I try and run fetch I get a few fetches done
and then tons of errors from the URL normalizer - is anyone
else seeing this?
050414 014446 fetching http://www.theastonline.com/
050414 014446 fetching
http://authoryellowpages.com/featureslist.asp
I just want to fetch all the pages in http://news.buaa.edu.cn
So I modified my crawl-urlfilter.txt like this:
#
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xl
Thank you
On 4/14/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Stefan Groschupf wrote:
> > Some weeks ago I started writing a small tool to compare
> > results via the command line.
> > However, I never finished the work, but if you like I can send you
> > the sources, but there is still
[ http://issues.apache.org/jira/browse/NUTCH-35?page=history ]
Stefan Grroschupf updated NUTCH-35:
---
Attachment: xmlApiPatchIII.patch
It's a shame; however, I'm sure one day there will be a patch from me that just
needs to be assigned - I hope. :-)
The p
OK Jack, but the details of my analyser aren't particularly exciting.
I need to index a site that has a mixture of documents in English and
Te Reo Maori (indigenous language of New Zealand). Vowels in Te Reo
Maori are sometimes written with short overlines (also known as
macrons), to indicate a
Jérôme Charron wrote:
Using this model is important also from another point of view: with the
current code, where NutchConf is a singleton, it's not possible to run
several tasks in parallel within a single JVM, but with radically
different parameters. E.g.: if you want to run several CrawlTool wit
Folks,
The new wiki is up and running. Basically this means that all the important
pages have been moved over and the FrontPage is pretty much the same as the old
one.
A few links on the FrontPage were broken or obsolete, and those I did not
move over.
Given that no one said otherwise, I would say tha
TestSegmentMergeTool fails
-
Key: NUTCH-40
URL: http://issues.apache.org/jira/browse/NUTCH-40
Project: Nutch
Type: Bug
Reporter: Stefan Grroschupf
Assigned to: Andrzej Bialecki
Priority: Trivial
ant clean && ant test
...
050413 22
You need to modify the JSP page in any case.
What you can also do is write a custom index filter plugin that
adds additional metadata (your link) to the document in the index.
However, you then need to edit the JSP to show your link instead of the
default URL.
Stefan
Am 13.04.2005 um 23:16 schrie
Hi all,
sorry to ask the same question as on the user mailing list, but I didn't
get any answer to my problem.
I have a filesystem with files to index.
-> no problem to index the files.
I want to search them remotely via the WAR using Tomcat.
-> no problem by moving the segments to the correct positio
Dear Nutch developers,
Using the IndexReader, I am able to read the segments and obtain term
frequencies of documents (using their ids).
Now I want to actually retrieve the document data - URL, title,
content, etc. - using the document ids.
1: How can I retrieve the d
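Assuming the indexes involved are plain Lucene indexes (which Nutch's are), stored fields can be fetched by document id through the same IndexReader. A rough sketch against the Lucene 1.4-era API; the index path is an example, and field names like "url" and "title" are the ones the Nutch indexer stores:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DocFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl/index"); // example path
        int id = 0; // a document id obtained earlier, e.g. from term enumeration
        if (!reader.isDeleted(id)) {
            // document(int) loads only the *stored* fields for that doc
            Document doc = reader.document(id);
            System.out.println(doc.get("url"));
            System.out.println(doc.get("title"));
        }
        reader.close();
    }
}
```

Note that `doc.get(...)` returns only fields that were stored at index time; the raw page content lives in the segment data, not in the index.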
> Since we have no QA and no test group yet, what do you think about
> closing bugs when releasing the next version?
> Does that make sense? I will take care of this process but need a rule to
> follow.
Here is a proposal:
Since there's no QA team, why not proceed as follows:
1. mark resolved once a
> Using this model is important also from another point of view: with the
> current code, where NutchConf is a singleton, it's not possible to run
> several tasks in parallel within a single JVM, but with radically
> different parameters. E.g.: if you want to run several CrawlTool with
> different
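The singleton problem described above can be sketched with a toy example (the class names Conf and CrawlTask below are hypothetical stand-ins, not the real Nutch classes): with a process-wide singleton, the second task silently overwrites the first task's parameters, while passing a configuration instance to each task keeps them independent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf: a per-instance configuration
// instead of a process-wide singleton.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

// Hypothetical stand-in for a tool like CrawlTool: it receives its
// configuration explicitly, so two instances can run in the same JVM
// with radically different parameters.
class CrawlTask {
    private final Conf conf;
    CrawlTask(Conf conf) { this.conf = conf; }
    String agentName() { return conf.get("http.agent.name"); }
}

public class ConfDemo {
    public static void main(String[] args) {
        Conf a = new Conf();
        a.set("http.agent.name", "crawler-a");
        Conf b = new Conf();
        b.set("http.agent.name", "crawler-b");

        CrawlTask t1 = new CrawlTask(a);
        CrawlTask t2 = new CrawlTask(b);

        // With a singleton, both tasks would see whichever value was set last.
        System.out.println(t1.agentName()); // crawler-a
        System.out.println(t2.agentName()); // crawler-b
    }
}
```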
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62762
]
Doug Cutting commented on NUTCH-35:
---
TestFetcher is still failing for me with this patch:
fetch of http://sourceforge.net/projects/nutch/ failed with:
java.lang.NoClassDefF
Andrzej:
Thank you for your response to my comments.
The reason I said there may be a bug in the fetcher is that in our case there
was no JVM crash or OOM exception during the fetch, and the fetch process was
successful according to the log file.
So I cannot tell what caused the truncation (Unexpected
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ]
Doug Cutting resolved NUTCH-5:
--
Resolution: Fixed
I have applied this patch.
Thanks, Andy!
> Hit limiter off-by-one bug
> --
>
> Key: NUTCH-5
> UR
Jay Yu wrote:
I have a similar problem where the segread tool (actually any code that
needs to read the seg) was just hanging there forever on a
truncated segment. I think there are at least 2 bugs: one in the fetcher,
which generated the truncated seg without any error message; the 2nd is the
Well
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ]
Doug Cutting closed NUTCH-5:
> Hit limiter off-by-one bug
> --
>
> Key: NUTCH-5
> URL: http://issues.apache.org/jira/browse/NUTCH-5
> Project: Nutch
Stefan Groschupf wrote:
Some weeks ago I started writing a small tool to compare
results via the command line.
However, I never finished the work, but if you like I can send you
the sources, but there is still some work to do.
Mike wrote code to do this a while back. It was difficult to up
I have a similar problem where the segread tool (actually any code that
needs to read the seg) was just hanging there forever on a
truncated segment. I think there are at least 2 bugs: one in the fetcher,
which generated the truncated seg without any error message; the 2nd is the
MapFile/SequenceF
An issue is typically marked as Resolved after the patch is applied,
the unit tests pass, and the modified code is committed to the
repository.
In enterprise environments an issue is typically closed after it has
been verified by QA. In the case of an open-source project there may
be no need for sep
Stephan,
I already started some tests using CLI2. CLI v1, in my opinion,
does not support all required parameters.
Can you please be more specific?
I defined an interface "Tool" and created an AbstractTool class.
Currently I have started changing the existing tools to extend
them.
May be
Sure! While working we mark the issue as in progress, and when the patch
is committed we mark it as resolved. So the only thing to discuss is
when we should close a bug.
Since we have no QA and no test group yet, what do you think about
closing bugs when releasing the next version?
Does that make sense?
I've had problems trying to access truncated segments in the past.
The process would hang when I tried to read the segment. Have you
tried using the segread tool to see if it can be accessed correctly?
Have you tried repairing the segment? One week for 4 million records
is way too long, so I would s
Hello Luke,
Have you changed the default values of the parameters related to indexing?
It helped in my case - yesterday I was indexing a segment of ~3.5 million
pages; indexing took 3.5h and optimization took 10 minutes.
I am using linux (ext3) on AMD Opteron 2.2GHz +SCSI drives.
I am using (probably not the best value
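For reference, the indexing parameters meant here are most likely the Lucene merge settings exposed through Nutch configuration. The property names below are from memory of the 0.x-era nutch-default.xml and should be checked against your version; the values are purely illustrative, not recommendations:

```xml
<!-- overrides placed in nutch-site.xml -->
<property>
  <name>indexer.mergeFactor</name>
  <value>50</value> <!-- higher = faster indexing, more open files -->
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value> <!-- docs buffered in RAM before a segment is flushed -->
</property>
<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value> <!-- cap on docs per merged segment -->
</property>
```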
Hi,
Doug, can you or someone else please commit the classes you suggested?
I think most/all agree, and we can start porting things, but if
everyone now creates their own NutchConfigurable interfaces we will run
into trouble, and people will be unhappy when they need to correct patches
they submitted or pa
Hey,
Is there some sort of optimal or maximum segment size? I have a segment
with 3.9 million records and it appears to be taking a really long time
to index. The index process has been optimizing the index for over a
week. The server I'm running it on is a dual Xeon 3.0 GHz with 2GB of
RAM.
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ]
Andy Liu updated NUTCH-5:
-
Attachment: fix-hitlimiting.patch
Patches NutchBean to fix hit limiting off-by-one issue.
> Hit limiter off-by-one bug
> --
>
> Key: NUTCH-
Please submit a patch.
To construct a patch, do something like:
ant test
# check that there are no failures
ant clean
svn add src/java/org/apache/nutch/myPackage/MyClass.java
svn status
# make sure that you've added all new files
svn diff > my.patch
Doug
David Spencer wrote:
At a glance it seems th
At a glance it seems that org.apache.nutch.db.WebDBInjector should (or
could) have the DMOZ code taken out of it and put somewhere else, as the
DMOZ code is really just a use of WebDBInjector and not essential to it
and in theory there could be lots of different injectors (e.g. URLs from
a DB..
try
+^http://news\.buaa\.edu\.cn/
On 4/13/05, cao yuzhong <[EMAIL PROTECTED]> wrote:
>
> I want to crawl all the pages in http://news.buaa.edu.cn
>
> Following is my crawl-urlfilter.txt:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto|https):
>
> # skip image and other suffixes we can
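For reference, a minimal crawl-urlfilter.txt restricted to news.buaa.edu.cn might look like the sketch below. Note the escaped dots: the filter lines are regular expressions, so an unescaped `.` matches any character, and a trailing `/*` matches zero or more slashes rather than "anything". The skip rules are the stock Nutch defaults from the question above:

```
# skip file:, ftp:, mailto:, and https: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything under news.buaa.edu.cn
+^http://news\.buaa\.edu\.cn/

# skip everything else
-.
```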
You would need to make a custom query filter plugin. You'll want to
look at the query-basic plugin as an example of how it constructs a
Lucene query from a Nutch query.
On Apr 12, 2005 12:34 AM, zhang jin <[EMAIL PROTECTED]> wrote:
> If I want to support OR, how should I do it?
> Thanks very much!
>
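At the Lucene level, OR is just a BooleanQuery whose clauses are all optional (SHOULD), which is what a custom query filter plugin would ultimately have to build. A minimal sketch; this uses the `BooleanClause.Occur` style valid for Lucene 1.9 through 4.x (Lucene 1.4 expressed the same thing as `add(query, false, false)`, and Lucene 5+ uses `BooleanQuery.Builder`):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class OrQueryDemo {
    public static void main(String[] args) {
        // "nutch OR lucene": two SHOULD clauses, neither one required
        BooleanQuery or = new BooleanQuery();
        or.add(new TermQuery(new Term("content", "nutch")),
               BooleanClause.Occur.SHOULD);
        or.add(new TermQuery(new Term("content", "lucene")),
               BooleanClause.Occur.SHOULD);
        System.out.println(or);
    }
}
```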
Stefan Groschupf wrote:
I personally understand the life cycle of an issue like this:
- Create an issue.
- Assign the issue to a developer (optional).
- Resolve the issue as soon as someone starts to work on it.
- Close the issue as soon as the patch is in the sources.
I'm used to not resolving them until
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62673
]
Doug Cutting commented on NUTCH-35:
---
Three unit tests fail after I apply this patch:
[junit] Test org.apache.nutch.analysis.TestQueryParser FAILED
[junit] Test org.a
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62679
]
Stefan Grroschupf commented on NUTCH-35:
Strange. I focused on the plugin system test case, which, by the way, only
works if the include pattern allows all plugins.
Ho
I want to crawl all the pages in http://news.buaa.edu.cn
Following is my crawl-urlfilter.txt:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip
Stefan Groschupf wrote:
Hi developers,
just a comment about the planned porting of tools to actions.
Related to:
http://issues.apache.org/jira/browse/NUTCH-27
(Patch to get a status of running Fetcher)
I suggest that all actions have an API to query status information.
This will be very helpful for th
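A sketch of what such a status API could look like (all names here are hypothetical - this is one possible shape for the Tool interface mentioned earlier in the thread, not existing Nutch code):

```java
// Hypothetical status-reporting contract for long-running tools
// such as the Fetcher; names are illustrative, not Nutch API.
interface Tool {
    String getName();
    Status getStatus();
}

// Immutable snapshot of a tool's progress at one point in time.
final class Status {
    final long done;
    final long total;
    Status(long done, long total) {
        this.done = done;
        this.total = total;
    }
    double percent() {
        return total == 0 ? 0.0 : 100.0 * done / total;
    }
}

public class StatusDemo {
    public static void main(String[] args) {
        Tool fetcher = new Tool() {
            public String getName() { return "fetcher"; }
            public Status getStatus() { return new Status(250, 1000); }
        };
        Status s = fetcher.getStatus();
        System.out.println(fetcher.getName() + ": " + s.percent() + "%"); // fetcher: 25.0%
    }
}
```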