Nutch does not work with a large number of URLs in the fetch queue (25,000,
for example). The version is one of the nightly builds (end of February). I
started it several times, both with and without the topn parameter.

1. First of all, the fetcher threw this exception (several times):
fetch of http://www.wildrosemx.com/news/wp-includes/wlwmanifest.xml failed
with: java.lang.NullPointerException
java.lang.NullPointerException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:812)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:729)
        at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
        at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:290)
        at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:270)
        at org.apache.hadoop.io.Text.write(Text.java:281)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:595)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:357)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:249)
fetcher caught: java.lang.NullPointerException
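The top frame of the trace matches a well-known property of System.arraycopy: it throws a NullPointerException when the source array is null. My guess (an assumption, not confirmed from the Nutch code) is that some field of the record being serialized ends up null by the time it reaches the map output buffer. A minimal standalone sketch of just that behavior:

```java
// Minimal sketch, NOT Nutch code: demonstrates that System.arraycopy
// rejects a null source array with a NullPointerException, matching
// the "at java.lang.System.arraycopy(Native Method)" frame above.
public class ArraycopyNpe {
    public static boolean copyFromNullThrowsNpe() {
        byte[] src = null;           // stand-in for the suspected null value
        byte[] dst = new byte[16];
        try {
            System.arraycopy(src, 0, dst, 0, 4);
            return false;            // not reached
        } catch (NullPointerException e) {
            return true;             // arraycopy threw as expected
        }
    }

    public static void main(String[] args) {
        System.out.println(copyFromNullThrowsNpe()); // prints "true"
    }
}
```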

2. Merging segments takes a huge amount of time. The current segment
contains about 50,000 indexed pages; when I try to fetch and add another
25,000 pages, the merge takes more than 24 hours and Nutch uses more than
100 GB of hard drive space in its temporary folder.

3. The segment merger does not delete the _temporary folder in the
crawl/segments folder after it finishes its work, so invertlinks and index
fail because they try to work with the segments/_temporary folder, which is
empty. But even when I delete this empty _temporary folder, invertlinks does
not work:

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20090302034844
LinkDb: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
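For reference, the steps that produce the failure above look roughly like this (a hedged reconstruction, assuming the standard bin/nutch script and local paths; 20090302034844 is the segment name from the log):

```
# remove the empty leftover folder the merger did not clean up
rm -r crawl/segments/_temporary

# re-run link inversion against the merged segment
bin/nutch invertlinks crawl/linkdb crawl/segments/20090302034844
```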

-- 
View this message in context: 
http://www.nabble.com/Errors.-Nutch-1.0-dev-tp22315216p22315216.html
Sent from the Nutch - User mailing list archive at Nabble.com.
