Nutch does not work with a large number of URLs in the fetch queue (25000, for example). The version is one of the nightly builds (end of February). I ran it several times, both with and without the topN parameter.
1. First of all, the fetcher throws this exception (several times):

fetch of http://www.wildrosemx.com/news/wp-includes/wlwmanifest.xml failed with: java.lang.NullPointerException
java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:812)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:729)
    at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
    at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:290)
    at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:270)
    at org.apache.hadoop.io.Text.write(Text.java:281)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:595)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:357)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:249)
fetcher caught:java.lang.NullPointerException

2. Merging segments takes a huge amount of time. The current segment contains about 50000 indexed pages, and I am trying to fetch and add another 25000 pages, but the merge takes more than 24 hours and Nutch uses more than 100 GB of disk space in the temporary folder.

3. The segment merger does not delete the _temporary folder in crawl/segments after it finishes its work. As a result, invert links and index both fail, because they try to work with the segments/_temporary folder, which is empty. But even after I delete this empty _temporary folder, invert links still does not work:

LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20090302034844
LinkDb: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)

--
View this message in context: http://www.nabble.com/Errors.-Nutch-1.0-dev-tp22315216p22315216.html
Sent from the Nutch - User mailing list archive at Nabble.com.
