This is what I get after setting the logger to DEBUG. This is what I
get in the crawl log. I'm starting my crawl with the following:
nutch crawl urls -dir crawl_test -depth 8 >& crawl_test.log &
fetch of http://www.lib.ncsu.edu/congbibs/house/100hdgst2.html failed with: java.lang.NullPointerException
java.lang.NullPointerException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:822)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:739)
	at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
	at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:290)
	at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:270)
	at org.apache.hadoop.io.Text.write(Text.java:281)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:605)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:786)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:662)
fetcher caught:java.lang.NullPointerException
Attempting to finish item from unknown queue:
org.apache.nutch.fetcher.fetcher$fetchi...@1d57c7
-finishing thread FetcherThread, activeThreads=13
Tim
On May 4, 2009, at 4:09 AM, Andrzej Bialecki wrote:
tsmori wrote:
I'm having an interesting problem that I think revolves around the interplay of a few settings; I'm not really clear on how they affect the crawl.
Currently I have:
content.limit = -1
fetcher.threads = 1000
fetcher.threads.per.host = 100
indexer.max.tokens = 750000
I also increased the JAVA_HEAP space to account for the additional tokens. I'm not getting any out-of-memory errors, so that part should be okay.
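For reference, overrides like these would normally go in conf/nutch-site.xml. A sketch under the assumption of a Nutch 1.x-era setup (property names as they appear in nutch-default.xml; the downloaded-content cap is per protocol, e.g. http.content.limit, and -1 disables it):

```xml
<!-- conf/nutch-site.xml: sketch of the overrides described above -->
<configuration>
  <!-- -1 disables the per-page download size cap for HTTP fetches -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- total fetcher threads -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>1000</value>
  </property>
  <!-- concurrent threads allowed against a single host -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>100</value>
  </property>
  <!-- max tokens indexed per document -->
  <property>
    <name>indexer.max.tokens</name>
    <value>750000</value>
  </property>
</configuration>
```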
The problem is that with the content limit set high or disabled entirely (I have tried other values), I get fetch errors with NullPointerExceptions on one set of files (HTML files); these are fairly large HTML files, but not over 1 MB. If I set the content limit to a reasonable amount, say 5 MB, the NullPointerExceptions go away, but I get a lot of truncation errors on a different group of files (PDF files, all over 5 MB).
Could you please copy the full stack trace, including line numbers?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Timothy S. Mori
Systems Librarian for Enterprise Operations
IT Department
North Carolina State University Libraries
Campus Box 7111
Raleigh, NC 27695-7111
919.515.6182 (phone)