Hi,

I'm running an intranet crawl on a fairly large site with Nutch 0.9,
using this command line:

nice nohup bin/nutch crawl /data/crawl/urls -dir /data/crawl/intranet3
-threads 625 -depth 10

After fetching 300,000 or so pages in the first segment, it crashes
unceremoniously. I see this near the end of hadoop.log:

2007-04-27 04:50:55,415 INFO  fetcher.Fetcher - fetching
http://(internal url).html
2007-04-27 04:50:55,489 WARN  mapred.LocalJobRunner - job_6qek1v
java.lang.ArrayIndexOutOfBoundsException: 401
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:509)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2007-04-27 04:50:57,295 INFO  fetcher.Fetcher - fetch of
http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc

...and in the console:
fetch of http://(internal url).doc failed with: Http code=406,
url=http://(internal url).doc
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

(Environment: RHEL, 8 GB RAM, lots of disk space. Logs show the system
never ran out of disk space.)
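
If it would help, I can post the crawldb stats as well; I'm assuming
something like this is the right way to pull them (the readdb/readseg
options are from my reading of the wiki, so correct me if they're wrong):

# summary counts from the crawldb (fetched / unfetched / gone, etc.)
bin/nutch readdb /data/crawl/intranet3/crawldb -stats

# list the segments the crawl has produced so far
bin/nutch readseg -list -dir /data/crawl/intranet3/segments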

Does anyone have any idea what's going on? How can I continue from
this point, and how can I avoid this sort of crash in the future?
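
The only recovery path I can think of is to abandon the one-shot crawl
command and drive the remaining generate/fetch/updatedb cycles by hand
against the crawldb it left behind, roughly like the sketch below (the
command syntax is from my reading of the tutorial and the -topN value is
just an example, so please correct me if I'm off):

# generate a fresh fetch list from the existing crawldb
bin/nutch generate /data/crawl/intranet3/crawldb /data/crawl/intranet3/segments -topN 100000

# the new segment is the newest timestamped directory under segments/
SEGMENT=`ls -d /data/crawl/intranet3/segments/2* | tail -1`

# fetch it, then fold the results back into the crawldb for the next cycle
bin/nutch fetch $SEGMENT -threads 625
bin/nutch updatedb /data/crawl/intranet3/crawldb $SEGMENT

Would that pick up where the crashed fetch left off, or do I need to
clean up the half-fetched segment first?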

Thanks in advance for your help.
--Mike
