Hi all,

I'm trying to run a whole-web crawl on my notebook, with:
- 40 GB free for the crawl directory
- "hadoop.tmp.dir" pointed at an external USB hard disk with 90 GB free
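The override looks roughly like this (I'm showing conf/nutch-site.xml and an example mount point, not my exact setup):

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- illustrative mount point for the external USB disk -->
    <value>/media/usb-disk/hadoop-tmp</value>
  </property>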
I used the crawl script from the wiki (http://wiki.apache.org/nutch/Crawl) with the following configuration: depth = 10, threads = 5, topN = 10000 (I don't use the adddays parameter).

I ran the script twice in a row without any problem, but in the third cycle, while merging the segments (10 segments of about 90 MB each and 1 segment of about 1 GB), it failed with:

"org.apache.hadoop.fs.FSError: java.io.IOException: File too large"

Would "mergesegs -slice" avoid this problem?

Thanks,
Patricio
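P.S. If I read the SegmentMerger usage right, the sliced merge would be invoked roughly like this (the paths and slice size are only illustrative, not my actual layout):

  # merge all segments under crawl/segments into a new output directory,
  # splitting the result into slices of at most 50000 URLs each
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -slice 50000

My hope is that keeping each output slice small would also keep the individual files below whatever limit is being hit.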
