Hi all,

I'm trying to run a whole-web crawl on my notebook, with:
- 40 GB free for the crawl directory
- “hadoop.tmp.dir” pointed at an external USB hard disk with 90 GB free
  (set roughly as shown in the snippet below).
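
For reference, the override looks roughly like this in conf/nutch-site.xml
(or the Hadoop config, depending on the setup); the mount point is just a
placeholder for my USB disk:

  <configuration>
    <property>
      <!-- Hadoop's base for temporary/working files, moved off the notebook disk -->
      <name>hadoop.tmp.dir</name>
      <value>/media/usbdisk/hadoop-tmp</value>
    </property>
  </configuration>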

I used the crawl script from the wiki
(http://wiki.apache.org/nutch/Crawl) with the following configuration:
depth = 10
threads = 5
topN = 10000
(I don't use the adddays parameter; a rough one-shot equivalent is sketched below.)
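
If I read the script right, those settings correspond more or less to this
all-in-one call (urls/ and crawl/ are just example paths; the script
additionally merges the segments, which is where my problem shows up):

  # one-shot crawl with the same depth/threads/topN values
  bin/nutch crawl urls -dir crawl -depth 10 -threads 5 -topN 10000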

I ran the script twice in a row without any problem, but on the third
run I got the following error while merging the segments (10 segments
of 90 MB and 1 segment of 1 GB):

“org.apache.hadoop.fs.FSError: java.io.IOException: File too large”

Does “mergesegs -slice” avoid this problem?
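
In case it helps to see what I mean, the call I have in mind is something
like the one below; as far as I understand, -slice NNNN splits the merged
output into several segments of at most NNNN URLs each (the slice size and
paths here are just guesses):

  # merge all existing segments, capping each output segment at 50000 URLs
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -slice 50000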

Thanks,
Patricio
