You can limit the size of each segment by using the crawler's -topN option, which caps the number of URLs per segment. You will then have to run multiple crawl cycles to fetch all 40k URLs. Note that if your documents produce new outlinks, those are added to the crawldb after each cycle; the order in which they are fetched is determined by the scoring plugin(s).
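For example (just a sketch assuming a Nutch 0.8/0.9-style command line; the "urls" and "crawl" directory names are placeholders for your own), a one-shot crawl capped at 1000 URLs per segment would look something like:

  bin/nutch crawl urls -dir crawl -depth 10 -topN 1000

Or, if you run the steps by hand, pass -topN to the generate step of each cycle:

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>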
Btw, if you use a local filesystem you might be able to recover some of the fetched data, see: http://issues.apache.org/jira/browse/NUTCH-451

Mathijs

Harmesh, V2solutions wrote:
> hi all,
> I had run a crawl of approximately 40,000 URLs. It stopped partway through
> with an error that no disk space was available. Is there any way to restrict
> the size of segments so that only a few MB go into a particular segment?
> thanks in advance.
