Hi

I set db.update.additions.allowed to false so that Nutch only crawls the pages I injected. I also set both db.default.fetch.interval and db.fetch.interval.default to 7 days, so Nutch re-crawls those pages every 7 days. After every recrawl I merged the segments.

The sizes of segments/content, segments/crawl_fetch, segments/crawl_generate, segments/parse_data and segments/parse_text did not change significantly. However, segments/crawl_parse kept growing after every segment merge (from 120M to 150M, then from 150M to 500M, then from 500M to 1.5G), and eventually the merge failed because it required too much memory.

My questions are: how can I prevent the crawl_parse directory from growing after each recrawl and segment merge? Why did it keep growing? Am I doing something wrong?
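
For reference, the settings described above might look roughly like this in conf/nutch-site.xml. This is only a sketch, not a copy of the actual file: it assumes db.default.fetch.interval is read in days (the older property name) and db.fetch.interval.default is read in seconds (the newer one), so 7 days is written as 7 and 604800 respectively.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Do not add newly discovered URLs to the crawldb; only injected pages are crawled -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>

  <!-- Older property name; value assumed to be in days -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>

  <!-- Newer property name; value assumed to be in seconds (7 * 24 * 3600) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>
</configuration>
```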

Thanks very much for your help.

Best Regards,
--
Justin Yao
Snooth -- Over 2 million ratings and counting...
