I dumped the segment using this command:
bin/nutch readseg -dump crawl/segments/20090330155113 dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
Then I opened the file dumpdir/dump and found a lot of duplicate entries like these:
Recno:: 2124
URL:: http://20g.fr/shop/products_new.php
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.023255814
Signature: null
Metadata:
.....
Recno:: 2125
URL:: http://20g.fr/shop/products_new.php?osC
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:
......
The same record for the same URL was duplicated tens or even hundreds of
times.
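A rough way to quantify this (a sketch, assuming the dump keeps the
Recno::/CrawlDatum:: layout shown above) is to compare the number of
records with the number of CrawlDatum entries; a large gap means many
CrawlDatums per URL:

  # records in the dump (one per URL)
  grep -c '^Recno::' dumpdir/dump
  # CrawlDatum entries in the dump (many per URL in my case)
  grep -c '^CrawlDatum::' dumpdir/dump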
Does anyone know what may cause this problem?
Thanks,
Justin
Justin Yao wrote:
A correction to my email: "crawl_data" should be "crawl_parse".
Justin
Justin Yao wrote:
Hi
I set db.update.additions.allowed to false so Nutch only crawls the
pages I injected, and I set db.default.fetch.interval and
db.fetch.interval.default to 7 days so Nutch re-crawls those pages every
7 days.
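For reference, the relevant overrides in my conf/nutch-site.xml look
roughly like this (a sketch; 604800 seconds = 7 days):

  <!-- do not add newly discovered outlinks; only crawl injected URLs -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
  <!-- re-fetch pages every 7 days (604800 seconds) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>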
I merged segments after every recrawl. There was no significant change
in the size of segments/content, segments/crawl_fetch,
segments/crawl_generate, segments/parse_data, or segments/parse_text.
However, segments/crawl_parse kept growing after every segment merge
(crawl_data grew from 120M to 150M, then to 500M, then to 1.5G), and
eventually the segment merge failed because it required too much memory.
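The merge step itself is along these lines (a sketch; crawl/MERGEDsegments
is just a placeholder for the output directory, not the exact path I use):

  # merge all per-round segments under crawl/segments into one new segment
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments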
My question is: how can I prevent the crawl_data directory from growing
after each recrawl and segment merge? Why does it keep growing? Am I
doing something wrong?
Thanks very much for your help.
Best Regards,
--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com
Snooth -- Over 2 million ratings and counting...