I dumped the segment using the command:

bin/nutch readseg -dump crawl/segments/20090330155113 dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext

Then I opened the file dumpdir/dump and found a lot of duplicate entries like:


Recno:: 2124
URL:: http://20g.fr/shop/products_new.php

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.023255814
Signature: null
Metadata:

.....


Recno:: 2125
URL:: http://20g.fr/shop/products_new.php?osC

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:

......

The same record for a single URL was duplicated tens or even hundreds of times.
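
A rough way to quantify this from the dump file (a hypothetical one-liner, assuming the record format shown above) is to count how many CrawlDatum:: blocks appear under each URL:: line:

awk '/^URL::/ {url=$2} /^CrawlDatum::/ {n[url]++} END {for (u in n) print n[u], u}' dumpdir/dump | sort -rn | head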

Does anyone know what might be causing this problem?

Thanks,
Justin

Justin Yao wrote:
A correction to my email:

"crawl_data" should be "crawl_parse".

Justin

Justin Yao wrote:
Hi

I set db.update.additions.allowed to false so Nutch will only crawl the pages I injected, and I set db.default.fetch.interval and db.fetch.interval.default to 7 days so Nutch re-crawls those pages every 7 days. I do a segment merge after every recrawl.

There is no significant change in the file size of segments/content, segments/crawl_fetch, segments/crawl_generate, segments/parse_data, or segments/parse_text. However, segments/crawl_parse kept growing after every segment merge (crawl_data grew from 120M to 150M, then from 150M to 500M, then from 500M to 1.5G), and eventually it made the segment merge fail because it required too much memory.

My question is: how do I prevent the crawl_data directory from growing after each recrawl and segment merge? Why does it keep growing? Am I doing something wrong?
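
For reference, here is a rough sketch of the recrawl-and-merge cycle described above, assuming a standard Nutch 1.x command-line workflow; the paths, the merged-output directory name, and the exact options are illustrative, not a prescription:

# overrides as described above, set in conf/nutch-site.xml:
#   db.update.additions.allowed = false
#   db.fetch.interval.default   = 604800   (7 days, in seconds)

bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $segment
bin/nutch parse $segment        # only needed if the fetcher is not parsing
bin/nutch updatedb crawl/crawldb $segment

# merge all existing segments into one new segment
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments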

Thanks very much for your help.

Best Regards,


--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com

Snooth -- Over 2 million ratings and counting...
