I dumped the segment using this command:
bin/nutch readseg -dump crawl/segments/20090330155113 dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
Then I opened the file dumpdir/dump and found a lot of duplicate entries like these:
Recno:: 2124
URL:: http://20g.fr/shop/products_new.php
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.023255814
Signature: null
Metadata:
.....
Recno:: 2125
URL:: http://20g.fr/shop/products_new.php?osC
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:
......
The same record for the same URL was duplicated tens or even hundreds of
times.
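A rough way to quantify this (a sketch, assuming the dump keeps the
Recno::/CrawlDatum:: layout shown above) is to compare the number of
records with the number of CrawlDatum entries; a large gap means many
CrawlDatums per URL:

  # records in the dump (one per URL)
  grep -c '^Recno::' dumpdir/dump
  # CrawlDatum entries in the dump (many per URL in my case)
  grep -c '^CrawlDatum::' dumpdir/dump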
Does anyone know what may cause this problem?
Thanks,
Justin
Justin Yao wrote:
A correction to my email: "crawl_data" should be "crawl_parse".
Justin
Justin Yao wrote:
Hi
I set db.update.additions.allowed to false so Nutch only crawls the
pages I injected, and I set db.default.fetch.interval and
db.fetch.interval.default to 7 days so Nutch re-crawls those pages every
7 days.
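For reference, the relevant overrides in my conf/nutch-site.xml look
roughly like this (a sketch; 604800 seconds = 7 days):

  <!-- do not add newly discovered outlinks; only crawl injected URLs -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
  <!-- re-fetch pages every 7 days (604800 seconds) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>604800</value>
  </property>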
I merged segments after every recrawl. There was no significant change
in the size of segments/content, segments/crawl_fetch,
segments/crawl_generate, segments/parse_data, or segments/parse_text.
However, segments/crawl_parse kept growing after every segment merge
(crawl_data grew from 120M to 150M, then to 500M, then to 1.5G), and
eventually the segment merge failed because it required too much memory.
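The merge step itself is along these lines (a sketch; crawl/MERGEDsegments
is just a placeholder for the output directory, not the exact path I use):

  # merge all per-round segments under crawl/segments into one new segment
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments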
My question is: how can I prevent the crawl_data directory from growing
after each recrawl and segment merge? Why does it keep growing? Am I
doing something wrong?
Thanks very much for your help.
Best Regards,
--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com
Snooth -- Over 2 million ratings and counting...