It seems that if a destination URL is never fetched, Nutch has no signature for that page, and then the CrawlDatum data for that page won't be merged during the segment merging process. Is this correct?
I found the following dump:
Recno:: 116
URL:: http://20g.fr/shop/product_info.php?products_id=1031
CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 80407d5db2d88c2494915eeaee578f5c
Metadata:
Recno:: 117
URL:: http://20g.fr/shop/product_info.php?products_id=1031&action=notify
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.018518519
Signature: null
Metadata:
"http://20g.fr/shop/product_info.php?products_id=1031" exists in my
crawldb and has been fetched.
"http://20g.fr/shop/product_info.php?products_id=1031&action=notify"
doesn't exist in my crawldb and is never fetched because I set
"db.update.additions.allowed" to "false". However, it is a link on the
page "http://20g.fr/shop/product_info.php?products_id=1031", so it is
recorded for score calculation. Nutch doesn't know the signature for
this page, so it can't merge those CrawlDatum-s. With more and more
recrawling and segment merging, handling this data becomes a big pain.
Is there any way to merge those CrawlDatum-s with a null signature, or
how could I eliminate them?
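One way to quantify how badly a URL's entries have piled up is to count CrawlDatum blocks per URL in a readseg dump. The pipeline below is only a sketch: the sample file it writes is fabricated to mimic the dump layout so the commands run standalone; point the awk at the real dumpdir/dump instead.

```shell
# Fabricate a miniature dump in the same layout `bin/nutch readseg -dump` produces.
printf 'Recno:: 0\nURL:: http://a/\nCrawlDatum::\nSignature: null\nCrawlDatum::\nSignature: null\nCrawlDatum::\nSignature: null\nRecno:: 1\nURL:: http://b/\nCrawlDatum::\nSignature: abc\n' > dump_sample

# Count CrawlDatum blocks per URL; heavily repeated URLs are the pile-ups.
awk '/^URL::/ {url=$2} /^CrawlDatum::/ {n[url]++} END {for (u in n) print n[u], u}' dump_sample | sort -rn
```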
Thanks,
Justin
Justin Yao wrote:
It's possible that my conclusion in the last email is wrong: I found
some CrawlDatum-s for the same URL that have different scores, as listed
below. Either way, this issue has caused me a lot of headaches. We can't
merge those segments now because the merge takes too much memory. I even
tried allocating 7 GB of memory to each child task; it still failed with
all memory consumed and the reducer hanging.
Recno:: 3285
URL::
http://adserver.adtech.de/adiframe|3.0|224|1105005|0|168|ADTECH;target=_blank;grp=1
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.012987013
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:13 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.013157895
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:13 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.013333334
Signature: null
Justin Yao wrote:
Hi,
I have more information on this problem, and it seems close to the root
cause: those CrawlDatum-s without a signature are not merged correctly.
For example:
Recno:: 233
URL:: http://20g.fr/shop/index.php?cPath=54&osC
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
Here is my analysis:
a) I checked the merged segment data from the merge done on 04/02/2009:
$ bin/nutch readdb /home/snoothbot/crawl_20090402170013/crawldb -stats
CrawlDb statistics start: /home/snoothbot/crawl_20090402170013/crawldb
Statistics for CrawlDb: /home/snoothbot/crawl_20090402170013/crawldb
TOTAL urls: 605601
retry 0: 593942
retry 1: 706
retry 10: 1
retry 11: 15
retry 12: 35
retry 13: 267
retry 14: 522
retry 15: 1914
retry 16: 1639
retry 17: 955
retry 18: 698
retry 19: 834
retry 2: 309
retry 20: 920
retry 21: 784
retry 22: 601
retry 23: 350
retry 24: 737
retry 25: 54
retry 3: 6
retry 4: 8
retry 8: 157
retry 9: 147
min score: 1.0
avg score: 1.0247672
max score: 189.096
status 1 (db_unfetched): 53404
status 2 (db_fetched): 437700
status 3 (db_gone): 62983
status 4 (db_redir_temp): 23295
status 5 (db_redir_perm): 28219
CrawlDb statistics: done
$ bin/nutch readseg -dump crawl_20090402170013/segments/20090402110727
dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
$ cd dumpdir
$ grep '^Signature' dump > sdump
$ wc -l sdump
5748848 sdump
$ grep 'Signature: null' sdump | wc -l
5311158
5748848 - 5311158 = 437690
437,690 is the number of CrawlDatum-s with a non-null signature in the
dump file; the other 5,311,158 have a null signature.
*** 437690 ~= 437700 (db_fetched)
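The two grep counts above can also be collapsed into a single awk pass. The sample file below is fabricated in the same one-signature-per-line layout as the grep'd sdump so the command runs standalone; substitute the real file.

```shell
# Fabricate a tiny signature-only file like the grep'd sdump.
printf 'Signature: null\nSignature: 80407d5db2d88c2494915eeaee578f5c\nSignature: null\n' > sdump_sample

# Total signatures minus null signatures = non-null (fetched) count.
awk '/^Signature:/ {total++} /^Signature: null$/ {nulls++} END {print total - nulls, "non-null of", total}' sdump_sample
```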
b) I checked the merged segment data from the merge done on 04/06/2009:
$ bin/nutch readdb /home/snoothbot/crawl_20090406202406/crawldb -stats
CrawlDb statistics start: /home/snoothbot/crawl_20090406202406/crawldb
Statistics for CrawlDb: /home/snoothbot/crawl_20090406202406/crawldb
TOTAL urls: 605601
retry 0: 594396
retry 1: 688
retry 16: 145
retry 17: 3
retry 2: 309
retry 20: 1
retry 21: 143
retry 22: 4
retry 23: 26
retry 24: 15
retry 26: 1
retry 27: 1
retry 28: 181
retry 29: 270
retry 3: 6
retry 30: 519
retry 31: 1869
retry 32: 1374
retry 33: 895
retry 34: 676
retry 35: 794
retry 36: 883
retry 37: 732
retry 38: 588
retry 39: 325
retry 4: 7
retry 40: 694
retry 41: 53
retry 6: 1
retry 8: 1
retry 9: 1
min score: 1.0
avg score: 1.0248147
max score: 189.096
status 1 (db_unfetched): 30091
status 2 (db_fetched): 441368
status 3 (db_gone): 78547
status 4 (db_redir_temp): 23462
status 5 (db_redir_perm): 32133
CrawlDb statistics: done
$ bin/nutch readseg -dump crawl_20090406202406/segments/20090406113824
dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
$ cd dumpdir
$ grep '^Signature' dump > sdump
$ wc -l sdump
11003435 sdump
$ grep 'Signature: null' sdump | wc -l
10562077
11003435 - 10562077 = 441358
441,358 is the number of CrawlDatum-s with a non-null signature in the
dump file; the other 10,562,077 have a null signature.
*** 441358 ~= 441368 (db_fetched)
I injected around 0.5M URLs into the crawldb at the beginning, and I set
"db.update.additions.allowed" to "false" in nutch-site.xml to make sure
no new URLs would be added. I'm not sure whether this setting has a side
effect on the problem above.
Could any of you provide some ideas on how to fix it?
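For reference, this is how the property is set in nutch-site.xml (a minimal fragment, assuming the stock Nutch 1.0 property name; the description text is mine):

```xml
<!-- nutch-site.xml: keep updatedb from adding newly discovered links to the crawldb -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb will not add newly discovered URLs to the crawldb.</description>
</property>
```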
Thanks,
Justin
Justin Yao wrote:
Hi Doğacan,
Thanks for your suggestion.
I've downloaded nutch-1.0 and started fresh with the crawling, merging,
recrawling, merging process.
I'll open a JIRA issue if I can reproduce the problem.
Thanks,
Justin
Doğacan Güney wrote:
On Mon, Mar 30, 2009 at 23:05, Justin Yao <jus...@snooth.com> wrote:
I dumped the segment using command:
bin/nutch readseg -dump crawl/segments/20090330155113 dumpdir
-nocontent
-nofetch -nogenerate -noparsedata -noparsetext
then I opened file dumpdir/dump and found a lot of duplicate
entries like:
Recno:: 2124
URL:: http://20g.fr/shop/products_new.php
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.023255814
Signature: null
Metadata:
.....
Recno:: 2125
URL:: http://20g.fr/shop/products_new.php?osC
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata
CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata
......
The same records for the same URL were duplicated tens or even hundreds
of times.
Does anyone know what may cause this problem?
This is not necessarily a bug, as identical entries may legitimately be
duplicated. But this happening hundreds of times is indeed fishy. Can
you create a JIRA issue with the details?
Thanks,
Justin
Justin Yao wrote:
a correction to my email:
the "crawl_data" should be "crawl_parse"
Justin
Justin Yao wrote:
Hi
I set db.update.additions.allowed to false so Nutch will only crawl the
pages I injected. I set db.default.fetch.interval and
db.fetch.interval.default to 7 days so Nutch re-crawls those pages every
7 days.
I did a segment merge after every recrawl. There is no significant file
size change in segments/content, segments/crawl_fetch,
segments/crawl_generate, segments/parse_data, or segments/parse_text.
However, segments/crawl_parse kept growing after every segment merge
(crawl_data grew from 120M to 150M, from 150M to 500M, from 500M to
1.5G), and eventually the segment merge failed because it required too
much memory.
My question is: how do I prevent the crawl_data directory from growing
after each recrawl and segment merge? Why does it keep growing? Am I
doing something wrong?
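To see which subdirectory is responsible between merges, a du loop per segment can help. This is just a sketch: the mkdir line fabricates an empty segment layout with made-up segment names so the loop runs standalone; drop it and point the loop at the real segments directory, then compare the numbers after each merge.

```shell
# Fabricated segment layout (names are illustrative).
mkdir -p crawl/segments/20090330155113/crawl_parse crawl/segments/20090406113824/crawl_parse

# Print crawl_parse size (KB) per segment; run after each merge and compare runs.
for seg in crawl/segments/*; do
  printf '%s\t%sK\n' "$seg" "$(du -sk "$seg/crawl_parse" | cut -f1)"
done
```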
Thanks very much for your help.
Best Regards,
--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com
Snooth -- Over 2 million ratings and counting...