It seems that if a destination URL is never fetched, Nutch has no signature for that page, and then the CrawlDatum entries for that page won't be merged during the segment merging process. Is this correct?

I found the following in a segment dump:

Recno:: 116
URL:: http://20g.fr/shop/product_info.php?products_id=1031

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 80407d5db2d88c2494915eeaee578f5c
Metadata:

Recno:: 117
URL:: http://20g.fr/shop/product_info.php?products_id=1031&action=notify

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.018518519
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Apr 08 18:39:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 0.018518519
Signature: null
Metadata:


"http://20g.fr/shop/product_info.php?products_id=1031"; exists in my crawldb and it is fetched, "http://20g.fr/shop/product_info.php?products_id=1031&action=notify"; doesn't exist in my crawldb and it's never be fetched because I set "db.update.additions.allowed" to "false". However, "http://20g.fr/shop/product_info.php?products_id=1031&action=notify"; is a link in page "http://20g.fr/shop/product_info.php?products_id=1031"; so it will be recorded for score calculation. Nutch doesn't know the signature for this page and it can't merge those CrawlDatum-s. As I have more and more recrawling and segment merging, it becomes a big pain to handle those data.

Is there any way to merge those CrawlDatum-s with a null signature, or how could I eliminate them?

Thanks,
Justin





Justin Yao wrote:
It's possible that the conclusion in my last email is wrong. I found that some CrawlDatum-s for the same URL have different scores, as listed below.

In any case, it's a really annoying issue that has given me a lot of headaches. We can't merge those segments now because it takes too much memory. I even tried allocating 7G of memory to each child task; it still failed with all the memory used up and the reducer hanging.

Recno:: 3285
URL:: http://adserver.adtech.de/adiframe|3.0|224|1105005|0|168|ADTECH;target=_blank;grp=1

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:12 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.012987013
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:13 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.013157895
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 10:54:13 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.013333334
Signature: null



Justin Yao wrote:
Hi,

I have more information on this problem, and it seems to be pretty close to the root cause.

It seems that CrawlDatum-s without a signature are not merged correctly. For example:

Recno:: 233
URL:: http://20g.fr/shop/index.php?cPath=54&osC

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Mar 31 15:25:09 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:

Here is my analysis:

a). I checked the data from the merge done on 04/02/2009:

$ bin/nutch readdb /home/snoothbot/crawl_20090402170013/crawldb -stats
CrawlDb statistics start: /home/snoothbot/crawl_20090402170013/crawldb
Statistics for CrawlDb: /home/snoothbot/crawl_20090402170013/crawldb
TOTAL urls:     605601
retry 0:        593942
retry 1:        706
retry 10:       1
retry 11:       15
retry 12:       35
retry 13:       267
retry 14:       522
retry 15:       1914
retry 16:       1639
retry 17:       955
retry 18:       698
retry 19:       834
retry 2:        309
retry 20:       920
retry 21:       784
retry 22:       601
retry 23:       350
retry 24:       737
retry 25:       54
retry 3:        6
retry 4:        8
retry 8:        157
retry 9:        147
min score:      1.0
avg score:      1.0247672
max score:      189.096
status 1 (db_unfetched):        53404
status 2 (db_fetched):  437700
status 3 (db_gone):     62983
status 4 (db_redir_temp):       23295
status 5 (db_redir_perm):       28219
CrawlDb statistics: done

$ bin/nutch readseg -dump crawl_20090402170013/segments/20090402110727 dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
$ cd dumpdir
$ grep '^Signature' dump > sdump
$ wc -l sdump
5748848 sdump
$ grep 'Signature: null' sdump | wc -l
5311158

5748848 - 5311158 = 437690

437,690 is the number of CrawlDatum records with a non-null signature in the dump file.
There are a lot of CrawlDatum-s with a null signature (5,311,158).

*** 437690 ~= (db_fetched):  437700
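
The same number can also be obtained in one pass over the dump (a sketch; "dump" is the readseg dump file from the commands above, and given the counts above it should print 437690):

# count Signature lines that are not "null", i.e. CrawlDatum-s with a real signature
$ grep '^Signature:' dump | grep -cv null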

b). I checked the data from the merge done on 04/06/2009:

$ bin/nutch readdb /home/snoothbot/crawl_20090406202406/crawldb -stats
CrawlDb statistics start: /home/snoothbot/crawl_20090406202406/crawldb
Statistics for CrawlDb: /home/snoothbot/crawl_20090406202406/crawldb
TOTAL urls:     605601
retry 0:        594396
retry 1:        688
retry 16:       145
retry 17:       3
retry 2:        309
retry 20:       1
retry 21:       143
retry 22:       4
retry 23:       26
retry 24:       15
retry 26:       1
retry 27:       1
retry 28:       181
retry 29:       270
retry 3:        6
retry 30:       519
retry 31:       1869
retry 32:       1374
retry 33:       895
retry 34:       676
retry 35:       794
retry 36:       883
retry 37:       732
retry 38:       588
retry 39:       325
retry 4:        7
retry 40:       694
retry 41:       53
retry 6:        1
retry 8:        1
retry 9:        1
min score:      1.0
avg score:      1.0248147
max score:      189.096
status 1 (db_unfetched):        30091
status 2 (db_fetched):  441368
status 3 (db_gone):     78547
status 4 (db_redir_temp):       23462
status 5 (db_redir_perm):       32133
CrawlDb statistics: done


$ bin/nutch readseg -dump crawl_20090406202406/segments/20090406113824 dumpdir -nocontent -nofetch -nogenerate -noparsedata -noparsetext
$ cd dumpdir
$ grep '^Signature' dump > sdump
$ wc -l sdump
11003435 sdump
$ grep 'Signature: null' sdump | wc -l
10562077

11003435 - 10562077 = 441358

441,358 is the number of CrawlDatum records with a non-null signature in the dump file.
There are a lot of CrawlDatum-s with a null signature (10,562,077).

*** 441358 ~= (db_fetched):  441368
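
Putting the two dumps side by side shows the growth rate (a quick sketch with bc, using the totals above; the crawldb stays at 605,601 URLs in both cases):

# average number of CrawlDatum records per crawldb URL in each merged segment
$ echo "scale=1; 5748848 / 605601" | bc      # 04/02 segment -> 9.4
$ echo "scale=1; 11003435 / 605601" | bc     # 04/06 segment -> 18.1

So the number of CrawlDatum records per URL roughly doubled between the two merges.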


I injected around 0.5M URLs into the crawldb at the beginning, and I set "db.update.additions.allowed" to "false" in nutch-site.xml to make sure no new URLs would be added. I'm not sure whether this setting has any side effect on the problem above.

Could any of you provide some ideas on how to fix this?

Thanks,
Justin

Justin Yao wrote:
Hi Doğacan,

Thanks for your suggestion.
I've downloaded nutch-1.0 and restarted the crawl, merge, recrawl, merge process from scratch.
I'll open a JIRA issue if I can reproduce the problem.

Thanks,
Justin

Doğacan Güney wrote:
On Mon, Mar 30, 2009 at 23:05, Justin Yao <jus...@snooth.com> wrote:

I dumped the segment using this command:

bin/nutch readseg -dump crawl/segments/20090330155113 dumpdir -nocontent
-nofetch -nogenerate -noparsedata -noparsetext

Then I opened the file dumpdir/dump and found a lot of duplicate entries like these:


Recno:: 2124
URL:: http://20g.fr/shop/products_new.php

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018518519
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:05 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.023255814
Signature: null
Metadata:

.....


Recno:: 2125
URL:: http://20g.fr/shop/products_new.php?osC

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.022727273
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Wed Mar 18 11:02:06 EDT 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.018181818
Signature: null
Metadata:

......

The same records for a single URL were duplicated tens or even hundreds of times.

Do any of you know what might cause this problem?


This is not necessarily a bug, as the same entries may be duplicated. But this happening
hundreds of times is indeed fishy. Can you create a JIRA issue with the details?


Thanks,
Justin


Justin Yao wrote:

A correction to my email:

the "crawl_data" should be "crawl_parse"

Justin

Justin Yao wrote:

Hi

I set db.update.additions.allowed to false so Nutch will only crawl the
pages I injected. I set db.default.fetch.interval and
db.fetch.interval.default to 7 days so Nutch will re-crawl those pages
every 7 days.
I did a segment merge after every recrawl. There is no significant
file size change in segments/content, segments/crawl_fetch,
segments/crawl_generate, segments/parse_data, or segments/parse_text. However, segments/crawl_parse kept growing after every segment merge (crawl_data grew from 120M to 150M, from 150M to 500M, then from 500M to 1.5G), and eventually it made the segment merging fail because it required too much
memory.
My question is: how do I prevent the crawl_data directory from growing after each recrawl and segment merge? Why did it keep growing? Am I doing
something wrong?
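
A quick way to watch which part actually grows after each merge is to compare the subdirectory sizes (a sketch; the segment path is just an example, and if the data lives on HDFS use "hadoop fs -du" instead of du):

# compare the sizes of the merged segment's subdirectories after each merge
$ du -sh crawl/segments/*/crawl_parse crawl/segments/*/crawl_fetch \
         crawl/segments/*/parse_data crawl/segments/*/parse_text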

Thanks very much for your help.

Best Regards,


--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com

Snooth -- Over 2 million ratings and counting...








