Hello,

I am trying to determine which protocol plugin to use in Nutch: http or
httpclient.  So, I ran 5 consecutive crawls (with the same configuration)
with each protocol.  The crawls targeted a static site that I set up
on another machine in my LAN.  This static site has approx. 28,000 pages.


Here's an example call for the crawl:

 % ./bin/nutch crawl urls/onion.txt -dir ~/nutch/http-expt/crawl-onion-1 -depth 20

However, I found that each crawl (even if it had the same config and
protocol plugin) resulted in a different number of URLs being processed. I
was expecting Nutch to return similar results for every crawl run. Is this a
known behavior of Nutch?  Is there any reason why Nutch would produce such
results?

=========================================

The following are the results of the crawl experiments:

Using org.apache.nutch.protocol.httpclient:
------------------------------------------

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-1/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-1/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-1/crawldb
TOTAL urls:     12631
retry 0:        12571
retry 1:        24
retry 2:        19
retry 3:        1
retry 4:        2
retry 5:        1
retry 6:        13
min score:      0.0
avg score:      0.0
max score:      1.038
status 1 (db_unfetched):        6301
status 2 (db_fetched):  5506
status 3 (db_gone):     824
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-2/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-2/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-2/crawldb
TOTAL urls:     15240
retry 0:        15106
retry 1:        31
retry 2:        36
retry 3:        16
retry 4:        12
retry 5:        4
retry 6:        35
min score:      0.0
avg score:      0.0
max score:      1.039
status 1 (db_unfetched):        4456
status 2 (db_fetched):  8978
status 3 (db_gone):     1805
status 5 (db_redir_perm):       1
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-3/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-3/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-3/crawldb
TOTAL urls:     17893
retry 0:        17704
retry 1:        23
retry 2:        20
retry 3:        18
retry 4:        23
retry 5:        25
retry 6:        80
min score:      0.0
avg score:      0.0010
max score:      14.035
status 1 (db_unfetched):        2595
status 2 (db_fetched):  12434
status 3 (db_gone):     2861
status 5 (db_redir_perm):       3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-4/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-4/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-4/crawldb
TOTAL urls:     18720
retry 0:        18505
retry 1:        8
retry 2:        4
retry 4:        1
retry 5:        2
retry 6:        200
min score:      0.0
avg score:      0.0020
max score:      36.198
status 1 (db_unfetched):        1132
status 2 (db_fetched):  13956
status 3 (db_gone):     3629
status 5 (db_redir_perm):       3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-5/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-5/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-5/crawldb
TOTAL urls:     19462
retry 0:        19212
retry 4:        1
retry 5:        2
retry 6:        247
min score:      0.0
avg score:      0.0020
max score:      41.205
status 1 (db_unfetched):        20
status 2 (db_fetched):  15287
status 3 (db_gone):     4152
status 5 (db_redir_perm):       3
CrawlDb statistics: done

Using org.apache.nutch.protocol.http:
------------------------------------

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-1/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-1/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-1/crawldb
TOTAL urls:     13117
retry 0:        13055
retry 1:        8
retry 2:        4
retry 3:        6
retry 4:        13
retry 5:        10
retry 6:        21
min score:      0.0
avg score:      0.0
max score:      1.038
status 1 (db_unfetched):        5744
status 2 (db_fetched):  6268
status 3 (db_gone):     1105
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-2/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-2/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-2/crawldb
TOTAL urls:     17753
retry 0:        17627
retry 1:        19
retry 2:        29
retry 3:        11
retry 4:        1
retry 5:        3
retry 6:        63
min score:      0.0
avg score:      0.0
max score:      7.03
status 1 (db_unfetched):        2997
status 2 (db_fetched):  12070
status 3 (db_gone):     2684
status 5 (db_redir_perm):       2
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-3/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-3/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-3/crawldb
TOTAL urls:     19285
retry 0:        19085
retry 3:        11
retry 4:        23
retry 5:        16
retry 6:        150
min score:      0.0
avg score:      0.0020
max score:      29.186
status 1 (db_unfetched):        86
status 2 (db_fetched):  15160
status 3 (db_gone):     4036
status 5 (db_redir_perm):       3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-4/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-4/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-4/crawldb
TOTAL urls:     19460
retry 0:        19260
retry 6:        200
min score:      0.0
avg score:      0.0020
max score:      42.213
status 1 (db_unfetched):        18
status 2 (db_fetched):  15271
status 3 (db_gone):     4168
status 5 (db_redir_perm):       3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-5/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-5/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-5/crawldb
TOTAL urls:     19457
retry 0:        19257
retry 6:        200
min score:      0.0
avg score:      0.0020
max score:      35.182
status 1 (db_unfetched):        21
status 2 (db_fetched):  15273
status 3 (db_gone):     4160
status 5 (db_redir_perm):       3
CrawlDb statistics: done
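
In case it helps anyone compare the runs above, here is a minimal Python sketch that pulls the counters out of saved `readdb -stats` text and checks that the status counts add up to the total. It is only an illustration: the names parse_stats, status_sum and SAMPLE are my own (not part of Nutch), and it assumes the output format is exactly what is pasted above.

```python
import re

# Matches counter lines such as "TOTAL urls:     12631",
# "retry 0:        12571" or "status 2 (db_fetched):  5506".
# Score lines (min/avg/max) are deliberately not matched.
LINE_RE = re.compile(
    r"^\s*(TOTAL urls|retry \d+|status \d+ \((\w+)\)):\s+(\d+)\s*$"
)


def parse_stats(text):
    """Return a dict of integer counters found in `readdb -stats` text.

    Status lines are keyed by their name in parentheses (e.g. "db_fetched");
    all other lines are keyed by their label (e.g. "TOTAL urls", "retry 0").
    """
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        key = match.group(2) or match.group(1)
        stats[key] = int(match.group(3))
    return stats


def status_sum(stats):
    """Sum every status counter (keys that start with "db_")."""
    return sum(v for k, v in stats.items() if k.startswith("db_"))


# Copied from the httpclient crawl-onion-1 output above.
SAMPLE = """\
TOTAL urls:     12631
retry 0:        12571
retry 1:        24
retry 2:        19
retry 3:        1
retry 4:        2
retry 5:        1
retry 6:        13
min score:      0.0
avg score:      0.0
max score:      1.038
status 1 (db_unfetched):        6301
status 2 (db_fetched):  5506
status 3 (db_gone):     824
"""

stats = parse_stats(SAMPLE)
print(stats["TOTAL urls"], status_sum(stats))
```

For the first httpclient run the status counts (6301 + 5506 + 824) do add up to the reported total of 12631, so the totals are at least internally consistent; the difference between runs is in how many URLs ended up in the CrawlDb at all.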

 

 

In advance, thank you for your time and support.

Cheers,
Audrey

 

-- 
View this message in context: 
http://www.nabble.com/Different-results-for-consecutive-crawls-tf4214719.html#a11990695
Sent from the Nutch - User mailing list archive at Nabble.com.

