Hello,

I am trying to determine which protocol plugin to use in Nutch: http or httpclient. To compare them, I ran 5 consecutive crawls with each plugin, using the same configuration every time. The crawls targeted a static site that I set up on another machine in my LAN. The site has approximately 28,000 pages.
Here's an example call for the crawl:

% ./bin/nutch crawl urls/onion.txt -dir ~/nutch/http-expt/crawl-onion-1 -depth 20

However, I found that each crawl (even if it had the same config and protocol plugin) resulted in a different number of URLs being processed. I was expecting Nutch to return similar results for every crawl run.

Is this a known behavior of Nutch? Is there any reason why Nutch would produce such results?

=========================================

The following are the results of the crawl experiments:

Using org.apache.nutch.protocol.httpclient:
------------------------------------------

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-1/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-1/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-1/crawldb
TOTAL urls: 12631
retry 0: 12571
retry 1: 24
retry 2: 19
retry 3: 1
retry 4: 2
retry 5: 1
retry 6: 13
min score: 0.0
avg score: 0.0
max score: 1.038
status 1 (db_unfetched): 6301
status 2 (db_fetched): 5506
status 3 (db_gone): 824
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-2/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-2/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-2/crawldb
TOTAL urls: 15240
retry 0: 15106
retry 1: 31
retry 2: 36
retry 3: 16
retry 4: 12
retry 5: 4
retry 6: 35
min score: 0.0
avg score: 0.0
max score: 1.039
status 1 (db_unfetched): 4456
status 2 (db_fetched): 8978
status 3 (db_gone): 1805
status 5 (db_redir_perm): 1
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-3/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-3/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-3/crawldb
TOTAL urls: 17893
retry 0: 17704
retry 1: 23
retry 2: 20
retry 3: 18
retry 4: 23
retry 5: 25
retry 6: 80
min score: 0.0
avg score: 0.0010
max score: 14.035
status 1 (db_unfetched): 2595
status 2 (db_fetched): 12434
status 3 (db_gone): 2861
status 5 (db_redir_perm): 3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-4/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-4/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-4/crawldb
TOTAL urls: 18720
retry 0: 18505
retry 1: 8
retry 2: 4
retry 4: 1
retry 5: 2
retry 6: 200
min score: 0.0
avg score: 0.0020
max score: 36.198
status 1 (db_unfetched): 1132
status 2 (db_fetched): 13956
status 3 (db_gone): 3629
status 5 (db_redir_perm): 3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/httpclient-expt/crawl-onion-5/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/httpclient-expt/crawl-onion-5/crawldb
Statistics for CrawlDb: /home/purshah/nutch/httpclient-expt/crawl-onion-5/crawldb
TOTAL urls: 19462
retry 0: 19212
retry 4: 1
retry 5: 2
retry 6: 247
min score: 0.0
avg score: 0.0020
max score: 41.205
status 1 (db_unfetched): 20
status 2 (db_fetched): 15287
status 3 (db_gone): 4152
status 5 (db_redir_perm): 3
CrawlDb statistics: done

Using org.apache.nutch.protocol.http:
------------------------------------

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-1/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-1/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-1/crawldb
TOTAL urls: 13117
retry 0: 13055
retry 1: 8
retry 2: 4
retry 3: 6
retry 4: 13
retry 5: 10
retry 6: 21
min score: 0.0
avg score: 0.0
max score: 1.038
status 1 (db_unfetched): 5744
status 2 (db_fetched): 6268
status 3 (db_gone): 1105
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-2/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-2/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-2/crawldb
TOTAL urls: 17753
retry 0: 17627
retry 1: 19
retry 2: 29
retry 3: 11
retry 4: 1
retry 5: 3
retry 6: 63
min score: 0.0
avg score: 0.0
max score: 7.03
status 1 (db_unfetched): 2997
status 2 (db_fetched): 12070
status 3 (db_gone): 2684
status 5 (db_redir_perm): 2
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-3/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-3/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-3/crawldb
TOTAL urls: 19285
retry 0: 19085
retry 3: 11
retry 4: 23
retry 5: 16
retry 6: 150
min score: 0.0
avg score: 0.0020
max score: 29.186
status 1 (db_unfetched): 86
status 2 (db_fetched): 15160
status 3 (db_gone): 4036
status 5 (db_redir_perm): 3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-4/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-4/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-4/crawldb
TOTAL urls: 19460
retry 0: 19260
retry 6: 200
min score: 0.0
avg score: 0.0020
max score: 42.213
status 1 (db_unfetched): 18
status 2 (db_fetched): 15271
status 3 (db_gone): 4168
status 5 (db_redir_perm): 3
CrawlDb statistics: done

[EMAIL PROTECTED] NutchTest]$ ./bin/nutch readdb ~/nutch/http-expt/crawl-onion-5/crawldb -stats
CrawlDb statistics start: /home/purshah/nutch/http-expt/crawl-onion-5/crawldb
Statistics for CrawlDb: /home/purshah/nutch/http-expt/crawl-onion-5/crawldb
TOTAL urls: 19457
retry 0: 19257
retry 6: 200
min score: 0.0
avg score: 0.0020
max score: 35.182
status 1 (db_unfetched): 21
status 2 (db_fetched): 15273
status 3 (db_gone): 4160
status 5 (db_redir_perm): 3
CrawlDb statistics: done

In advance, thank you for your time and support.
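In case it helps anyone compare the runs side by side, here is a minimal Python sketch for turning one block of `readdb -stats` output into numbers. It assumes the output has one "label: value" pair per line, as the tool printed it for me; the function name `parse_crawldb_stats` and the `SAMPLE` string (copied from the first httpclient run) are just illustrative names, not part of Nutch.

```python
import re

# Sample copied from the first httpclient run above (crawl-onion-1).
SAMPLE = """\
TOTAL urls: 12631
retry 0: 12571
retry 1: 24
retry 2: 19
retry 3: 1
retry 4: 2
retry 5: 1
retry 6: 13
min score: 0.0
avg score: 0.0
max score: 1.038
status 1 (db_unfetched): 6301
status 2 (db_fetched): 5506
status 3 (db_gone): 824
"""


def parse_crawldb_stats(text):
    """Parse `readdb -stats` text into a dict.

    Assumption: one "label: value" pair per line. Lines that match none
    of the patterns (paths, scores, "done") are ignored.
    """
    stats = {"total": None, "retry": {}, "status": {}}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"TOTAL urls:\s*(\d+)$", line)
        if m:
            stats["total"] = int(m.group(1))
            continue
        m = re.match(r"retry (\d+):\s*(\d+)$", line)
        if m:
            stats["retry"][int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"status \d+ \((\w+)\):\s*(\d+)$", line)
        if m:
            stats["status"][m.group(1)] = int(m.group(2))
    return stats


stats = parse_crawldb_stats(SAMPLE)
fetched_fraction = stats["status"]["db_fetched"] / stats["total"]
print("total:", stats["total"])
print("fetched fraction: %.3f" % fetched_fraction)
```

Running the same function over each of the ten outputs would give comparable totals and fetched fractions per run; a quick sanity check is that the status counts (and the retry counts) each add up to the TOTAL.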
Cheers,
Audrey

--
View this message in context: http://www.nabble.com/Different-results-for-consecutive-crawls-tf4214719.html#a11990695
Sent from the Nutch - User mailing list archive at Nabble.com.
