CrawlDbReader: add some new stats + dump into a CSV format
----------------------------------------------------------
Key: NUTCH-528
URL: https://issues.apache.org/jira/browse/NUTCH-528
Project: Nutch
Issue Type: Improvement
Environment: Java 1.6, Linux 2.6
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0

* I've improved the stats to list the number of URLs by status and by host. This is optional. For instance, if you set the sortByHost option, it will show:

    bin/nutch readdb crawl/crawldb -stats sortByHost

    CrawlDb statistics start: crawl/crawldb
    Statistics for CrawlDb: crawl/crawldb
    TOTAL urls: 36
    retry 0: 36
    min score: 0.0020
    avg score: 0.059
    max score: 1.0
    status 1 (db_unfetched): 33
       www.yahoo.com : 33
    status 2 (db_fetched): 3
       www.yahoo.com : 3
    CrawlDb statistics: done

Of course, without this option the stats are unchanged.

* I've added a new option to dump the crawldb in CSV format. It will then be easy to import the file into Excel and compute more complex statistics.

    bin/nutch readdb crawl/crawldb -dump FOLDER toCsv

Extract of the file:

    Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
    "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
    "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
    "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"

* I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
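As an alternative to Excel, the CSV dump described in this issue could be post-processed with a short script. The following is only a minimal sketch in Python, not part of the patch: it assumes the dump is semicolon-delimited with double-quoted strings, as in the extract above, and the function name `count_by_status_and_host` is hypothetical. Because the sample rows carry one more field than the header names, the sketch relies only on the first three columns (URL, status code, status name).

```python
import csv
import io
from collections import Counter, defaultdict
from urllib.parse import urlparse


def count_by_status_and_host(lines):
    """Return {(status_code, status_name): Counter({host: count})}.

    `lines` is any iterable of text lines from the CSV dump
    (an open file or an io.StringIO). Hypothetical helper, for illustration.
    """
    reader = csv.reader(lines, delimiter=";", quotechar='"')
    next(reader, None)  # skip the header line
    counts = defaultdict(Counter)
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        url, code, name = row[0], row[1], row[2]
        host = urlparse(url).hostname or ""
        counts[(int(code), name)][host] += 1
    return counts


# The three rows quoted in the issue, used here as sample input.
SAMPLE = (
    "Url;Status code;Status name;Fetch Time;Modified Time;"
    "Retries since fetch;Retry interval;Score;Signature;Metadata\n"
    '"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;'
    'Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"\n'
    '"http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;'
    'Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"\n'
    '"http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;'
    'Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"\n'
)

if __name__ == "__main__":
    result = count_by_status_and_host(io.StringIO(SAMPLE))
    for (code, name), hosts in sorted(result.items()):
        print(f"status {code} ({name}):\t{sum(hosts.values())}")
        for host, n in hosts.most_common():
            print(f"   {host} :\t{n}")
```

For the three sample rows, the result holds a single status, 1 (db_unfetched), with three URLs under www.yahoo.com. To run it against a real dump, pass an open file instead of the `io.StringIO` wrapper; the file name inside FOLDER is not given in this issue, so it is left to the reader.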