[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Emmanuel Joke updated NUTCH-528:
--------------------------------

    Attachment: NUTCH-528.patch

Patch attached.

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch
>
>
> * I've improved the stats to list the number of urls by status and by
> host. This option is not mandatory.
> For instance, if you set the sortByHost option, it will show:
>
>     bin/nutch readdb crawl/crawldb -stats sortByHost
>
>     CrawlDb statistics start: crawl/crawldb
>     Statistics for CrawlDb: crawl/crawldb
>     TOTAL urls: 36
>     retry 0: 36
>     min score: 0.0020
>     avg score: 0.059
>     max score: 1.0
>     status 1 (db_unfetched): 33
>        www.yahoo.com : 33
>     status 2 (db_fetched): 3
>        www.yahoo.com : 3
>     CrawlDb statistics: done
>
> Without this option the stats are unchanged.
>
> * I've added a new option to dump the crawldb in CSV format. This makes it
> easy to import the file into Excel and compute more complex statistics.
>
>     bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
>
> Extract of the file:
>
>     Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
>     "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
>     "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
>     "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
>
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by
> Andrzej.
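As a rough illustration of how the CSV dump could be processed outside Nutch, the sketch below reads the semicolon-separated extract quoted above and counts URLs by status and host, similar in spirit to the sortByHost stats. It is not part of the patch. The function name `count_by_status_and_host` is hypothetical, and the code assumes only that the first three columns are Url, Status code and Status name, as in the extract; it reads those by position because the sample rows carry one more field than the header names.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit

# Rows copied from the extract in the message; a real dump would be
# read from a file inside the dump folder instead.
SAMPLE = '''Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
"http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
"http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
'''


def count_by_status_and_host(lines):
    """Count rows per (status code, status name, host).

    Hypothetical helper: assumes columns 0..2 are Url, Status code and
    Status name, and ignores every other column.
    """
    reader = csv.reader(lines, delimiter=";", quotechar='"')
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        url, code, name = row[0], row[1], row[2]
        host = urlsplit(url).hostname or ""
        counts[(int(code), name, host)] += 1
    return counts


if __name__ == "__main__":
    counts = count_by_status_and_host(io.StringIO(SAMPLE))
    for (code, name, host), n in sorted(counts.items()):
        print(f"status {code} ({name}) {host}: {n}")
```

To run it against an actual dump, pass an open file handle in place of the `io.StringIO` wrapper; the file name inside FOLDER is not stated in the message, so it is left to the reader.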