CrawlDbReader: add some new stats + dump into a csv format
----------------------------------------------------------

                 Key: NUTCH-528
                 URL: https://issues.apache.org/jira/browse/NUTCH-528
             Project: Nutch
          Issue Type: Improvement
         Environment: Java 1.6, Linux 2.6
            Reporter: Emmanuel Joke
            Assignee: Emmanuel Joke
            Priority: Minor
             Fix For: 1.0.0


* I've improved the stats so they can list the number of URLs by status and by 
host. This is optional.

For instance, if you set the sortByHost option, it will show:
bin/nutch readdb crawl/crawldb -stats sortByHost
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     36
retry 0:        36
min score:      0.0020
avg score:      0.059
max score:      1.0
status 1 (db_unfetched):        33
   www.yahoo.com :      33
status 2 (db_fetched):  3
   www.yahoo.com :      3
CrawlDb statistics: done

Without this option, the stats output is unchanged.
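
The per-status, per-host breakdown shown above can be sketched in a few lines. This is only an illustration of the grouping idea, not the code in the patch: the function names and the (url, status name) input format are assumptions made for the example, and the real CrawlDbReader works on CrawlDatum records through a MapReduce job.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def stats_by_status_and_host(entries):
    """Group (url, status_name) pairs by status, then count URLs per host.

    Hypothetical helper: the input format is an assumption for this sketch.
    """
    per_status = defaultdict(Counter)
    for url, status in entries:
        # Fall back to a placeholder when the URL has no parsable host.
        host = urlsplit(url).hostname or "(unknown)"
        per_status[status][host] += 1
    return per_status


def format_stats(per_status):
    """Render the counts as text lines, one status followed by its hosts."""
    lines = []
    for status in sorted(per_status):
        hosts = per_status[status]
        lines.append(f"status ({status}):\t{sum(hosts.values())}")
        for host, count in sorted(hosts.items()):
            lines.append(f"   {host} :\t{count}")
    return lines


if __name__ == "__main__":
    sample = [
        ("http://www.yahoo.com/", "db_unfetched"),
        ("http://www.yahoo.com/help.html", "db_unfetched"),
        ("http://www.yahoo.com/contacts.html", "db_fetched"),
    ]
    print("\n".join(format_stats(stats_by_status_and_host(sample))))
```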

* I've added a new option to dump the crawldb in CSV format. This makes it 
easy to import the file into Excel and compute more complex statistics.
bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
Extract of the file:
Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
"http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
"http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
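
As an alternative to Excel, a dump like the extract above could be read with a short script. The sketch below assumes each record sits on one line (the wrapping in this message comes from the mail client) and that fields are separated by semicolons with optional double quotes. The sample rows carry one more field than the header names, so the sketch indexes fields by position instead of by header name. The function name count_by_status is hypothetical.

```python
import csv
import io
from collections import Counter

# Sample data copied from the extract, one record per line.
SAMPLE = (
    "Url;Status code;Status name;Fetch Time;Modified Time;"
    "Retries since fetch;Retry interval;Score;Signature;Metadata\n"
    '"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;'
    'Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"\n'
    '"http://www.yahoo.com/help.html";1;"db_unfetched";'
    "Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;"
    '30.0;0.0032467535;"null";"null"\n'
)


def count_by_status(lines):
    """Count records per status name (third field) in a semicolon-separated dump."""
    reader = csv.reader(lines, delimiter=";", quotechar='"')
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        counts[row[2]] += 1
    return counts


if __name__ == "__main__":
    print(count_by_status(io.StringIO(SAMPLE)))
```

To run this against a real dump, pass an open file object instead of the in-memory sample; the file name inside the dump folder depends on how the job writes its output and is not stated here.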

* I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.
