[jira] Commented: (NUTCH-784) CrawlDBScanner

Andrzej Bialecki (JIRA) Mon, 29 Mar 2010 05:29:53 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896
 ]


Andrzej Bialecki  commented on NUTCH-784:
-----------------------------------------

This should have been reviewed first. I don't question the usefulness of this 
class, but I think it should have been added as an option to CrawlDbReader. 
As it stands, we get a new tool with a cryptic name whose function is a 
variant of an existing tool's...

> CrawlDBScanner 
> ---------------
>
>                 Key: NUTCH-784
>                 URL: https://issues.apache.org/jira/browse/NUTCH-784
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.1
>
>         Attachments: NUTCH-784.patch
>
>
> The patch file contains a utility which dumps all the entries whose URL 
> matches a regular expression. The dump mechanism of the crawldb reader is 
> not very useful on large crawldbs, as the output can be extremely large, and 
> the -url function can't help if we don't know which URL we want to have a 
> look at.
> The CrawlDBScanner can either generate a text representation of the 
> CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] <-text>
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
> db_unfetched
> -text : if this parameter is used, the output will be of TextOutputFormat; 
> otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
> For instance, the command below:
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
> -s db_fetched -text
> will generate a text file /tmp/amazon-dump containing all the entries of the 
> crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.
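
The filtering described in the quoted issue text (a regular expression on the
crawldb key plus an optional status constraint) can be sketched in plain Java.
This is only a minimal illustration, not the code from NUTCH-784.patch: the
class name CrawlDbScanSketch, the scan method, and the Map of URL to status
string standing in for a real crawldb are all assumptions made for the example,
as is the choice to require the regex to match the whole URL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the filtering step; not the code from the patch.
class CrawlDbScanSketch {

    // Returns the URLs that match the regex in full and, when a status is
    // given (non-null), whose status string equals it.
    static List<String> scan(Map<String, String> crawlDb, String regex, String status) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : crawlDb.entrySet()) {
            boolean urlMatches = pattern.matcher(entry.getKey()).matches();
            boolean statusMatches = status == null || status.equals(entry.getValue());
            if (urlMatches && statusMatches) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Toy data: URL -> status, in place of CrawlDatum entries.
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://www.amazon.com/books", "db_fetched");
        crawlDb.put("http://www.amazon.com/music", "db_unfetched");
        crawlDb.put("http://example.org/", "db_fetched");

        for (String url : scan(crawlDb, ".+amazon.com.*", "db_fetched")) {
            System.out.println(url);
        }
    }
}
```

Run against the three sample entries, this prints only
http://www.amazon.com/books. Note that the unescaped dots in the example regex
match any character; a stricter pattern would escape them as amazon\.com.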

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
