[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851238#action_12851238 ]
Hudson commented on NUTCH-784:
------------------------------

Integrated in Nutch-trunk #1111 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1111/]) : CrawlDBScanner

> CrawlDBScanner
> ---------------
>
>                 Key: NUTCH-784
>                 URL: https://issues.apache.org/jira/browse/NUTCH-784
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.1
>
>         Attachments: NUTCH-784.patch
>
>
> The patch file contains a utility which dumps all the entries matching a
> regular expression on their URL. The dump mechanism of the crawldb reader is
> not very useful on large crawldbs as the output can be extremely large and
> the -url function can't help if we don't know what url we want to have a
> look at.
> The CrawlDBScanner can either generate a text representation of the
> CrawlDatum-s or binary objects which can then be used as a new CrawlDB.
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] <-text>
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries e.g. db_fetched,
> db_unfetched
> -text : if this parameter is used, the output will be of TextOutputFormat;
> otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
> for instance the command below :
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.*
> -s db_fetched -text
> will generate a text file /tmp/amazon-dump containing all the entries of the
> crawldb matching the regexp .+amazon.com.* and having a status of db_fetched

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
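
For readers who want to see the selection rule from the issue description in isolation, here is a minimal sketch in plain Java. It is not the code from NUTCH-784.patch: the class name CrawlDbScanSketch, the scan method, and the use of a Map of URL to status string in place of a real crawldb are all assumptions made for illustration. It also assumes the regex must match the whole URL key, which the description does not state explicitly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the filtering described in the issue; a real
// crawldb stores CrawlDatum objects, here reduced to a status string.
public class CrawlDbScanSketch {

    /**
     * Returns the entries whose URL key matches the regex in full and,
     * when a status is given, whose status equals it. A null status
     * means "no status constraint" (the -s option was not used).
     */
    public static Map<String, String> scan(Map<String, String> crawlDb,
                                           String regex, String status) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : crawlDb.entrySet()) {
            boolean urlMatches = pattern.matcher(entry.getKey()).matches();
            boolean statusMatches = status == null || status.equals(entry.getValue());
            if (urlMatches && statusMatches) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample entries, not real crawl data.
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://www.amazon.com/books", "db_fetched");
        crawlDb.put("http://www.amazon.com/music", "db_unfetched");
        crawlDb.put("http://example.org/", "db_fetched");

        // Same regex and status as the example command in the issue.
        Map<String, String> hits = scan(crawlDb, ".+amazon.com.*", "db_fetched");
        hits.forEach((url, st) -> System.out.println(url + "\t" + st));
    }
}
```

Note that the dot in amazon.com is unescaped in the example regex, so it matches any character; writing amazon\.com would restrict the match to a literal dot.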