[ https://issues.apache.org/jira/browse/NUTCH-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1526:
----------------------------------------
    Labels: content dumper memex nutch tools  (was: content dumper nutch tools)

> Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1526
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1526
>             Project: Nutch
>          Issue Type: New Feature
>          Components: storage
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: content, dumper, memex, nutch, tools
>             Fix For: 1.10
>
>         Attachments: NUTCH-1526.Mattmann.090514.patch.txt
>
>
> It only took me 1.2 years, but I finally got around to it. This patch will
> deliver a SegmentContentDumper tool per the description here:
> http://s.apache.org/kv
> And per the interface here:
> {noformat}
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>    -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
>    -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB in each segment
>    -outputDir        The output directory to write file names to.
>    -metadata         --key=value where key is a Content Metadata key and value is a value to check.
> {noformat}
> If the URL and its content metadata have a matching key,value pair, dump it.
> Allow for regex matching on the value.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)