Jorge Luis Betancourt Gonzalez created NUTCH-2095:
-----------------------------------------------------

             Summary: WARC exporter for the CommonCrawlDataDumper
                 Key: NUTCH-2095
                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
             Project: Nutch
          Issue Type: Improvement
          Components: commoncrawl, tool
    Affects Versions: 1.11
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor


Adds the possibility of exporting Nutch segments to WARC files.

From the usage point of view, a couple of new command line options are available:

{{-warc}}: enables exporting to WARC files; if not specified, the default JACKSON formatter is used.
{{-warcSize}}: sets a maximum file size for each WARC file; if not specified, a default of 1GB per file is used, as recommended by the WARC ISO standard.

The usual {{-gzip}} flag can be used to enable compression of the WARC files.

Some changes were made to the default {{CommonCrawlDataDumper}}, essentially to the Factory and the Formats. These changes avoid creating a new instance of a {{CommonCrawlFormat}} for each URL read from the segments.
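
To illustrate the per-file size limit described for {{-warcSize}}, here is a minimal Java sketch of how records might be assigned to numbered output files. It is not the code from the patch: the class WarcFileRoller, its method assign, and the reading of "1GB" as 10^9 bytes are all assumptions made for illustration.

```java
/**
 * Hypothetical helper (not part of Nutch): tracks how many bytes have gone
 * into the current output file and rolls over to a new file index when the
 * next record would push the file past the configured maximum.
 */
final class WarcFileRoller {
    /** Assumed default of 1GB per file, taken here as 10^9 bytes. */
    static final long DEFAULT_MAX_BYTES = 1_000_000_000L;

    private final long maxBytes;
    private long bytesInCurrentFile = 0;
    private int fileIndex = 0;

    WarcFileRoller(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /**
     * Returns the index of the file that should receive a record of the
     * given size, rolling over first if the record does not fit.
     */
    int assign(long recordBytes) {
        // Roll over only when the current file already holds data, so a
        // single oversized record still lands in a file of its own.
        if (bytesInCurrentFile > 0 && bytesInCurrentFile + recordBytes > maxBytes) {
            fileIndex++;
            bytesInCurrentFile = 0;
        }
        bytesInCurrentFile += recordBytes;
        return fileIndex;
    }
}
```

In this sketch a record is never split across files, so a file can exceed the limit only when a single record is larger than the limit by itself. Whether the actual exporter behaves the same way is not stated in this issue.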