[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--------------------------------------------------
    Attachment: NUTCH-2095.patch

> WARC exporter for the CommonCrawlDataDumper
> -------------------------------------------
>
>                 Key: NUTCH-2095
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, tool
>    Affects Versions: 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: tools, warc
>         Attachments: NUTCH-2095.patch
>
>
> Adds the ability to export Nutch segments to WARC files.
> From the usage point of view, a couple of new command-line options are
> available:
> {{-warc}}: enables exporting to WARC files; if not specified, the default
> JACKSON formatter is used.
> {{-warcSize}}: sets the maximum file size for each WARC file; if not
> specified, a default of 1GB per file is used, as recommended by the WARC
> ISO standard.
> The usual {{-gzip}} flag can be used to enable compression of the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially
> to the Factory and to the Formats. These changes avoid creating a new
> instance of a {{CommonCrawlFormat}} for each URL read from the segments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
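
The size cap described for {{-warcSize}} can be pictured with a minimal Java sketch. This is not the code from NUTCH-2095.patch: the class {{RollingWriter}} and its methods are hypothetical names invented for illustration, in-memory buffers stand in for WARC files, and the rule that a record is never split across files is an assumption. The only behaviour taken from the issue is that each output file has a maximum size (1GB by default, per the description).

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of size-based rollover: start a new output "file"
 * whenever appending the next record would exceed the configured maximum.
 * In-memory buffers stand in for real WARC files.
 */
class RollingWriter {
    private final long maxBytes;
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;

    RollingWriter(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Appends one record; assumption: a record is never split across files. */
    void write(byte[] record) {
        boolean wouldOverflow = current != null
                && current.size() > 0
                && current.size() + (long) record.length > maxBytes;
        if (current == null || wouldOverflow) {
            current = new ByteArrayOutputStream();
            files.add(current);
        }
        current.write(record, 0, record.length);
    }

    int fileCount() {
        return files.size();
    }

    int sizeOf(int index) {
        return files.get(index).size();
    }

    public static void main(String[] args) {
        // Tiny limit so the rollover is visible; the issue's default is 1GB.
        RollingWriter writer = new RollingWriter(10);
        for (int i = 0; i < 5; i++) {
            writer.write(new byte[4]);
        }
        System.out.println("files written: " + writer.fileCount());
    }
}
```

A single writer instance is reused for every record here, in the same spirit as the issue's note about not creating a new format instance for each URL.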