[ https://issues.apache.org/jira/browse/NUTCH-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann reassigned NUTCH-1975:
----------------------------------------

    Assignee: Chris A. Mattmann

> New configuration for CommonCrawlDataDumper tool
> ------------------------------------------------
>
>                 Key: NUTCH-1975
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1975
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>    Affects Versions: 1.9
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-1975.patch
>
>
> Hi all,
> attached is a new patch that adds support for new options for
> {{CommonCrawlDataDumper}}.
> The new options are passed to the {{CommonCrawlFormat}} object (which
> provides methods to create JSON output) through a configuration object
> ({{CommonCrawlConfig}}).
> With this patch, {{CommonCrawlDataDumper}} supports the following options:
> * {{-SimpleDataFormat}}: enables timestamps in GMT epoch (milliseconds)
> format.
> * {{-epochFilename}}: extracted files are organized in a reversed-DNS tree
> based on the FQDN of the webpage, followed by a SHA1 hash of the complete
> URL. Scraped data is stored in these directories as individual
> GMT-timestamped files, named with the epoch time (in milliseconds) plus the
> file extension.
> * {{-jsonArray}}: organizes both request and response headers into a JSON
> array instead of a JSON sub-object.
> * {{-reverseKey}}: uses the same layout as described for the
> {{-epochFilename}} option, with underscores in place of directory
> separators.
> You can use the options above in addition to the options already supported,
> as described on the [Nutch
> wiki|https://wiki.apache.org/nutch/CommonCrawlDataDumper] page.
> This patch builds on
> [NUTCH-1974|https://issues.apache.org/jira/browse/NUTCH-1974].
> Thanks [~chrismattmann] and [~annieburgess] for supporting me on this work.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
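
To illustrate the layout described for {{-epochFilename}} and {{-reverseKey}}, here is a minimal Java sketch. It is not taken from the patch: the class name `LayoutSketch`, its method names, the lowercase hex encoding of the hash, and the exact way the path segments are joined are all assumptions made for illustration. The actual output of {{CommonCrawlDataDumper}} may differ in detail.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: sketches the naming scheme only.
final class LayoutSketch {

    // Reverses the labels of a host name: "www.example.com" becomes "com/example/www"
    // when sep is "/".
    static String reverseHost(String host, String sep) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append(sep);
            }
        }
        return sb.toString();
    }

    // SHA-1 digest of the string, as lowercase hex (the encoding is an assumption).
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Assumed -epochFilename layout: <reversed FQDN>/<SHA1 of URL>/<epoch millis>.<ext>
    static String epochPath(String url, long epochMillis, String ext) {
        String host = URI.create(url).getHost();
        return reverseHost(host, "/") + "/" + sha1Hex(url) + "/" + epochMillis + "." + ext;
    }

    // Assumed -reverseKey form: the same layout with '_' instead of directory separators.
    static String reverseKey(String url, long epochMillis, String ext) {
        return epochPath(url, epochMillis, ext).replace('/', '_');
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html";
        long now = System.currentTimeMillis();
        System.out.println(epochPath(url, now, "html"));
        System.out.println(reverseKey(url, now, "html"));
    }
}
```

Under these assumptions, a page from www.example.com would be written below `com/example/www/`, in a directory named after the SHA1 hash of its full URL, and the reversed-key variant would flatten the same segments into a single underscore-separated name.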