[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346265#comment-14346265 ]
Jorge Luis Betancourt Gonzalez commented on NUTCH-1949:
-------------------------------------------------------

+1

> Dump out the Nutch data into the Common Crawl format
> ----------------------------------------------------
>
>                 Key: NUTCH-1949
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1949
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>         Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf
>
>
> We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
> # deserialize the crawled data from Nutch
> # map the serialized data onto the proper JSON structure
> # serialize the data into [CBOR|http://cbor.io] format
> # optionally, compress the serialized data using {{gzip}}
> This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.
> Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
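
For readers following the issue, steps 2 to 4 of the description (map to a key/value structure, serialize to CBOR, optionally gzip) can be sketched roughly as below. This is only an illustration under stated assumptions, not the {{CommonCrawlDataDumper}} from the attachments: the class name {{CborDumpSketch}}, its methods, and the field names {{url}} and {{content}} are hypothetical, the CBOR encoder is hand-written and handles only maps of text strings, and step 1 (deserializing Nutch segments) is left out entirely.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch; not the CommonCrawlDataDumper proposed in NUTCH-1949. */
public class CborDumpSketch {

  // Writes a CBOR item head: major type in the top 3 bits, followed by the
  // length argument in the shortest form that can hold it.
  static void writeHead(ByteArrayOutputStream out, int majorType, int length) {
    int mt = majorType << 5;
    if (length < 24) {
      out.write(mt | length);
    } else if (length < 0x100) {
      out.write(mt | 24);
      out.write(length);
    } else if (length < 0x10000) {
      out.write(mt | 25);
      out.write(length >> 8);
      out.write(length);
    } else {
      out.write(mt | 26);
      out.write(length >>> 24);
      out.write(length >> 16);
      out.write(length >> 8);
      out.write(length);
    }
  }

  // Major type 3: UTF-8 text string.
  static void writeText(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeHead(out, 3, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // Major type 5: map. Only text keys and text values are supported here.
  static byte[] toCbor(Map<String, String> record) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeHead(out, 5, record.size());
    for (Map.Entry<String, String> e : record.entrySet()) {
      writeText(out, e.getKey());
      writeText(out, e.getValue());
    }
    return out.toByteArray();
  }

  // Optional last step: gzip the serialized bytes.
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      gz.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) {
    // Stand-in for one crawled page; the field names are assumptions.
    Map<String, String> record = new LinkedHashMap<>();
    record.put("url", "http://example.com/");
    record.put("content", "<html></html>");

    byte[] cbor = toCbor(record);
    byte[] compressed = gzip(cbor);
    System.out.println(cbor.length + " CBOR bytes, "
        + compressed.length + " gzipped bytes");
  }
}
```

A real implementation would more likely rely on an existing CBOR library and on the Nutch segment readers; the hand-rolled encoder is used here only to keep the sketch free of external dependencies.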