[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Giuseppe Totaro updated NUTCH-1949:
-----------------------------------
    Attachment: CommonCrawlDataDumper_v02.pdf
                CommonCrawlDataDumper.xlsx

Hi all,
I have revised the workflow diagram of the {{CommonCrawlDataDumper}} tool. I have also attached an Excel file that lists some open issues with this tool. Please read it (especially the "Feature" sheet) and give me your feedback. I will then explore how this tool can be implemented as a plugin, as suggested by [~jnioche].
Thanks [~chrismattmann] and [~lewismc] for your invaluable support.

> Dump out the Nutch data into the Common Crawl format
> ----------------------------------------------------
>
>                 Key: NUTCH-1949
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1949
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>         Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx,
> CommonCrawlDataDumper_v02.pdf
>
>
> We are going to develop a {{CommonCrawlDataDumper.java}} class. The
> {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
> # deserialize the crawled data from Nutch
> # map the deserialized data onto the proper JSON structure
> # serialize the data into [CBOR|http://cbor.io] format
> # optionally, compress the serialized data using {{gzip}}
> This tool has to be able to work with either a single Nutch segment or a
> directory containing segments as input data.
> Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support
> and code.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
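
For readers who want a concrete picture of the four steps listed in the issue description, here is a minimal sketch in Java. It is not the actual {{CommonCrawlDataDumper}} code. The class name {{CborDumpSketch}} and the field names {{url}} and {{content}} are hypothetical, and an in-memory map stands in for data read from a Nutch segment. The tiny CBOR encoder is hand-rolled and only handles a map of text strings; a real tool would more likely rely on an existing CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the real CommonCrawlDataDumper class.
class CborDumpSketch {

    // Writes a CBOR item head: major type in the top 3 bits,
    // followed by the (non-negative) length argument.
    static void writeHead(ByteArrayOutputStream out, int majorType, int value) {
        int mt = majorType << 5;
        if (value < 24) {
            out.write(mt | value);
        } else if (value < 0x100) {
            out.write(mt | 24);
            out.write(value);
        } else if (value < 0x10000) {
            out.write(mt | 25);
            out.write(value >> 8);
            out.write(value);
        } else {
            out.write(mt | 26);
            out.write(value >>> 24);
            out.write(value >>> 16);
            out.write(value >>> 8);
            out.write(value);
        }
    }

    // Encodes a UTF-8 text string (CBOR major type 3).
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Step 3: serialize a flat JSON-like record as a CBOR map (major type 5).
    static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Step 4 (optional): compress the serialized bytes with gzip.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Steps 1 and 2: a hand-built record stands in for data deserialized
        // from a Nutch segment and mapped onto a JSON-like structure.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");       // assumed field name
        record.put("content", "<html>hello</html>");    // assumed field name

        byte[] cbor = toCbor(record);
        byte[] compressed = gzip(cbor);
        System.out.println("CBOR bytes: " + cbor.length
                + ", gzip bytes: " + compressed.length);
    }
}
```

For a record this small, the gzip output may well be larger than the CBOR input because of the fixed gzip header and trailer, which is one reason to keep compression optional.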