[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1949:
-
Component/s: tool, storage, linkdb, crawldb

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf


 We are going to develop a {{CommonCrawlDataDumper.java}} class. 
 {{CommonCrawlDataDumper}} is a tool that performs the following steps:
 # deserialize the crawled data from Nutch
 # map the deserialized data onto the proper JSON structure
 # serialize the data into [CBOR|http://cbor.io] format
 # optionally, compress the serialized data using {{gzip}}
 The tool must accept either a single Nutch segment or a directory 
 containing segments as input.
 Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support 
 and code.
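
 As a rough illustration of steps 3 and 4 above, the sketch below encodes a 
 small key/value record as a CBOR map of text strings and then gzips the 
 result. It is not the actual {{CommonCrawlDataDumper}} code: the class name 
 {{CommonCrawlSketch}}, the field names {{url}} and {{content}}, and the 
 hand-written encoder are assumptions made so the example runs with the JDK 
 alone. A real implementation would more likely rely on an existing CBOR 
 library and read its records from Nutch segments.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class CommonCrawlSketch {

    // Writes a CBOR item head: the major type sits in the top 3 bits,
    // followed by the (non-negative) length or count argument.
    static void writeHead(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);          // argument fits in the initial byte
        } else if (n < 0x100) {
            out.write(mt | 24);         // 1-byte argument follows
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);         // 2-byte big-endian argument follows
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);         // 4-byte big-endian argument follows
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // Encodes a UTF-8 text string (CBOR major type 3).
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes the record as a definite-length CBOR map (major type 5)
    // whose keys and values are all text strings.
    static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Optional step: compress the serialized bytes with gzip.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical record; the field names are illustrative only.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");
        record.put("content", "<html>hello</html>");

        byte[] cbor = toCbor(record);
        byte[] compressed = gzip(cbor);
        System.out.println("CBOR bytes: " + cbor.length
                + ", gzip bytes: " + compressed.length);
    }
}
```

 For a record this small, gzip adds header and trailer overhead and may not 
 reduce the size at all, which is one reason to keep compression optional.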



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)


Chris A. Mattmann updated NUTCH-1949:
-
Assignee: Lewis John McGibbney  (was: Giuseppe Totaro)

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)


Chris A. Mattmann updated NUTCH-1949:
-
Fix Version/s: 1.10

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-02-25 Thread Giuseppe Totaro (JIRA)


Giuseppe Totaro updated NUTCH-1949:
---
Attachment: CommonCrawlDataDumper.pdf

Please find my workflow diagram attached. I will post an update as soon as 
possible.
Thanks [~chrismattmann] and [~jnioche].

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Attachments: CommonCrawlDataDumper.pdf







[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-02-24 Thread Lewis John McGibbney (JIRA)


Lewis John McGibbney updated NUTCH-1949:

Assignee: Giuseppe Totaro

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro



