[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875754#comment-14875754 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2102:
-------------------------------------------------------

+1. It looks good; the nutch entry will definitely make it easier to use :)

> WARC Exporter
> -------------
>
>                 Key: NUTCH-2102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, dumpers
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>         Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionality.
> This class is called in the following way:
>   ./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/
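
For readers unfamiliar with the output format, here is a minimal sketch of what a single WARC/1.0 "response" record looks like on the wire. It is not taken from the attached patch and uses Python rather than Nutch's Java, purely for brevity. The function name build_warc_response_record and the example URL are hypothetical. The header names come from the WARC specification linked in the description.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def build_warc_response_record(target_uri: str, http_payload: bytes,
                               fetched_at: datetime) -> bytes:
    """Serialize one WARC/1.0 'response' record (hypothetical helper).

    http_payload is the raw HTTP response (status line, headers, body)
    exactly as it was received from the server.
    """
    # WARC-Date is expressed in UTC, in the form YYYY-MM-DDThh:mm:ssZ.
    warc_date = fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                    # version line
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique per record
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",        # length of the block only
    ]
    # A blank line ends the header block; two CRLFs end the record.
    header_block = (CRLF.join(headers) + CRLF + CRLF).encode("utf-8")
    return header_block + http_payload + (CRLF + CRLF).encode("ascii")


if __name__ == "__main__":
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
    record = build_warc_response_record(
        "http://example.com/", payload, datetime.now(timezone.utc))
    print(record.decode("utf-8"))
```

An exporter run as a MapReduce job would presumably emit many such records per output file, one per fetched page. How the patch groups and compresses them is not covered by this sketch.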



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
