[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

Chris A. Mattmann (JIRA) Wed, 10 Feb 2016 07:12:12 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140950#comment-15140950 ]


Chris A. Mattmann commented on NUTCH-2213:
------------------------------------------

Hi [~jrsr], thanks for the issue report. You may also want to have a look at:

http://wiki.apache.org/nutch/CommonCrawlDataDumper

It's probably worth noting here too that whether the tool "scales" is in the 
eye of the beholder. My team at NASA JPL regularly uses it to dump large 
amounts of data (terabytes) from Nutch crawls. It uses a lot of memory and 
isn't as fast as it could be (as [~jnioche] noted). The main reason is that it 
is not implemented as a MapReduce job: right now the tool does everything on 
the head node. If you're interested in the tool or find it useful, we would be 
happy to work with you and/or anyone else to port it to MapReduce, which would 
be trivial.

Cheers,
Chris


> CommonCrawlDataDumper saves gzipped body in extracted form
> ----------------------------------------------------------
>
>                 Key: NUTCH-2213
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2213
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl, dumpers
>            Reporter: Joris Rau
>            Priority: Critical
>              Labels: easyfix
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the Common Crawl data. This file contains several responses that were 
> served gzipped but are stored as plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from the 
> Internet Archive to extract the responses from the WARC file. However, this 
> tool expects the Content-Length field to match the actual length of the body 
> in the WARC ([see the issue on 
> GitHub|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up-to-date version of hanzo warctools, which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However, 
> multiple WARC file parsers will probably have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.
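
The mismatch described in the issue can be checked with a few lines of code. The following is a minimal Python sketch, not taken from Nutch or warctools: it assumes you already have the raw bytes of one HTTP response (status line, headers, blank line, body) as stored in a WARC record. The names check_http_payload and build_response are hypothetical helpers introduced only for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_http_payload(raw: bytes) -> dict:
    """Compare the HTTP headers of one stored response with its actual body.

    `raw` is assumed to be status line + headers + blank line + body.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = (
            value.strip().decode("latin-1")
        )

    declared = headers.get("content-length", "")
    declared_length = int(declared) if declared.isdigit() else None
    return {
        "declared_length": declared_length,
        "actual_length": len(body),
        "length_matches": declared_length == len(body),
        "claims_gzip": "gzip" in headers.get("content-encoding", "").lower(),
        "body_is_gzip": body.startswith(GZIP_MAGIC),
    }


def build_response(body: bytes, content_length: int) -> bytes:
    """Hypothetical helper: build a response that claims gzip encoding."""
    return (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(content_length).encode("ascii") + b"\r\n"
        b"\r\n" + body
    )


if __name__ == "__main__":
    plain = b"a" * 1000
    packed = gzip.compress(plain)
    # Headers describe the gzipped body, but the plaintext body was stored.
    print(check_http_payload(build_response(plain, len(packed))))
```

A record showing the problem reported here would have claims_gzip true, body_is_gzip false, and length_matches false, which is the condition a strict parser rejects.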



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
