Joris Rau created NUTCH-2213:
--------------------------------
Summary: CommonCrawlDataDumper saves gzipped body in extracted form
Key: NUTCH-2213
URL: https://issues.apache.org/jira/browse/NUTCH-2213
Project: Nutch
Issue Type: Bug
Components: commoncrawl, dumpers
Reporter: Joris Rau
Priority: Critical
I have downloaded [a WARC file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
from the Common Crawl data. This file contains several responses that were
gzipped but whose bodies are stored as plain text (without the gzip encoding).
I used [warctools|https://github.com/internetarchive/warctools] from the
Internet Archive to extract the responses from the WARC file. However, this
tool expects the Content-Length field to match the actual length of the body in
the WARC, which is not the case for these records ([see the issue on
GitHub|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
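To illustrate the mismatch, here is a minimal Python sketch using only the standard library. It is not taken from warctools or Nutch; the function names (parse_http_response, check_payload, sample_response) and the sample payload are hypothetical, and it assumes the raw HTTP response bytes of a record have already been extracted from the WARC file.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def parse_http_response(raw: bytes):
    """Split a raw HTTP response into (lower-cased header dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return headers, body


def check_payload(raw: bytes) -> dict:
    """Compare what the HTTP headers claim with what is actually stored."""
    headers, body = parse_http_response(raw)
    declared = headers.get("content-length", "")
    gzip_declared = "gzip" in headers.get("content-encoding", "").lower()
    return {
        "declared_length": int(declared) if declared.isdigit() else None,
        "actual_length": len(body),
        "length_matches": declared.isdigit() and int(declared) == len(body),
        "gzip_declared": gzip_declared,
        "body_is_gzip": body.startswith(GZIP_MAGIC),
    }


def sample_response(store_decompressed: bool) -> bytes:
    """Build a hypothetical response whose headers describe the gzipped body."""
    plain = b"<html>hello</html>"
    compressed = gzip.compress(plain)
    headers = (b"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n"
               b"Content-Length: " + str(len(compressed)).encode("ascii"))
    return headers + b"\r\n\r\n" + (plain if store_decompressed else compressed)


if __name__ == "__main__":
    # Headers and body agree: the body is stored as it was sent.
    print(check_payload(sample_response(store_decompressed=False)))
    # Headers still describe the gzipped body, but the body is stored extracted.
    print(check_payload(sample_response(store_decompressed=True)))
```

In the second case the headers declare gzip and the compressed length while the stored body is the extracted text, which is the situation described above and the kind of record a parser that trusts Content-Length would stumble over.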
I have not used Nutch myself and therefore cannot say which versions are
affected by this.
After reading [the official WARC
draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html], I
could not find out how gzipped content is supposed to be stored. However,
multiple WARC file parsers will probably have an issue with this.
It would be nice to know whether you consider this a bug and plan to fix it,
and whether it is a major issue that affects most WARC files in the Common
Crawl data or only a small part.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)