[ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann resolved NUTCH-2213.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.12

> CommonCrawlDataDumper saves gzipped body in extracted form
> ----------------------------------------------------------
>
>                 Key: NUTCH-2213
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2213
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl, dumpers
>            Reporter: Joris Rau
>            Assignee: Chris A. Mattmann
>            Priority: Critical
>              Labels: easyfix
>             Fix For: 1.12
>
>
> I have downloaded [a WARC
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
> from the Common Crawl data. This file contains several gzipped responses
> whose bodies are stored as plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet
> Archive to extract the responses from the WARC file. However, this tool
> expects the Content-Length field to match the actual length of the body in
> the WARC ([see the issue on
> GitHub|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
> warctools uses a more up-to-date version of hanzo warctools, which is
> recommended on the [Common Crawl
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the
> file format".
> I have not been using Nutch and therefore cannot say which versions are
> affected by this.
> After reading [the official WARC
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I
> could not find out how gzipped content is supposed to be stored. However,
> several WARC file parsers will probably have an issue with this.
> It would be nice to know whether you consider this a bug and plan to fix
> it, and whether this is a major issue that concerns most WARC files of the
> Common Crawl data or only a small part.
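
The mismatch described in the report can be illustrated with a minimal Python
sketch. It is not code from Nutch or warctools: the helper names
(split_http_response, check_body, build_response) are hypothetical, and the
sample response is synthetic. It assumes the stored HTTP headers still carry
the original Content-Encoding and Content-Length while the body was written
out decompressed, and shows how a parser that trusts Content-Length would see
a length that no longer matches.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def split_http_response(raw: bytes):
    """Split a raw HTTP response into (lower-cased header dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("ascii")] = value.strip().decode("latin-1")
    return headers, body


def check_body(raw: bytes) -> dict:
    """Report whether the stored body agrees with its own HTTP headers."""
    headers, body = split_http_response(raw)
    declared = headers.get("content-length")
    return {
        # Does the declared length equal the number of body bytes stored?
        "length_matches": declared is not None
        and declared.isdigit()
        and int(declared) == len(body),
        # Do the headers say the body is gzip-encoded?
        "claims_gzip": "gzip" in headers.get("content-encoding", "").lower(),
        # Does the body actually start like a gzip stream?
        "body_is_gzip": body.startswith(GZIP_MAGIC),
    }


def build_response(body_on_wire: bytes, stored_body: bytes) -> bytes:
    """Hypothetical helper: headers describe body_on_wire, payload is stored_body."""
    head = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(body_on_wire)).encode("ascii") + b"\r\n"
    )
    return head + b"\r\n" + stored_body


plain = b"<html>" + b"hello " * 50 + b"</html>"
wire = gzip.compress(plain)

consistent = check_body(build_response(wire, wire))     # body kept as received
inconsistent = check_body(build_response(wire, plain))  # body stored extracted
print(consistent)
print(inconsistent)
```

In the second case the headers still claim gzip and the compressed length,
but the body is the longer plaintext, which is the situation a
Content-Length-driven reader would trip over.
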
-- This message was sent by Atlassian JIRA (v6.3.4#6332)