[
https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585696#comment-14585696
]
Tomas Hudik commented on PIG-4533:
----------------------------------
Done, now it should make more sense.
> Document error: Pig does support concatenated gz file
> -----------------------------------------------------
>
> Key: PIG-4533
> URL: https://issues.apache.org/jira/browse/PIG-4533
> Project: Pig
> Issue Type: Bug
> Components: documentation, parser
> Reporter: Tomas Hudik
> Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4533-1.patch
>
>
> The documentation (since at least 0.11.1) says:
> http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> _"Note: PigStorage and TextLoader correctly read compressed files as long as
> they are NOT CONCATENATED FILES generated in this manner: ..."_
> This is not true for gz, for two reasons:
> * I ran a test: I concatenated and compressed some files and processed
> them. The same was done with the raw (uncompressed) files. The results
> were identical.
> * According to https://issues.apache.org/jira/i#browse/HADOOP-4012 and
> https://issues.apache.org/jira/i#browse/HADOOP-6835, the concatenation
> problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively, for both
> bz2 and gz. In other words, Hadoop 1 and 2 already support concatenated
> bz2 and gz archives.
> Pig handles bz2 on its own (for historical reasons), which duplicates
> functionality in hadoop-common. This work should therefore be left to
> hadoop-common (Pig no longer needs to handle it).
> The documentation needs to be updated accordingly (concatenated gz and bz2
> files are processed correctly by hadoop-common). A remark that tar.gz and
> tar.bz2 are not supported would also be helpful, since many users use
> tar.gz or tar.bz2 by default.
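
The test described in the issue (compress files, concatenate them, compare against the raw input) can be sketched at the format level as follows. This is only an illustration using Python's standard gzip module, not Pig or Hadoop code, and the sample records are hypothetical. It shows that a concatenation of gzip members is itself a valid gzip stream whose decompressed content equals the concatenated raw data.

```python
import gzip

# Hypothetical sample records standing in for the raw input files.
part_a = b"1\talpha\n2\tbeta\n"
part_b = b"3\tgamma\n"

# Compress each part on its own, then concatenate the compressed bytes.
# This is equivalent to `cat a.gz b.gz > combined.gz` on the command line.
combined_gz = gzip.compress(part_a) + gzip.compress(part_b)

# gzip.decompress reads every member of a multi-member gzip stream.
decoded = gzip.decompress(combined_gz)

# The decoded bytes should match the uncompressed files placed back to back.
raw = part_a + part_b
print(decoded == raw)
```

A reader that stops after the first gzip member would return only the first file's records, which is the failure mode the old documentation note warned about.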
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)