[ https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981383#comment-14981383 ]
Rohini Palaniswamy commented on PIG-4533: ----------------------------------------- Ignore my previous comment. Missed the @Test (expected=IOException.class) on top of that test. > Document error: Pig does support concatenated gz file > ----------------------------------------------------- > > Key: PIG-4533 > URL: https://issues.apache.org/jira/browse/PIG-4533 > Project: Pig > Issue Type: Bug > Components: documentation, parser > Reporter: Tomas Hudik > Assignee: Daniel Dai > Fix For: 0.16.0 > > Attachments: PIG-4533-1.patch > > > Documentation (since 0.11.1 at least) says : > http://pig.apache.org/docs/r0.11.1/func.html#handling-compression > _"Note: PigStorage and TextLoader correctly read compressed files as long as > they are NOT CONCATENATED FILES generated in this manner: ..."_ > This is not true for gz, since > * I did a test - concatenated&compress some files and processed them. The > same was done with the raw files (no compression). The results were identical > Jira's https://issues.apache.org/jira/i#browse/HADOOP-4012 and > https://issues.apache.org/jira/i#browse/HADOOP-6835 says the concatenation > problems were fixed in Hadoop 0.22, Hadoop 0.20 respectively for both: bz2 > and gz. That said Hadoop (1 and 2) are supporting concatenated archives bz2, > gz already. > Pig deals with bz2 on its own(historical reasons) which is redundant to > hadoop-common. Therefore this activity should be left to hadoop-common (there > is no need to be handled by Pig anymore). > The documentation needs to be updated accordingly (concatenated gz, bz2 are > processing correctly with hadoop-commons). Also a remark that tar.gz and > tar.bz2 are not supported would be helpful since many users are using tar.gz > or tar.bz2 automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)