[ https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550037#comment-14550037 ]
Tomas Hudik commented on PIG-4533:
----------------------------------

Based on the discussion on the mailing list (http://mail-archives.apache.org/mod_mbox/pig-user/201505.mbox/browser, click on "concatenated gzip/bzip in Pig 0.11 and higher"), it seems that gzip files are fine, since they are processed by Hadoop, which can handle concatenated gzip (https://issues.apache.org/jira/browse/HADOOP-6835). The problem is with bzip2 files, which are processed by Pig itself.
Since Hadoop can also process concatenated bzip2 files (https://issues.apache.org/jira/browse/HADOOP-4012 or http://stackoverflow.com/a/25888475/1408096), I'd let Hadoop do this job instead of Pig. If so, the documentation (http://pig.apache.org/docs/r0.11.1/func.html#handling-compression) needs to be updated accordingly.

> support of concatenated bz2/gz files
> ------------------------------------
>
>                 Key: PIG-4533
>                 URL: https://issues.apache.org/jira/browse/PIG-4533
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Tomas Hudik
>
> The documentation (since 0.11.1 at least) says:
> http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> _"Note: PigStorage and TextLoader correctly read compressed files as long as
> they are NOT CONCATENATED FILES generated in this manner: ..."_
> I doubt this is still true, since:
> 1. I did a test: I concatenated some files and processed them. However, all the
> results were identical to the ones produced from the non-concatenated
> files. Why? They should be different...
> 2. The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
> https://issues.apache.org/jira/i#browse/HADOOP-6835 say this was fixed in
> Hadoop 0.22 and Hadoop 0.20 respectively. In other words, Hadoop (1 and 2)
> supports this. I suppose Pig does not handle compression on its own but rather
> depends on the hadoop-core (respectively hadoop-common) libraries.
> If I'm right, the documentation should be fixed (delete the part about
> problems with concatenated compressed files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
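The test described in the issue (concatenate compressed files, then compare the output) can be illustrated outside Pig. The following is a minimal sketch using only Python's standard library, not Pig or Hadoop code; the sample data and the helper `first_stream_only` are hypothetical names chosen for illustration. It shows the failure mode the documentation note warns about: a reader that decodes only the first bzip2 stream silently drops everything after it, while a multi-stream-aware reader returns all records.

```python
import bz2
import gzip


def first_stream_only(data: bytes) -> bytes:
    """Decode only the first bzip2 stream, as a single-stream reader would.

    Hypothetical helper: a BZ2Decompressor stops at the end of the first
    stream and leaves the remaining bytes in `unused_data`.
    """
    decomp = bz2.BZ2Decompressor()
    return decomp.decompress(data)


# Made-up sample records standing in for two input files.
part_a = b"alpha\nbeta\n"
part_b = b"gamma\ndelta\n"

# Equivalent to `cat a.bz2 b.bz2 > concat.bz2` (and the same for gzip).
concat_bz2 = bz2.compress(part_a) + bz2.compress(part_b)
concat_gz = gzip.compress(part_a) + gzip.compress(part_b)

# Multi-stream-aware decoding returns the records of both parts.
all_bz2 = bz2.decompress(concat_bz2)
all_gz = gzip.decompress(concat_gz)

# Single-stream decoding returns only the first part.
truncated = first_stream_only(concat_bz2)

print("multi-stream bz2 records:", all_bz2.count(b"\n"))
print("multi-stream gzip records:", all_gz.count(b"\n"))
print("single-stream bz2 records:", truncated.count(b"\n"))
```

If a loader behaved like `first_stream_only`, the results on concatenated input would differ from those on the separate files; identical results, as reported in the issue, suggest the reader in use handles multiple streams.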