[ https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583948#comment-14583948 ]
Rohini Palaniswamy commented on PIG-4533:
-----------------------------------------

+1

> Document error: Pig does support concatenated gz file
> -----------------------------------------------------
>
>                 Key: PIG-4533
>                 URL: https://issues.apache.org/jira/browse/PIG-4533
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation, parser
>            Reporter: Tomas Hudik
>            Assignee: Daniel Dai
>             Fix For: 0.16.0
>
>         Attachments: PIG-4533-1.patch
>
>
> Documentation (since 0.11.1 at least) says:
> http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> _"Note: PigStorage and TextLoader correctly read compressed files as long as
> they are NOT CONCATENATED FILES generated in this manner: ..."_
> This is not true for tar.gz, since
> # I did a test: I concatenated and compressed some files and processed them.
> The same was done with the raw files (no compression). The results were
> identical.
> # The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
> https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation
> problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively. That is to
> say, Hadoop (1 and 2) already supports this.
> Pig is handling tar.bz2 only (tar.gz is handled by hadoop-common).
> Therefore,
> # tar.bz2 should be handled by hadoop-common as well (there is no need for it
> to be handled by Pig anymore). (I believe
> https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be
> removed)
> # the documentation should be corrected accordingly (concatenated tar.gz and
> tar.bz2 files are processed correctly)
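
For readers who want to see what "concatenated gz" means at the format level, here is a minimal sketch of the kind of comparison described in the first test item above. It uses Python's standard gzip module, not Pig or Hadoop, so it only illustrates that gzip members written back to back form a stream that decompresses to the same bytes as the concatenated raw input. The sample records and the helper names gzip_member and read_concatenated are hypothetical and do not come from the issue or the patch.

```python
import gzip
import io


def gzip_member(text: str) -> bytes:
    """Compress one string into a standalone gzip member."""
    return gzip.compress(text.encode("utf-8"))


def read_concatenated(data: bytes) -> str:
    """Decompress bytes holding one or more gzip members back to back."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
        return fh.read().decode("utf-8")


# Hypothetical tab-separated sample records, one "file" per list entry.
parts = ["a\t1\n", "b\t2\n", "c\t3\n"]

# Equivalent in spirit to compressing each file and concatenating the results.
concatenated = b"".join(gzip_member(p) for p in parts)

# The uncompressed baseline: the same files concatenated as plain text.
raw = "".join(parts)

result = read_concatenated(concatenated)
print(result == raw)
```

Whether a given Pig or Hadoop release reads every member of such a stream depends on its codec implementation, which is what the HADOOP issues referenced above are about; this sketch does not test that.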