[ 
https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Hudik updated PIG-4533:
-----------------------------
    Description: 
Documentation (since 0.11.1 at least) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as 
they are NOT CONCATENATED FILES generated in this manner: ..."_

This is not true for gz:
* I ran a test: I concatenated and compressed some files, then processed them. 
I did the same with the raw (uncompressed) files. The results were identical.
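
For illustration only (this is not the Pig test described above): a minimal 
Python sketch of the gzip property the test relies on. It assumes the files 
were gzipped separately and then joined byte for byte; the sample rows are 
made up.

```python
import gzip

# Hypothetical sample data standing in for two input files.
part_a = b"1\talice\n2\tbob\n"
part_b = b"3\tcarol\n"

# Compress each part on its own, then join the compressed bytes,
# producing one multi-member ("concatenated") gzip stream.
concatenated = gzip.compress(part_a) + gzip.compress(part_b)

# A reader that understands multi-member gzip returns every member's data.
decoded = gzip.decompress(concatenated)

# The uncompressed baseline: the raw parts simply joined.
raw = part_a + part_b
print(decoded == raw)
```

A reader that stops after the first member would return only the first part, 
which is the failure the documentation note warns about.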

The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation 
problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively, for both bz2 
and gz. Hadoop 1 and 2 therefore already support concatenated bz2 and gz 
files.

Pig handles bz2 on its own (for historical reasons), which duplicates 
hadoop-common. This handling should therefore be left to hadoop-common (Pig no 
longer needs to do it).

The documentation needs to be updated accordingly (concatenated gz and bz2 
files are processed correctly by hadoop-common). A remark that tar.gz and 
tar.bz2 are not supported would also be helpful, since many users use tar.gz 
or tar.bz2 by default.
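
To show why tar.gz differs from plain gz, here is a small Python sketch 
(illustrative only; the member name and row are hypothetical). Decompressing a 
tar.gz yields a tar archive, so a line-oriented reader would see tar headers 
and padding mixed in with the data.

```python
import gzip
import io
import tarfile

# Hypothetical one-row input file.
payload = b"1\talice\n"

# Build an uncompressed tar archive in memory with a single member.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="part-00000")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Gzip the archive (a tar.gz), then undo only the gzip layer.
tar_gz = gzip.compress(buf.getvalue())
decoded = gzip.decompress(tar_gz)

# The result is the tar archive, not the original text: a 512-byte
# header block comes first, and the data follows it.
print(decoded == payload)
print(decoded[512:512 + len(payload)] == payload)
```

The gzip layer is removed correctly in both cases; the difference is that 
the bytes underneath a tar.gz are a tar container rather than the records 
themselves.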





  was:
Documentation (since 0.11.1 at least) says :
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as 
they are NOT CONCATENATED FILES generated in this manner: ..."_

This is not true for tar.gz, since
# I did a test - concatenated&compress some files and processed them. The same 
was done with the raw files (no compression). The results were identical
# Jira's https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 says the concatenation 
problems were fixed in Hadoop 0.22, Hadoop 0.20 respectively. That said Hadoop 
(1 and 2) are supporting this already. 

Pig is handling tar.bz2 only (tar.gz is handled by hadoop-common). 
Therefore, 
# tar.bz2 should be handled by hadoop-common as well (there is no need to be 
handled by Pig anymore). (I believe 
https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be 
removed)
# correct documentation accordingly (concatenated tar.gz, tar.bz2 are 
processing correctly)






> Document error: Pig does support concatenated gz file
> -----------------------------------------------------
>
>                 Key: PIG-4533
>                 URL: https://issues.apache.org/jira/browse/PIG-4533
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation, parser
>            Reporter: Tomas Hudik
>            Assignee: Daniel Dai
>             Fix For: 0.16.0
>
>         Attachments: PIG-4533-1.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
