[ 
https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Hudik updated PIG-4533:
-----------------------------
    Description: 
The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as 
they are NOT CONCATENATED FILES generated in this manner: ..."_

This is not true for tar.gz, since
# I ran a test: I concatenated and compressed some files and processed them. 
I did the same with the raw files (no compression). The results were identical.
# The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation 
problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively. Hadoop 
(1 and 2) therefore already supports this. 
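
The comparison in item 1 can be approximated outside Pig. The following is a 
minimal Python sketch, not Pig or Hadoop code: it assumes the files were 
compressed individually and then concatenated (as with `cat`), and the sample 
data in `parts` is hypothetical. It only shows that a decoder aware of multiple 
gzip members or bzip2 streams returns the same bytes as the raw concatenation.

```python
import bz2
import gzip

# Hypothetical sample data standing in for the input files of the test.
parts = [b"a\t1\nb\t2\n", b"c\t3\n", b"d\t4\ne\t5\n"]

# Raw (uncompressed) concatenation: the baseline of the comparison.
raw = b"".join(parts)

# Compress each part on its own, then concatenate the compressed streams.
# This mirrors what `cat part1.gz part2.gz > concat.gz` would produce.
concat_gz = b"".join(gzip.compress(p) for p in parts)
concat_bz2 = b"".join(bz2.compress(p) for p in parts)

# Python's one-shot decompressors read every member/stream, not just the first.
gz_ok = gzip.decompress(concat_gz) == raw
bz2_ok = bz2.decompress(concat_bz2) == raw
print(gz_ok, bz2_ok)
```

A decoder that stops after the first member would return only the first part, 
which is the failure the documentation note warns about.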

Pig itself handles only tar.bz2 (tar.gz is handled by hadoop-common). 
Therefore, 
# tar.bz2 should be handled by hadoop-common as well (Pig no longer needs to 
handle it). I believe 
https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be 
removed.
# The documentation should be corrected accordingly (concatenated tar.gz and 
tar.bz2 files are processed correctly).





  was:
Documentation (since 0.11.1 at least) says :
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as 
they are NOT CONCATENATED FILES generated in this manner: ..."_

I doubt this is still true, since
1. I did a test: I concatenated some files and processed them. However, all the
results were identical to the ones that were produced on non-concatenated
files. Why? They should be different...
2. The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and 
https://issues.apache.org/jira/i#browse/HADOOP-6835 say this was fixed in 
Hadoop 0.22 and Hadoop 0.20 respectively. Hadoop (1 and 2) therefore 
supports this. I suppose Pig does not do compression on its own but rather 
depends on the hadoop-core (or hadoop-common, respectively) libraries.

If I'm right, the documentation should be fixed (delete the part about 
problems with concatenated compressed files).







> support of concatenated bz2/gz files
> ------------------------------------
>
>                 Key: PIG-4533
>                 URL: https://issues.apache.org/jira/browse/PIG-4533
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation, parser
>            Reporter: Tomas Hudik
>             Fix For: 0.16.0
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
