[
https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tomas Hudik updated PIG-4533:
-----------------------------
Description:
The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as
they are NOT CONCATENATED FILES generated in this manner: ..."_
This is not true for gz, because:
* I ran a test: I concatenated and compressed some files and processed them,
then did the same with the raw (uncompressed) files. The results were identical.
* The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation
problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively, for both bz2
and gz. So Hadoop (1 and 2) already supports concatenated bz2 and gz
archives.
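
The property being tested above can be illustrated outside of Pig. The following is a minimal sketch in Python, not a reproduction of the Pig or Hadoop code path: it only shows that a decoder which understands multi-member gz and multi-stream bz2 returns the same bytes as the raw input. The sample rows and the helper name concat_compressed are made up for illustration.

```python
import bz2
import gzip


def concat_compressed(parts, compress):
    """Compress each part separately and join the compressed bytes,
    mimicking `cat a.gz b.gz > concat.gz` (hypothetical helper)."""
    return b"".join(compress(p) for p in parts)


# Two made-up "files" of tab-separated rows.
parts = [b"1\talice\n2\tbob\n", b"3\tcarol\n"]
raw = b"".join(parts)

gz_concat = concat_compressed(parts, gzip.compress)
bz2_concat = concat_compressed(parts, bz2.compress)

# Python's gzip and bz2 decoders read every member/stream in the input,
# so the concatenated archives decode to the concatenated raw data.
gz_ok = gzip.decompress(gz_concat) == raw
bz2_ok = bz2.decompress(bz2_concat) == raw
```

A decoder that stopped after the first member would return only the first two rows, which is the failure mode the documentation note warns about.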
Pig handles bz2 on its own (for historical reasons), which duplicates
hadoop-common. This work should therefore be left to hadoop-common (Pig no
longer needs to handle it).
The documentation needs to be updated accordingly (concatenated gz and bz2
files are processed correctly by hadoop-common). A remark that tar.gz and
tar.bz2 are not supported would also be helpful, since many users reach for
tar.gz or tar.bz2 by default.
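
To show why a tar.gz remark would help, here is a small Python sketch. It assumes that a loader treats a .gz file as a plain gzip stream of text lines; that is an assumption for illustration, not a statement about Pig internals. The file name part-00000.txt and the rows are invented. Decompressing a tar.gz yields a tar archive, so the first "row" a line-oriented reader would see is a tar header rather than data.

```python
import gzip
import io
import tarfile


def make_tar_gz(name, payload):
    """Build an in-memory tar.gz holding one file (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz",
                      format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


payload = b"1\talice\n2\tbob\n"
archive = make_tar_gz("part-00000.txt", payload)

# Undo only the gzip layer, as a gz-aware text reader would.
unzipped = gzip.decompress(archive)

# The stream starts with the 512-byte tar header (file name first),
# so it is not the same byte sequence as the original rows.
starts_with_header = unzipped.startswith(b"part-00000.txt")
same_as_payload = unzipped == payload
```

The rows are still present inside the stream, but they are surrounded by tar headers and padding, which is why plain gz (or bz2) is the safer choice for input files.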
was:
The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as
they are NOT CONCATENATED FILES generated in this manner: ..."_
This is not true for tar.gz, because:
# I ran a test: I concatenated and compressed some files and processed them,
then did the same with the raw (uncompressed) files. The results were identical.
# The JIRA issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation
problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively. So Hadoop
(1 and 2) already supports this.
Pig handles tar.bz2 only (tar.gz is handled by hadoop-common).
Therefore:
# tar.bz2 should be handled by hadoop-common as well (Pig no longer needs to
handle it). (I believe
https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be
removed.)
# The documentation should be corrected accordingly (concatenated tar.gz and
tar.bz2 files are processed correctly).
> Document error: Pig does support concatenated gz file
> -----------------------------------------------------
>
> Key: PIG-4533
> URL: https://issues.apache.org/jira/browse/PIG-4533
> Project: Pig
> Issue Type: Bug
> Components: documentation, parser
> Reporter: Tomas Hudik
> Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4533-1.patch
>