[ 
https://issues.apache.org/jira/browse/PIG-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586152#comment-14586152
 ] 

Koji Noguchi commented on PIG-4599:
-----------------------------------

{quote}
I'm just wondering if hadoop doesn't support tar archives what 
org.apache.hadoop.fs.FileUtil.unTar is responsible for? (it untarring files) 
@Koji -do you have some explanation? Or, some documentation...
{quote}
I believe FileUtil.unTar is used when localizing the distributed cache, not for 
reading input in the user's code.

A tar file is a poor fit for Hadoop-style jobs because the per-file metainfo is 
spread throughout the archive. 
Say we tar one thousand files.  The tarball would consist of 
{panel}
<file1_meta><file1_content><file2_meta><file2_content><file3_meta><file3_content>...<file1000_meta><file1000_content>
{panel}
So in order to list the files for job submission, you would need to do a 
read+seek 1000 times, pulling in many blocks.
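
To make the layout concrete, here is a minimal Python sketch (not Hadoop code) that builds a small in-memory tarball and walks it to find where each header sits. The helper names build_tar and header_offsets are made up for illustration, and the plain USTAR format is chosen so that every member has exactly one header block. Listing the members means visiting one header per file, each at a different offset in the archive.

```python
import io
import tarfile

def build_tar(n_files):
    """Build an uncompressed in-memory tar with n_files small members."""
    buf = io.BytesIO()
    # USTAR keeps the layout simple: one 512-byte header per member.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for i in range(1, n_files + 1):
            data = f"content of file{i}\n".encode()
            info = tarfile.TarInfo(name=f"file{i}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def header_offsets(tar_bytes):
    """Return (name, header_offset, data_offset) for every member.

    getmembers() has to walk the whole archive, header by header,
    because there is no central directory to consult.
    """
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        return [(m.name, m.offset, m.offset_data) for m in tar.getmembers()]

archive = build_tar(1000)
offsets = header_offsets(archive)
print(len(offsets), "headers found")
print(offsets[:3])
```

On a local file this walk is cheap, but on HDFS each header may live in a different block, which is the read+seek cost described above.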

What a "har" archive does is store the metainfo separately, so that metainfo 
queries (like liststatus) can run relatively fast.
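
As a rough illustration of the idea only, the following Python sketch keeps a separate index next to one concatenated data blob. This is a toy layout and not the real har on-disk format; build_archive, list_files and read_file are hypothetical names. It shows that listing needs only the small index, and that a single file can be read with one seek into the data.

```python
import json

def build_archive(files):
    """Pack files (name -> bytes) into one data blob plus a separate index."""
    index = {}
    parts = []
    pos = 0
    for name, data in files.items():
        index[name] = {"offset": pos, "length": len(data)}
        parts.append(data)
        pos += len(data)
    return b"".join(parts), json.dumps(index).encode()

def list_files(index_bytes):
    """List member names by reading the index only; the data blob is untouched."""
    return sorted(json.loads(index_bytes))

def read_file(blob, index_bytes, name):
    """Fetch one member with a single offset lookup into the data blob."""
    entry = json.loads(index_bytes)[name]
    return blob[entry["offset"]:entry["offset"] + entry["length"]]

files = {f"file{i}": f"content of file{i}\n".encode() for i in range(1, 4)}
blob, index_bytes = build_archive(files)
print(list_files(index_bytes))
print(read_file(blob, index_bytes, "file2"))
```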


> tar.gz compression doesn't produce correct output
> -------------------------------------------------
>
>                 Key: PIG-4599
>                 URL: https://issues.apache.org/jira/browse/PIG-4599
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.1
>            Reporter: Tomas Hudik
>              Labels: compression, easytest
>
> I'm not completely sure whether this is the right place to put this issue 
> since Pig is involved; however, Pig leaves decompression of tar.gz to 
> hadoop-common.
> How to reproduce the issue: 
> # simple file (file1) with arbitrary text lines put into in1 in HDFS
> # same file (file1) compressed by tar -cvzf file1.tar.gz file put into in2 in 
> HDFS
> # issue simple pig commands in pig:
> {quote}
> raw = load 'in1/' USING TextLoader AS (line: bytearray);
> dump raw;
> {quote}
> run for both (compressed and uncompressed file)
> # in the compressed case you will get a strange first line
> {quote}
> a0000644000570000001440000000002512534073736011260 0ustar loadhadoopusersa
> ...
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
