[ 
https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547882
 ] 

Benjamin Reed commented on PIG-42:
----------------------------------

There are two reasons I use an empty file with a comment:

1) It allows me to test that a gzip file is in fact splittable. We need to know 
up front that we can split the gzip file. If the gzip isn't split at regular 
intervals, it's going to waste a lot of time! The signature is more than a 
marker; it is metadata indicating that the file can be split. You will also 
notice that if you run 'head' on the file you can see that it is splittable.

2) It gives you a much more reliable signature. (20 bytes instead of 4)
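
To make the "empty file with a comment" idea concrete, here is a minimal Python 
sketch of such a gzip member. It is an illustration under assumptions, not the 
code from gzip.patch: the function name make_signature and the comment text are 
hypothetical, and the byte layout follows the gzip header format (FCOMMENT flag, 
zero-terminated comment) rather than whatever exact 20 bytes the patch uses.

```python
import gzip
import struct
import zlib

FCOMMENT = 0x10  # gzip header flag: a zero-terminated comment follows the header


def make_signature(comment: bytes = b"hypothetical-sync-marker") -> bytes:
    """Return an empty gzip member whose header carries `comment`.

    The comment text is an assumption for illustration only.
    """
    if b"\x00" in comment:
        raise ValueError("comment must not contain NUL bytes")
    # Fixed 10-byte header: magic, CM=8 (deflate), flags, mtime=0, xfl=0, OS=255
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FCOMMENT, 0, 0, 255)
    # Raw deflate stream (wbits=-15) holding zero bytes of payload
    body = zlib.compressobj(9, zlib.DEFLATED, -15).flush()
    # Trailer: CRC32 of the empty payload and its length, both little-endian
    trailer = struct.pack("<II", zlib.crc32(b""), 0)
    return header + comment + b"\x00" + body + trailer


signature = make_signature()
# A standard decompressor sees the member as valid and empty.
print(gzip.decompress(signature))
```

Because the member decompresses to nothing, placing it between real gzip 
members does not change the decompressed output.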

You can still use standard tools without using Pig:

cat signature.gz > test.gz
gzip -c test1 >> test.gz
cat signature.gz >> test.gz
gzip -c test2 >> test.gz

You use standard gunzip to decompress. You can also easily find the split 
boundaries outside of Pig by looking for the signature.gz byte sequence.
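
As a rough sketch of locating split boundaries outside of Pig, the Python below 
scans a concatenated file for the signature bytes and decompresses one segment 
with the standard gzip module. The SIGNATURE constant, find_boundaries, and 
read_segment are hypothetical names, and the signature bytes are a stand-in 
(an empty gzip member with the comment "sync"), not the one from the patch.

```python
import gzip
from typing import List

# Hypothetical stand-in signature: an empty gzip member with comment "sync".
SIGNATURE = (
    b"\x1f\x8b\x08\x10"      # magic, deflate, FCOMMENT flag
    b"\x00\x00\x00\x00"      # mtime = 0
    b"\x00\xff"              # extra flags, OS = unknown
    b"sync\x00"              # zero-terminated comment
    b"\x03\x00"              # empty raw deflate stream
    b"\x00\x00\x00\x00"      # CRC32 of empty payload
    b"\x00\x00\x00\x00"      # uncompressed size = 0
)


def find_boundaries(data: bytes, signature: bytes = SIGNATURE) -> List[int]:
    """Return the byte offset of every occurrence of `signature` in `data`."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(signature, pos + len(signature))
    return offsets


def read_segment(data: bytes, start: int, end: int) -> bytes:
    """Decompress the gzip members between two boundary offsets."""
    return gzip.decompress(data[start:end])


# Same layout as the shell commands: signature, test1, signature, test2.
data = (SIGNATURE + gzip.compress(b"first\n")
        + SIGNATURE + gzip.compress(b"second\n"))
bounds = find_boundaries(data)
print(read_segment(data, bounds[1], len(data)))
```

A byte scan like this can in principle match inside compressed data by chance, 
which is one reason a longer signature is more reliable than a short one.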

This also allows you to better control the grouping. If your gzip file is 
bigger than 4G, it will be a concatenation, so there may be times when you want 
to process concatenated gzip files together without splitting. Using the empty 
signature file allows you to do that.

Now that I think about it more, it might also be good to reserve some bytes in 
the signature.gz to put a block size. That way we can do intelligent splits 
when the fs blocksize doesn't correspond to the gzip blocksize or the number of 
requested splits is very high.
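
One way the block-size idea could look, purely as a sketch: store the size as 
fixed-width hex digits in the comment so the signature keeps a constant length 
and a constant prefix to scan for. Everything here (the MARKER text, the 
16-digit encoding, and both function names) is an assumption for illustration; 
nothing in the issue specifies this layout.

```python
import gzip
import struct

MARKER = b"sync:"  # hypothetical fixed prefix inside the gzip comment
# 10-byte gzip header with the FCOMMENT flag (0x10) set, mtime=0, OS=255
HEADER = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0x10, 0, 0, 255)
EMPTY_DEFLATE = b"\x03\x00"          # raw deflate stream with no payload
TRAILER = struct.pack("<II", 0, 0)   # CRC32 and length of the empty payload


def make_signature(block_size: int) -> bytes:
    """Build an empty gzip member whose comment records `block_size`."""
    if not 0 <= block_size < 1 << 64:
        raise ValueError("block_size must fit in 64 bits")
    # Fixed-width hex keeps every signature the same length.
    comment = MARKER + b"%016x" % block_size
    return HEADER + comment + b"\x00" + EMPTY_DEFLATE + TRAILER


def read_block_size(signature: bytes) -> int:
    """Recover the block size stored by make_signature."""
    prefix = HEADER + MARKER
    if not signature.startswith(prefix):
        raise ValueError("not a recognised signature")
    start = len(prefix)
    return int(signature[start:start + 16], 16)


sig = make_signature(64 * 1024 * 1024)
print(read_block_size(sig))
```

A splitter could then scan for HEADER + MARKER and read the size from the 
first signature it finds before deciding how to group segments.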

> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
>                 Key: PIG-42
>                 URL: https://issues.apache.org/jira/browse/PIG-42
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Benjamin Reed
>         Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files. 
> Unfortunately, we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When 
> gzipped files are concatenated together they are treated as a single file. So 
> to make a gzipped file splittable we can use an empty compressed file with 
> some salt in the headers as a sync signature. Then we can make the gzip file 
> splittable by using this sync signature between compressed segments of the 
> file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
