[
https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629166#action_12629166
]
Tom White commented on PIG-42:
------------------------------
It would be nice if the format could be generated using standard tools. If the
sync signature is carried in the gzip header's file name field (FNAME, which
the gzip tool can set) rather than in the comment field (FCOMMENT, which it
cannot), we can generate compatible files using the following:
{noformat}
touch -mt 197007130719.25 Split
gzip -c Split file1 Split file2 > file.gz
{noformat}
The first Split member of file.gz then has the following hexdump:
{noformat}
hexdump -n 26 -C file.gz
00000000  1f 8b 08 08 6d ca fe 00  00 03 53 70 6c 69 74 00  |....m.....Split.|
00000010  03 00 00 00 00 00 00 00  00 00                    |..........|
0000001a
{noformat}
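To make the dump easier to read, here is a minimal sketch that decodes those 26 bytes using the field layout from RFC 1952. The names MARKER and parse_header are made up for this illustration and are not taken from gzip.patch. Under this reading the MTIME field decodes to 1970-07-13 06:19:25 UTC, which would match the touch timestamp if the command was run in a UTC+1 time zone (an assumption on my part).

```python
import gzip
import struct
from datetime import datetime, timezone

# The 26 bytes shown in the hexdump, grouped by gzip field (RFC 1952).
MARKER = bytes.fromhex(
    "1f8b"          # ID1, ID2: gzip magic
    "08"            # CM: deflate
    "08"            # FLG: FNAME bit set
    "6dcafe00"      # MTIME, little-endian
    "00"            # XFL
    "03"            # OS: Unix
    "53706c697400"  # file name "Split", NUL-terminated
    "0300"          # empty deflate stream
    "00000000"      # CRC32 of the (empty) uncompressed data
    "00000000"      # ISIZE: uncompressed length 0
)

FNAME = 0x08     # FLG bit: original file name present
FCOMMENT = 0x10  # FLG bit: comment present


def parse_header(data):
    """Decode the fixed 10-byte gzip header and the optional file name."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack_from("<2sBBIBB", data)
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    name = None
    if flags & FNAME:
        end = data.index(b"\x00", 10)
        name = data[10:end].decode("latin-1")
    return {
        "method": method,
        "has_name": bool(flags & FNAME),
        "has_comment": bool(flags & FCOMMENT),
        "mtime": datetime.fromtimestamp(mtime, tz=timezone.utc),
        "os": os_code,
        "name": name,
    }


if __name__ == "__main__":
    for key, value in parse_header(MARKER).items():
        print(key, value)
    # The marker is itself a complete gzip member holding zero bytes.
    print(gzip.decompress(MARKER))
```

Because the marker is a valid member that decompresses to nothing, ordinary gzip readers should skip over it without changing the decompressed output.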
Note that the OS field is 03 (Unix) rather than FF (unknown), but that should
be OK because the code doesn't use it when searching for the signature.
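As a rough illustration of how a reader could find split points in such a file, the sketch below scans for the 26-byte marker while ignoring the OS byte at offset 9. This is only an assumption about how the search might work; it is not the code in gzip.patch, and SYNC, is_sync, split_offsets, write_splittable and read_split are hypothetical names. write_splittable stands in for the `gzip -c Split file1 Split file2` command by placing a marker before each compressed segment.

```python
import gzip

# Marker bytes as in the hexdump: header, "Split\0", empty deflate stream,
# CRC32 and ISIZE both zero.
SYNC = bytes.fromhex("1f8b08086dcafe00" "0003" "53706c697400" "0300" + "00" * 8)
OS_OFFSET = 9  # position of the OS byte, which the match ignores


def is_sync(data, pos):
    """True if a marker starts at pos, comparing every byte except OS."""
    chunk = data[pos:pos + len(SYNC)]
    return (
        len(chunk) == len(SYNC)
        and chunk[:OS_OFFSET] == SYNC[:OS_OFFSET]
        and chunk[OS_OFFSET + 1:] == SYNC[OS_OFFSET + 1:]
    )


def split_offsets(data):
    """Return the byte offsets of all markers found in data."""
    offsets = []
    prefix = SYNC[:OS_OFFSET]
    pos = data.find(prefix)
    while pos != -1:
        if is_sync(data, pos):
            offsets.append(pos)
        pos = data.find(prefix, pos + 1)
    return offsets


def write_splittable(segments, os_code=0x03):
    """Concatenate gzip members, each preceded by a marker member."""
    marker = SYNC[:OS_OFFSET] + bytes([os_code]) + SYNC[OS_OFFSET + 1:]
    out = bytearray()
    for segment in segments:
        out += marker + gzip.compress(segment)
    return bytes(out)


def read_split(data, start, end):
    """Decompress the members between two marker offsets."""
    return gzip.decompress(data[start:end])


if __name__ == "__main__":
    data = write_splittable([b"first\n" * 1000, b"second\n" * 1000])
    offsets = split_offsets(data)
    print(offsets)
    print(read_split(data, offsets[1], len(data))[:7])
```

Each slice that starts at a marker is a well-formed gzip stream, so a split can be decompressed without reading anything before it. A chance occurrence of all 25 compared bytes inside compressed data is extremely unlikely but not impossible.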
> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
> Key: PIG-42
> URL: https://issues.apache.org/jira/browse/PIG-42
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Benjamin Reed
> Assignee: Benjamin Reed
> Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files.
> Unfortunately, we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When
> gzipped files are concatenated together they are treated as a single file. So
> to make a gzipped file splittable we can use an empty compressed file with
> some salt in the headers as a sync signature. Then we can make the gzip file
> splittable by using this sync signature between compressed segments of the
> file.