[jira] [Commented] (HADOOP-7076) Splittable Gzip

Niels Basjes (Commented) (JIRA) Fri, 09 Dec 2011 04:53:08 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166134#comment-13166134 ]


Niels Basjes commented on HADOOP-7076:
--------------------------------------

@Luke: Thanks for your feedback.

Regarding the class name:
The other approach to making Gzip input files splittable (HADOOP-6153 ... 
seems quite dead at this moment) is called "RAGzip" (Random Access Gzip) and 
appears to have been implemented as an extension within the regular GzipCodec 
class.

Because my implementation is based upon the GzipCodec class and the 
SplittableCompressionCodec interface, I chose the most sensible name I could 
think of: SplittableGzipCodec.

This codec will and should be disabled by default. The only way you can enable 
it is by reading the documentation and following the instructions described 
there. This way, I think, users are confronted with the things to consider when 
using it, including the alternative approaches to processing the data in 
parallel. From that point of view I do not see the benefit of a different name.

Overall I still prefer the class name that is in the patch at this moment: 
SplittableGzipCodec.

Do you agree?
                
> Splittable Gzip
> ---------------
>
>                 Key: HADOOP-7076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7076
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>         Attachments: HADOOP-7076-2011-01-26.patch, 
> HADOOP-7076-2011-01-29.patch, HADOOP-7076-2011-02-05.patch, 
> HADOOP-7076-2011-02-06.patch, HADOOP-7076-2011-05-18.patch, 
> HADOOP-7076-2011-08-05-2255.patch, HADOOP-7076-2011-08-05-2315.patch, 
> HADOOP-7076-2011-12-04-2332.patch, HADOOP-7076-branch-0.22.patch, 
> HADOOP-7076.patch
>
>
> Files compressed with the gzip codec are not splittable due to the nature of 
> the codec.
> This limits the options you have for scaling out when reading large gzipped 
> input files.
> Given the fact that gunzipping a 1GiB file usually takes only 2 minutes I 
> figured that for some use cases wasting some resources may result in a 
> shorter job time under certain conditions.
> So reading the entire input file from the start for each split (wasting 
> resources!!) may lead to additional scalability.
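
The idea in the issue description can be illustrated with a minimal sketch. This 
is not the code from the attached patches: the function name read_split is 
hypothetical, and for simplicity the split boundaries here are offsets into the 
uncompressed data, which may differ from how the patch defines its splits. Each 
reader decompresses from the start of the file, throws away everything before 
its own split, and keeps only its own range.

```python
import gzip
import io


def read_split(gz_bytes: bytes, start: int, end: int) -> bytes:
    """Return uncompressed bytes [start, end) of a gzip stream.

    Hypothetical helper: gzip offers no random access, so the reader
    always decompresses from byte 0 and discards data before `start`.
    That discarded work is the "wasted resources" trade-off.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        remaining = start
        while remaining > 0:
            # Decompress and discard in bounded chunks to limit memory use.
            chunk = stream.read(min(remaining, 64 * 1024))
            if not chunk:
                return b""  # split starts at or beyond the end of the data
            remaining -= len(chunk)
        return stream.read(max(0, end - start))


# Usage: four "tasks" each read one quarter of the same gzipped input.
data = b"".join(b"record %d\n" % i for i in range(1000))
gz = gzip.compress(data)
step = -(-len(data) // 4)  # ceiling division
parts = [read_split(gz, s, s + step) for s in range(0, len(data), step)]
```

With N splits the file is decompressed N times in total, but the readers can run 
in parallel, so the elapsed time is bounded by one full decompression plus the 
processing of a single split.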

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
