[ 
https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Broberg updated HADOOP-7909:
--------------------------------

    Summary: Implement a generic splittable signature-based compression format  
(was: Implement Splittable Gzip based on a signature in a gzip header field)

Renaming to reflect abandonment of gzip compatibility as a goal.
                
> Implement a generic splittable signature-based compression format
> -----------------------------------------------------------------
>
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
>  - add a more robust header such that false matches are vanishingly unlikely
>  - repeat initial bytes of the header for very fast split searching
>  - break down the stream into modest size chunks (~64k?) for rapid parallel 
> encode and decode
>  - provide length information on the blocks in advance to make block decode 
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header 
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where 
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this 
> record
> <compressedRecordLength> := 32-bit length of this record as compressed, 
> including all headers, trailers
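
The header layout above can be sketched in code. This is a minimal illustration, not the proposed implementation: the little-endian byte order (chosen to match gzip's own fields), the version value 1, and the function names are assumptions that the description does not specify.

```python
import struct

# Assumed layout: 1-byte version, 23-byte signature, two 32-bit lengths.
# Little-endian is an assumption; the proposal does not state a byte order.
SPLIT_HEADER = struct.Struct("<B23sII")
VERSION = 1  # assumed value of the version byte


def pack_split_header(signature: bytes, uncompressed_len: int, record_len: int) -> bytes:
    """Build the 32-byte <sh> payload."""
    if len(signature) != 23 or signature[:7] != signature[:1] * 7:
        raise ValueError("signature must be 23 bytes starting with 7 equal bytes")
    return SPLIT_HEADER.pack(VERSION, signature, uncompressed_len, record_len)


def parse_split_header(buf: bytes):
    """Return (version, signature, uncompressed_len, record_len)."""
    if len(buf) != SPLIT_HEADER.size:
        raise ValueError("split header must be %d bytes" % SPLIT_HEADER.size)
    return SPLIT_HEADER.unpack(buf)
```

SPLIT_HEADER.size comes out to 32 here, which together with a 4-byte extra-field subheader gives the 36 bytes mentioned above.
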
> If multiple extra headers are present and the split header is not the first 
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as 
> BlockCompressorStream does. Non-split-aware decoders will ignore this header 
> and decode the appended blocks without ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of 
> Moore's law before collisions become a significant concern.
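
As a rough feel for what 132 bits buys, the expected number of chance matches can be estimated. This back-of-the-envelope sketch assumes the scanned bytes behave like uniform random data (a plausible stand-in for compressed output, but an assumption); the function name is illustrative.

```python
from fractions import Fraction


def expected_false_matches(bytes_scanned: int, entropy_bits: int = 132) -> Fraction:
    """Expected chance matches if each offset matches with probability 2**-entropy_bits."""
    return Fraction(bytes_scanned, 2 ** entropy_bits)


# Scanning one exbibyte (2**60 bytes) of random-looking data:
print(float(expected_false_matches(2 ** 60)))  # 2**-72, roughly 2.1e-22
```
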
> The first byte of the signature is repeated 7 times for speed: wherever the 
> header lands, some 4-byte-aligned word falls entirely inside the run. When 
> splitting, the signature search will look for the 32-bit value aaaa every 4 
> bytes until a hit is found. The next 4 bytes then give the alignment of the 
> header mod 4, identifying a potential header match, and the whole header is 
> validated at that offset. So, there is a load, compare, branch, and increment 
> per 4 bytes searched.
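
The search loop described above might look like the following. It is a sketch under stated assumptions: the function name is made up, it returns the offset of the signature rather than of the version byte that precedes it, and it assumes the eighth signature byte differs from the repeated first byte so that the end of the run is unambiguous. Python slicing stands in for the aligned word loads, so this shows the logic, not the speed.

```python
def find_signature(data: bytes, sig: bytes, start: int = 0) -> int:
    """Return the offset of sig in data at or after start, or -1 if absent."""
    a = sig[0]
    word = bytes([a]) * 4          # the 32-bit value "aaaa"
    i = (start + 3) & ~3           # round up to a 4-byte boundary
    n = len(data)
    while i + 8 <= n:
        if data[i:i + 4] == word:
            # Count how many bytes of the next word still belong to the run.
            nxt = data[i + 4:i + 8]
            k = 0
            while k < 4 and nxt[k] == a:
                k += 1
            if k < 4:
                # The 7-byte run ends at i + 3 + k, so it starts 6 bytes earlier.
                s = i + k - 3
                if s >= start and data[s:s + len(sig)] == sig:
                    return s
            # k == 4 means the run continues; a later aligned word settles it.
        i += 4
    return -1
```

The common path is one slice compare and one increment per 4 bytes; the alignment step and full validation run only after an aligned hit.
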
> The existing gzip implementations do not provide access to the optional 
> header fields (nor comment nor filename), so the entire gzip header will have 
> to be reimplemented and compression will need to be done using the raw 
> deflate options of the native library / built in deflater.
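
To make the framing concrete, here is a sketch of writing one gzip member by hand around raw deflate output, following the member layout in RFC 1952. It uses Python's zlib rather than the Hadoop native library purely for illustration; the subfield ID b"sh" and the function name are hypothetical, and the 32-byte split header is passed in as opaque placeholder bytes.

```python
import gzip
import struct
import zlib

SUBFIELD_ID = b"sh"  # hypothetical two-byte subfield ID (SI1, SI2)


def gzip_member(block: bytes, extra_payload: bytes, level: int = 6) -> bytes:
    """Wrap one block as a complete gzip member carrying an extra subfield."""
    # Negative wbits selects raw deflate: no zlib or gzip wrapper is emitted.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    body = comp.compress(block) + comp.flush()
    # Subfield = 2-byte ID + 2-byte length (the 4-byte subheader) + payload.
    subfield = SUBFIELD_ID + struct.pack("<H", len(extra_payload)) + extra_payload
    header = (
        b"\x1f\x8b"          # ID1, ID2
        b"\x08"              # CM = deflate
        b"\x04"              # FLG = FEXTRA
        b"\x00\x00\x00\x00"  # MTIME = 0 (not recorded)
        b"\x00"              # XFL
        b"\xff"              # OS = unknown
        + struct.pack("<H", len(subfield))  # XLEN
        + subfield
    )
    trailer = struct.pack("<II", zlib.crc32(block) & 0xFFFFFFFF, len(block) & 0xFFFFFFFF)
    return header + body + trailer


# Usage: a stock decoder reads the concatenated members as one stream.
data = b"example " * 10000
stream = b"".join(gzip_member(data[i:i + 65536], bytes(32))
                  for i in range(0, len(data), 65536))
assert gzip.decompress(stream) == data
```

In a real encoder the payload would be built after compression, since the compressed record length is only known once the deflate output exists.
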
> There will be some degradation when using splittable gzip:
>  - The gzip headers will each be 36 bytes larger. (4 byte extra header 
> header, 32 byte extra header)
>  - There will be one gzip header per block.
>  - History will have to be reset with each block to allow starting from 
> scratch at that offset, so some bytes will be emitted as literals that would 
> otherwise have been encoded as string matches.
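
The cost of resetting history can be seen with a small experiment. This sketch uses Python's zlib with raw deflate as a stand-in for the native library; the block size, test data, and helper names are arbitrary choices for illustration.

```python
import random
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress with raw deflate (no wrapper), starting from empty history."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def deflate_blocks(data: bytes, block_size: int):
    """Compress each block independently so any block can be decoded alone."""
    return [raw_deflate(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


# A 1000-byte random pattern repeated 64 times: later copies are cheap
# back-references, but only while an earlier copy is still in the history.
rng = random.Random(0)
pattern = bytes(rng.getrandbits(8) for _ in range(1000))
data = pattern * 64

whole = raw_deflate(data)
blocks = deflate_blocks(data, 4000)
print(len(whole), sum(len(b) for b in blocks))
```

Each independently compressed block has to spell out the pattern as literals once before it can reference it, so the per-block total comes out larger than the single stream. Per-block gzip headers would add to that.
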
> Issues to consider:
>  - Is the searching fast enough without the repeating 7 bytes in the 
> signature?
>  - Should this be a patch to the existing gzip classes to add a switch, or 
> should this be a whole new class?
>  - Which level does this belong at? CompressionStream? Compressor?
>  - Is it more advantageous to encode the signature into the less dense 
> comment field?
>  - Optimum block size? Smaller blocks split faster and may conserve memory, 
> larger blocks provide a slightly better compression ratio.


        
