[ 
https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167741#comment-13167741
 ] 

Tim Broberg commented on HADOOP-7909:
-------------------------------------

Still doing mental experiments here, I'm not advocating a particular approach 
yet...

1 - Is there a fallback if we "fail" to split? What happens when a split 
"fails"? Consider the second and subsequent splits: they search from their 
start offsets to the end and find no signatures, so they output 0 bytes. The 
first split starts output from offset 0 and extends its end until it finds a 
signature or hits EOF, so it outputs the whole thing. In this case, by using 
the splittable decode on a non-splittable stream, we waste time searching for 
split headers that don't exist, but the output is correct. Not ideal, but not 
fatal either. If mixing splittable and non-splittable streams were a rare 
phenomenon, this might make sense, but probably not.
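To make that fallback concrete, here is a minimal sketch of the split-bounds 
logic; the names (SigIndex, bounds) are hypothetical, not Hadoop APIs. With a 
plain .gz carrying no signatures, every split but the first outputs nothing 
and the first split decodes the whole file:

```java
// Hedged sketch, assuming a signature index abstraction over the input.
public class SplitBounds {
    /** Offset of the first signature at or after off, or -1 if none. */
    interface SigIndex { long nextSig(long off); }

    /** Returns {outStart, outEnd}: the byte range this split decodes. */
    static long[] bounds(SigIndex idx, long splitStart, long splitEnd, long eof) {
        // The first split always starts at 0; later splits start at the
        // first signature at or after their assigned start offset.
        long start = (splitStart == 0) ? 0 : idx.nextSig(splitStart);
        if (start < 0 || start >= splitEnd) {
            return new long[] {0, 0}; // no signature in range: output nothing
        }
        // Extend past the nominal end to the next signature (or EOF).
        long end = idx.nextSig(splitEnd);
        return new long[] {start, end < 0 ? eof : end};
    }
}
```

With no signatures anywhere, nextSig always returns -1, so split 1 gets 
[0, eof) and every other split gets an empty range -- correct output, wasted 
search.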

2 - Can we fail over to HADOOP-7076? What if the first split "succeeds" and 
the second "fails"? For HADOOP-7076, I imagine handling this by attempting to 
decode each split with the signature approach, searching up to some 
reasonable max block size (1MB? 2MB?) for a max search time of ~1us. If we 
are unable to find a splittable gzip header at that point, or at any 
subsequent point when we look for the next gzip header, we fail over from the 
block-mode regime to the stream-mode regime of the HADOOP-7076 decoder for 
the remainder of that split.
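The failover regime might look like the following sketch (all names 
hypothetical; decodeBlockAt stands in for the bounded header search plus 
one-block decode, and the stream decoder itself is elided):

```java
// Hedged sketch of per-split failover from block mode to stream mode.
public class FailoverDecode {
    static final long MAX_SEARCH = 1L << 20; // cap on the header search (~1 MB)

    /** Decode one block at off; return bytes consumed, or -1 when no
     *  valid header was found within MAX_SEARCH bytes of off. */
    interface BlockReader { long decodeBlockAt(long off); }

    /** Block-decode until the search fails or the split is exhausted.
     *  Returns the offset where stream-mode decoding should take over,
     *  which equals end when block mode covered the whole split. */
    static long blockDecode(BlockReader r, long start, long end) {
        long pos = start;
        while (pos < end) {
            long consumed = r.decodeBlockAt(pos);
            if (consumed < 0) return pos; // fail over to stream mode here
            pos += consumed;
        }
        return end;
    }
}
```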

I guess my current problem with approach #2 here is a mismatch in buffering / 
caching needs between the two approaches. The block approach wants nice big 
buffers, and the signature search will walk over a significant amount of data 
before failing. We don't want to read this data twice. For each fail, it's 
wasteful to load up O(1MB) of data just to feed it into a stream decompressor 
that is perfectly happy to process the data in 4k hunks. Multiply that 1MB by 
the number of open compression streams, and this could add up to some pretty 
significant overhead for no real gain.

I'm trying to avoid adding to the complexity of the interface with more 
codecs and formats, but at this point the costs don't seem worth the benefits 
to me.

I'm inclined to call it a different format, as Niels originally suggested, 
despite the fact that it is read-compatible with .gz files.

                
> Implement Splittable Gzip based on a signature in a gzip header field
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
>  - add a more robust header such that false matches are vanishingly unlikely
>  - repeat initial bytes of the header for very fast split searching
>  - break down the stream into modest size chunks (~64k?) for rapid parallel 
> encode and decode
>  - provide length information on the blocks in advance to make block decode 
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header 
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where 
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this 
> record
> <compressedRecordLength> := 32-bit length of this record as compressed, 
> including all headers, trailers
> If multiple extra headers are present and the split header is not the first 
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as 
> BlockCompressorStream does. Non-split-aware decoders will ignore this header 
> and decode the appended blocks without ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of 
> Moore's law before collisions become a significant concern.
> The first 7 bytes are repeated for speed. When splitting, the signature 
> search will look for the 32-bit value aaaa every 4 bytes until a hit is 
> found, then the next 4 bytes identify the alignment of the header mod 4 to 
> identify a potential header match, then the whole header is validated at that 
> offset. So, there is a load, compare, branch, and increment per 4 bytes 
> searched.
> The existing gzip implementations do not provide access to the optional 
> header fields (nor comment nor filename), so the entire gzip header will have 
> to be reimplemented and compression will need to be done using the raw 
> deflate options of the native library / built in deflater.
> There will be some degradation when using splittable gzip:
>  - The gzip headers will each be 36 bytes larger. (4-byte extra-field 
> subheader plus the 32-byte split header)
>  - There will be one gzip header per block.
>  - History will have to be reset with each block to allow starting from 
> scratch at that offset resulting in some uncompressed bytes that would 
> otherwise have been strings.
> Issues to consider:
>  - Is the searching fast enough without the repeating 7 bytes in the 
> signature?
>  - Should this be a patch to the existing gzip classes to add a switch, or 
> should this be a whole new class?
>  - Which level does this belong at? CompressionStream? Compressor?
>  - Is it more advantageous to encode the signature into the less dense 
> comment field?
>  - Optimum block size? Smaller blocks split faster and may conserve memory; 
> larger blocks provide a slightly better compression ratio.
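The search loop described in the quoted proposal (a 32-bit probe every 4 
bytes, an alignment fix-up, then full validation) can be sketched as follows. 
The signature bytes here are placeholders; the real ones would be randomly 
generated once and frozen in the format spec:

```java
// Hedged sketch of the repeated-lead-in signature search.
public class SignatureSearch {
    static final byte LEAD = (byte) 0xA7;        // placeholder for the repeated byte "a"
    static final byte[] SIGNATURE = new byte[23];
    static {
        for (int i = 0; i < 7; i++) SIGNATURE[i] = LEAD;      // aaaaaaa
        for (int i = 7; i < 23; i++) SIGNATURE[i] = (byte) i; // b..r stand-ins
    }

    /** Offset of the first full signature in buf, or -1. Because the
     *  first 7 signature bytes are identical, some 4-aligned probe is
     *  guaranteed to land entirely inside them, so stepping 4 bytes at
     *  a time cannot skip over a signature. */
    static int findSignature(byte[] buf) {
        for (int i = 0; i + 4 <= buf.length; i += 4) {
            if (buf[i] == LEAD && buf[i + 1] == LEAD
                    && buf[i + 2] == LEAD && buf[i + 3] == LEAD) {
                // Probe hit: the signature can start up to 3 bytes earlier.
                for (int s = Math.max(0, i - 3); s <= i; s++) {
                    if (matchesAt(buf, s)) return s;
                }
            }
        }
        return -1;
    }

    static boolean matchesAt(byte[] buf, int off) {
        if (off + SIGNATURE.length > buf.length) return false;
        for (int j = 0; j < SIGNATURE.length; j++) {
            if (buf[off + j] != SIGNATURE[j]) return false;
        }
        return true;
    }
}
```

The hot path is one load, compare, branch, and increment per 4 bytes 
searched, matching the cost the proposal cites; the fix-up and validation 
only run on probe hits.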

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
