[ https://issues.apache.org/jira/browse/HADOOP-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Broberg updated HADOOP-7909:
--------------------------------
    Summary: Implement a generic splittable signature-based compression format  (was: Implement Splittable Gzip based on a signature in a gzip header field)

Renaming to reflect abandonment of gzip compatibility as a goal.

> Implement a generic splittable signature-based compression format
> -----------------------------------------------------------------
>
>                 Key: HADOOP-7909
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7909
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: Tim Broberg
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> I propose to take the suggestion of PIG-42 and extend it to
> - add a more robust header such that false matches are vanishingly unlikely
> - repeat initial bytes of the header for very fast split searching
> - break down the stream into modest size chunks (~64k?) for rapid parallel
> encode and decode
> - provide length information on the blocks in advance to make block decode
> possible in hardware
> An optional extra header would be added to the gzip header, adding 36 bytes.
> <sh> := <version><signature><uncompressedDataLength><compressedRecordLength>
> <version> := 1 byte version field allowing us to later adjust the header
> definition
> <signature> := 23 byte signature of the form aaaaaaabcdefghijklmnopr where
> each letter represents a randomly generated byte
> <uncompressedDataLength> := 32-bit length of the data compressed into this
> record
> <compressedRecordLength> := 32-bit length of this record as compressed,
> including all headers, trailers
> If multiple extra headers are present and the split header is not the first
> header, the initial implementation will not recognize the split.
> Input streams would be broken down into blocks which are appended, much as
> BlockCompressorStream does.
> Non-split-aware decoders will ignore this header
> and decode the appended blocks without ever noticing the difference.
> The signature has >= 132 bits of entropy which is sufficient for 80+ years of
> Moore's law before collisions become a significant concern.
> The first byte of the signature is repeated 7 times for speed. When
> splitting, the signature search will look for the 32-bit value aaaa every
> 4 bytes until a hit is found, then the next 4 bytes identify the alignment
> of the header mod 4 to identify a potential header match, then the whole
> header is validated at that offset. So, there is a load, compare, branch,
> and increment per 4 bytes searched.
> The existing gzip implementations do not provide access to the optional
> header fields (nor comment nor filename), so the entire gzip header will have
> to be reimplemented and compression will need to be done using the raw
> deflate options of the native library / built in deflater.
> There will be some degradation when using splittable gzip:
> - The gzip headers will each be 36 bytes larger. (4 byte header for the
> extra field, 32 byte extra header)
> - There will be one gzip header per block.
> - History will have to be reset with each block to allow starting from
> scratch at that offset, resulting in some uncompressed bytes that would
> otherwise have been strings.
> Issues to consider:
> - Is the searching fast enough without the repeating 7 bytes in the
> signature?
> - Should this be a patch to the existing gzip classes to add a switch, or
> should this be a whole new class?
> - Which level does this belong at? CompressionStream? Compressor?
> - Is it more advantageous to encode the signature into the less dense
> comment field?
> - Optimum block size? Smaller splits faster and may conserve memory; larger
> provides a slightly better compression ratio.
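
The header layout and split search described in the proposal can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the proposed Hadoop implementation: the signature bytes, the little-endian byte order, and the names pack_header, parse_header, and find_split are all made up for the example. It models only the 32-byte split header, not the surrounding gzip framing or the 4-byte extra-field header.

```python
import struct

VERSION = 1
# Hypothetical signature: one byte repeated 7 times, then 16 further bytes.
# A real format would fix these 23 bytes to randomly generated values.
SIGNATURE = bytes([0xA7] * 7) + bytes(range(0x10, 0x20))
# <version><signature><uncompressedDataLength><compressedRecordLength>
# Little-endian is an assumption; the proposal does not state a byte order.
HEADER = struct.Struct("<B23sII")  # 1 + 23 + 4 + 4 = 32 bytes


def pack_header(uncompressed_len, compressed_len):
    """Build one 32-byte split header."""
    return HEADER.pack(VERSION, SIGNATURE, uncompressed_len, compressed_len)


def parse_header(buf, offset):
    """Validate a full header at `offset`; return the two lengths or None."""
    if offset < 0 or offset + HEADER.size > len(buf):
        return None
    version, sig, ulen, clen = HEADER.unpack_from(buf, offset)
    if version != VERSION or sig != SIGNATURE:
        return None
    return ulen, clen


def find_split(buf, start=0):
    """Return the offset of the first valid header at or after `start`, or -1."""
    word = SIGNATURE[:4]          # the repeated "aaaa" word
    rep = SIGNATURE[:1]           # the repeated byte "a"
    p = (start + 3) & ~3          # round up to a 4-byte boundary
    while p + 8 <= len(buf):
        if buf[p:p + 4] == word:
            nxt = buf[p + 4:p + 8]
            # Leading repeated bytes in the next word give the alignment mod 4.
            k = len(nxt) - len(nxt.lstrip(rep))
            h = p - (3 - k) - 1   # signature start, minus the version byte
            if h >= start and parse_header(buf, h) is not None:
                return h
        p += 4
    return -1


if __name__ == "__main__":
    data = b"x" * 13 + pack_header(100, 50) + b"y" * 18 + pack_header(7, 9)
    first = find_split(data)
    print(first, parse_header(data, first))
    print(find_split(data, first + 1))
```

Exactly one 4-byte-aligned word falls entirely inside the 7 repeated bytes, so the scan inspects each aligned word once, and the word after a hit shows how far back the header starts. A false hit in the data costs only one failed header validation.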