ppkarwasz commented on PR #701:
URL: https://github.com/apache/commons-compress/pull/701#issuecomment-3335182540
@garydgregory,
I had completely forgotten about this one, but it’s now in good shape. The
PR is complete: both **BZip2** and **DEFLATE64** now use the shared
`HuffmanDecoder`. The only adjustment needed was handling bit ordering within a
byte:
* **BZip2** reads bits starting from the most significant bit.
* **DEFLATE64** reads bits starting from the least significant bit.
The tricky part shows up when `BitInputStream` retrieves multiple bits at
once: regardless of whether it’s constructed with `ByteOrder.LITTLE_ENDIAN` or
`ByteOrder.BIG_ENDIAN`, it always delivers the bits in the order they appear in
the byte. For DEFLATE64, that means we need to swap the retrieved bits.
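As a rough illustration of that swap, here is a minimal sketch, assuming "swap" means reversing the order of the `count` bits that were just read. The class `BitOrderSketch` and the method `reverseBits` are hypothetical names for this example and are not taken from the PR; only `Integer.reverse` is an existing JDK method.

```java
/** Illustrative helper only; not part of Commons Compress. */
final class BitOrderSketch {

    private BitOrderSketch() {
        // static helper, no instances
    }

    /**
     * Reverses the order of the lowest {@code count} bits of {@code value}.
     * Any bits above position {@code count - 1} are discarded.
     */
    static int reverseBits(final int value, final int count) {
        if (count < 0 || count > Integer.SIZE) {
            throw new IllegalArgumentException("count must be in [0, 32]: " + count);
        }
        if (count == 0) {
            // Java masks int shift distances to 5 bits, so ">>> 32" would be a no-op.
            return 0;
        }
        // Mirror all 32 bits, then shift the mirrored group back down to bit 0.
        return Integer.reverse(value) >>> (Integer.SIZE - count);
    }
}
```

For instance, the three bits `110` become `011`, so a code assembled in one bit order can be looked up in a table built for the other.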
Looking more broadly, Commons Compress would probably benefit from having
two families of prefix-code decoders:
1. **Canonical Huffman decoder** (this PR)
   * Uses a very compact representation of canonical Huffman codes.
   * Needs only:
     * One table for the alphabet, ordered by increasing code length.
     * Two small tables (up to “max code length” in size) to store:
       * The last code for each length.
       * The bias between code values and their table index.
   * Supports large code lengths (up to 20 bits, required by BZip2).
2. **General binary tree decoder**
   (`o.a.c.compress.archivers.zip.BinaryTree`)
   * Stores a full binary tree in an array.
   * Can decode not only canonical Huffman codes but any prefix code,
     including the Shannon–Fano codes used by the **IMPLODE** ZIP method.
   * If we want to make this public, I’d prefer the refreshed implementation
     suggested by @fkjellberg in #690.
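
To make the table layout of the canonical decoder more concrete, here is a minimal sketch of the general technique. It is not the `HuffmanDecoder` from this PR: the class name `CanonicalHuffmanSketch`, its constructor, and the `IntSupplier` bit source are assumptions made for the example, and the bits are assumed to arrive one at a time with the first bit of each code first.

```java
import java.util.function.IntSupplier;

/** Illustrative canonical Huffman decoder; not the implementation from the PR. */
final class CanonicalHuffmanSketch {

    private final int[] symbols;  // alphabet ordered by increasing code length
    private final int[] lastCode; // lastCode[len]: largest code of that length (first - 1 if none)
    private final int[] bias;     // bias[len]: first code of that length minus its index in symbols
    private final int maxLength;

    /** @param codeLengths code length per symbol; 0 means the symbol is unused. */
    CanonicalHuffmanSketch(final int[] codeLengths) {
        int max = 0;
        for (final int len : codeLengths) {
            max = Math.max(max, len);
        }
        maxLength = max;
        final int[] count = new int[max + 1];
        int used = 0;
        for (final int len : codeLengths) {
            if (len > 0) {
                count[len]++;
                used++;
            }
        }
        symbols = new int[used];
        lastCode = new int[max + 1];
        bias = new int[max + 1];
        final int[] next = new int[max + 1]; // next free slot in symbols, per length
        int code = 0;
        int index = 0;
        for (int len = 1; len <= max; len++) {
            code <<= 1;               // first canonical code of this length
            next[len] = index;
            bias[len] = code - index;
            code += count[len];
            index += count[len];
            lastCode[len] = code - 1; // below the first code when count[len] == 0
        }
        for (int sym = 0; sym < codeLengths.length; sym++) {
            final int len = codeLengths[sym];
            if (len > 0) {
                symbols[next[len]++] = sym;
            }
        }
    }

    /** Decodes one symbol, pulling one bit (0 or 1) per call from {@code nextBit}. */
    int decode(final IntSupplier nextBit) {
        int code = 0;
        for (int len = 1; len <= maxLength; len++) {
            code = (code << 1) | (nextBit.getAsInt() & 1);
            if (code <= lastCode[len]) {
                return symbols[code - bias[len]];
            }
        }
        throw new IllegalStateException("Bit sequence is not a valid code");
    }
}
```

With code lengths `{2, 1, 3, 3}` for symbols 0 to 3, the canonical codes are `10`, `0`, `110`, and `111`. The decoder stores only the four-entry alphabet table plus two arrays of size "max code length" + 1, which is why long codes such as BZip2's stay cheap.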
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]