ppkarwasz opened a new pull request, #699: URL: https://github.com/apache/commons-compress/pull/699
This PR improves `BZip2CompressorInputStream` by enforcing the mandated Huffman code-length bound and right-sizing related tables. It aligns behavior with the reference C implementation’s intent while avoiding its historical over-allocation.

## Why

The reference C code deliberately **over-sizes** its Huffman structures as a defensive programming technique: e.g., it defines `BZ_MAX_CODE_LEN = 23` and sizes length-indexed tables by `BZ_MAX_ALPHA_SIZE` rather than the code-length bound to provide safety margins.

However, the bzip2 bitstream only permits code lengths ≤ 20 (and the 1.0.6 implementation effectively tightens this to 17), so carrying those defensive constants forward in our code obscures the true limit and keeps arrays larger than necessary.

## What’s changed

* **Set the true limit**: `MAX_CODE_LEN` now equals **20** (the format maximum).
* **Validate inputs**: code lengths are checked explicitly; any value **outside `[1, 20]`** triggers a clear exception early in decoding.
* **Right-size arrays**: Huffman tables and auxiliary arrays are sized to the minimum required for the 20-bit limit, reducing footprint and clarifying invariants.
* **Tests**: add a unit test that exercises the maximum alphabet size across boundary code lengths.

## Maintenance & performance

* Clearer invariants (the limit you see is the limit you enforce).
* Slightly lower memory usage due to smaller tables.
* Easier reasoning about bounds and indexing during review and future changes.

## References

* [bzip2/huffman.c](https://github.com/libarchive/bzip2/blob/master/huffman.c)
* [bzip2/bzlib_private.h](https://github.com/libarchive/bzip2/blob/master/bzlib_private.h)

The reference implementation uses over-sized constants for safety but enforces an effective 20-bit maximum.
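For illustration, the bound check and the right-sized, length-indexed table described under "What’s changed" could look roughly like the sketch below. It is not the code from this PR: the class name `CodeLengthCheck`, both method names, the `char[]` representation of code lengths, and the choice of `IOException` are assumptions made for the example.

```java
import java.io.IOException;

// Hypothetical sketch; names and exception type are assumptions, not the PR's code.
final class CodeLengthCheck {

    // Format maximum for a bzip2 Huffman code length, as stated in the PR description.
    static final int MAX_CODE_LEN = 20;

    // Rejects any code length outside [1, MAX_CODE_LEN] before tables are built.
    static void checkCodeLengths(final char[] lengths, final int alphaSize) throws IOException {
        for (int i = 0; i < alphaSize; i++) {
            final int len = lengths[i];
            if (len < 1 || len > MAX_CODE_LEN) {
                throw new IOException("Invalid Huffman code length " + len
                        + " for symbol " + i + " (expected 1.." + MAX_CODE_LEN + ")");
            }
        }
    }

    // Counts symbols per code length. Because lengths were validated, an array
    // indexed 0..MAX_CODE_LEN is sufficient; no alphabet-sized table is needed.
    static int[] countByLength(final char[] lengths, final int alphaSize) {
        final int[] count = new int[MAX_CODE_LEN + 1];
        for (int i = 0; i < alphaSize; i++) {
            count[lengths[i]]++;
        }
        return count;
    }
}
```

Calling `checkCodeLengths` first is what makes the `MAX_CODE_LEN + 1` sizing in `countByLength` safe: an out-of-range length fails with a descriptive exception instead of an `ArrayIndexOutOfBoundsException` later in decoding.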
