ppkarwasz opened a new pull request, #699:
URL: https://github.com/apache/commons-compress/pull/699

   This PR improves `BZip2CompressorInputStream` by enforcing the mandated 
Huffman code-length bound and right-sizing related tables. It aligns behavior 
with the reference C implementation’s intent while avoiding its historical 
over-allocation.
   
   ## Why
   
   The reference C code deliberately **over-sizes** its Huffman structures as a 
defensive measure: it defines `BZ_MAX_CODE_LEN = 23` and sizes length-indexed 
tables by `BZ_MAX_ALPHA_SIZE` rather than by the code-length bound, which leaves 
a safety margin. However, the bzip2 bitstream only permits code lengths ≤ 20 
(and the 1.0.6 compressor in practice emits none longer than 17), so carrying 
those defensive constants forward in our code obscures the true limit and keeps 
arrays larger than necessary.
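
To make the over-allocation concrete, here is a rough sizing sketch. It assumes 
the reference values `BZ_N_GROUPS = 6` and `BZ_MAX_ALPHA_SIZE = 258`, and it 
assumes a right-sized row is indexed `0..20`; the class and method names are 
hypothetical, and the exact array shapes chosen in this PR may differ.

```java
// Hypothetical sketch, not code from this PR: compares the number of int
// entries in one length-indexed table under the two sizing strategies.
final class HuffmanTableSizing {
    static final int N_GROUPS = 6;              // BZ_N_GROUPS in the reference C code
    static final int MAX_ALPHA_SIZE = 258;      // BZ_MAX_ALPHA_SIZE in the reference C code
    static final int FORMAT_MAX_CODE_LEN = 20;  // longest code length the bitstream permits

    /** Entries in a table with N_GROUPS rows of the given row length. */
    static int entries(int rowLength) {
        return N_GROUPS * rowLength;
    }

    public static void main(String[] args) {
        // Rows sized by alphabet size, as in the defensive C layout.
        int overSized = entries(MAX_ALPHA_SIZE);
        // Rows sized by code length, assuming indices 0..20 (an assumption).
        int rightSized = entries(FORMAT_MAX_CODE_LEN + 1);
        System.out.println("over-sized entries:  " + overSized);
        System.out.println("right-sized entries: " + rightSized);
    }
}
```

Under these assumptions, each length-indexed table shrinks from 1548 to 126 
entries.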
   
   ## What’s changed
   
   * **Set the true limit**: `MAX_CODE_LEN` now equals **20** (the format 
maximum).
   
   * **Validate inputs**: code lengths are checked explicitly; any value 
**outside `[1, 20]`** triggers a clear exception early in decoding.
   
   * **Right-size arrays**: Huffman tables and auxiliary arrays are sized to 
the minimum required for the 20-bit limit, reducing footprint and clarifying 
invariants.
   
   * **Tests**: add a unit test that exercises the maximum alphabet size across 
boundary code lengths.
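
The validation and right-sizing described in the bullets above could look 
roughly like the following. This is a minimal sketch, not the code in this PR: 
the class name, method names, exception type, and message are all assumptions.

```java
import java.io.IOException;

// Hypothetical sketch of an explicit code-length check plus a table sized
// to the 20-bit limit. None of these names come from the PR itself.
final class CodeLengthCheck {
    static final int MAX_CODE_LEN = 20; // format maximum, per the PR description

    /** Rejects any length outside [1, MAX_CODE_LEN]; returns {min, max}. */
    static int[] checkLengths(byte[] lengths, int alphaSize) throws IOException {
        if (alphaSize < 1 || alphaSize > lengths.length) {
            throw new IllegalArgumentException("alphaSize out of range: " + alphaSize);
        }
        int min = MAX_CODE_LEN;
        int max = 1;
        for (int i = 0; i < alphaSize; i++) {
            int len = lengths[i]; // negative bytes are rejected by the check below
            if (len < 1 || len > MAX_CODE_LEN) {
                throw new IOException(
                        "Invalid Huffman code length " + len + " for symbol " + i);
            }
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        return new int[] {min, max};
    }

    /** Counts symbols per code length in an array indexed 0..MAX_CODE_LEN. */
    static int[] countByLength(byte[] lengths, int alphaSize) throws IOException {
        checkLengths(lengths, alphaSize); // validate before indexing by length
        int[] count = new int[MAX_CODE_LEN + 1];
        for (int i = 0; i < alphaSize; i++) {
            count[lengths[i]]++;
        }
        return count;
    }
}
```

Because every length is validated before it is used as an index, the counting 
array needs only `MAX_CODE_LEN + 1` slots rather than one per alphabet symbol.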
   
   ## Maintenance & performance
   
   * Clearer invariants (the limit you see is the limit you enforce).
   * Slightly lower memory usage due to smaller tables.
   * Easier reasoning about bounds and indexing during review and future 
changes.
   
   ## References
   
   * [bzip2/huffman.c](https://github.com/libarchive/bzip2/blob/master/huffman.c)
   * [bzip2/bzlib_private.h](https://github.com/libarchive/bzip2/blob/master/bzlib_private.h)
    
   The reference implementation uses over-sized constants for safety, but its 
decoder rejects any code length outside 1 to 20, so 20 bits is the effective 
maximum.
   

