Piotr Karwasz created COMPRESS-726:
--------------------------------------
Summary: Build decompression-bomb protection into
CompressorInputStream
Key: COMPRESS-726
URL: https://issues.apache.org/jira/browse/COMPRESS-726
Project: Commons Compress
Issue Type: Improvement
Reporter: Piotr Karwasz
Every concrete {{CompressorInputStream}} already tracks both the compressed
bytes it has consumed and the uncompressed bytes it has emitted, through
{{InputStreamStatistics}}. The per-format differences in how the compressed
count is sourced (raw-input counter vs. internal bit/byte reader) amount only
to a bounded read-ahead skew (at most one decompressor buffer), which is
irrelevant for decompression-bomb detection.
We could therefore rely on {{InputStreamStatistics}} to introduce
decompression-bomb protection into each {{CompressorInputStream}}. Concretely:
# Make {{CompressorInputStream}} implement {{InputStreamStatistics}}, so the
compressed/uncompressed counts are part of the base type for every format.
# Let users configure decompression-bomb protection parameters on the streams
the builders / {{CompressorStreamFactory}} return, enforced centrally in the
base class.
h2. Background: InputStreamStatistics is sufficient
{{getUncompressedCount()}} is uniformly exact: it is the base
{{CompressorInputStream.getBytesRead()}} counter, fed by {{count(...)}} on
every {{read}}, so it equals the number of bytes actually handed to the caller.
{{getCompressedCount()}} is a live counter in every implementation. Some source
it from a {{BoundedInputStream}} wrapping the raw input (so it includes a fixed
amount of decompressor read-ahead), others from the format's own bit/byte
reader (read-ahead corrected). Because the read-ahead is bounded by a constant
buffer size, the ratio {{uncompressedCount / compressedCount}} converges to the
true ratio as the stream grows, and the absolute output count is always exact.
Both protection strategies below are robust to this skew once a small grace
period has elapsed.
h2. Proposed change 1: CompressorInputStream implements InputStreamStatistics
The base class already defines {{getUncompressedCount()}}. Add
{{getCompressedCount()}} and declare {{implements InputStreamStatistics}}.
All concrete subclasses except {{Pack200CompressorInputStream}} already
override {{getCompressedCount()}}, so the base declaration should provide a
default (return {{-1}} meaning "unknown") rather than be {{abstract}}, to
preserve source and binary compatibility for external subclasses. Subclasses
that have a real count keep overriding it.
h2. Proposed change 2: configurable bomb protection on the returned streams
Expose two independent, composable limits, settable through the existing
builders and through {{CompressorStreamFactory}} (which already carries
{{memoryLimitInKb}}, so this is a natural companion):
* *Absolute output cap*: throw once {{getUncompressedCount()}} exceeds a
configured maximum. Exact and format independent.
* *Mean ratio guard*: once output passes a configurable grace threshold, throw
if {{getUncompressedCount() / getCompressedCount()}} exceeds a configured
maximum ratio. The grace period tolerates legitimately high local ratios at the
start (headers, dictionaries, long runs) while still enforcing a sane mean
ratio.
h2. Gap: Pack200CompressorInputStream
{{Pack200CompressorInputStream}} is the one exception. It overrides
{{read(...)}} to delegate straight to an internal bridge stream, never calls
{{count(...)}} (its Javadoc states {{getBytesRead()}} always returns 0), and
does not implement {{InputStreamStatistics}}. To bring it under the same
protection it must route output through {{count(...)}} and provide a real
{{getCompressedCount()}} (counting bytes pulled from the original input).
Otherwise it should be documented as unprotected, with {{getCompressedCount()}}
returning {{-1}} and the ratio guard skipped (the absolute cap also cannot fire
while output is uncounted).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)