yadavay-amzn opened a new pull request, #3556: URL: https://github.com/apache/parquet-java/pull/3556
## Problem

`FallbackValuesWriter` calls `isCompressionSatisfying()` after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns: dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files. As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4 MB with the premature fallback vs 2.2 MB when dictionary encoding is preserved.

## Fix

Add a configurable property `ParquetProperties.isDictionaryEarlyCheckEnabled()` (default: `true` for backward compatibility) that controls whether the first-page compression check is performed in `FallbackValuesWriter.getBytes()`. When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (`shouldFallBack()`), not based on the first-page compression ratio.

## Changes

- `ParquetProperties`: added `dictionaryEarlyCheckEnabled` field, getter, and builder method
- `FallbackValuesWriter`: added overloaded `of()` factory and constructor accepting the flag; guarded the `isCompressionSatisfying` call
- `DefaultValuesWriterFactory`: passes the config through to `FallbackValuesWriter.of()`
- New test `TestFallbackValuesWriter`: verifies dictionary encoding is preserved when the check is disabled

## Testing

- New unit tests pass (2/2)
- Existing `parquet-column` tests unaffected (default `true` preserves existing behavior)

## Generative AI

Generated-by: Claude Opus 4.7

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
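For readers who want to see why the first-page check misfires on the column from #3479, here is a rough back-of-the-envelope sketch. It is not code from this PR or from parquet-java. It assumes the check amounts to "index bytes + dictionary bytes < raw bytes", that the values are sequential (`i % 32768`) so the first ~20k-row page contains only distinct values, and it ignores RLE/bit-packing headers, page headers, and compression. The class `DictionaryCheckSketch` and all of its methods are hypothetical names used only for illustration.

```java
// Hypothetical model of the first-page decision; not part of parquet-java.
final class DictionaryCheckSketch {

    // Bits needed to address a dictionary with `distinct` entries.
    static int bitWidth(long distinct) {
        return distinct <= 1 ? 0 : 64 - Long.numberOfLeadingZeros(distinct - 1);
    }

    // Size of the values when written plainly as int64.
    static long plainBytes(long values) {
        return values * Long.BYTES;
    }

    // Size of the dictionary page holding each distinct int64 once.
    static long dictionaryBytes(long distinct) {
        return distinct * Long.BYTES;
    }

    // Size of the bit-packed dictionary indices (headers ignored).
    static long indexBytes(long values, long distinct) {
        return (values * bitWidth(distinct) + 7) / 8;
    }

    // Assumed form of the compression check: dictionary encoding pays off
    // only if indices plus dictionary are smaller than the raw values.
    static boolean compressionSatisfying(long values, long distinct) {
        return indexBytes(values, distinct) + dictionaryBytes(distinct) < plainBytes(values);
    }

    // With the early check disabled, the first page never triggers a fallback.
    static boolean keepsDictionary(boolean earlyCheckEnabled, long pageValues, long pageDistinct) {
        return !earlyCheckEnabled || compressionSatisfying(pageValues, pageDistinct);
    }

    public static void main(String[] args) {
        long pageValues = 20_000;   // assumed rows in the first page
        long pageDistinct = 20_000; // all distinct under the sequential assumption

        // First page: 37,500 index bytes + 160,000 dictionary bytes vs 160,000 raw bytes.
        System.out.println("early check on:  keep dictionary = "
                + keepsDictionary(true, pageValues, pageDistinct));
        System.out.println("early check off: keep dictionary = "
                + keepsDictionary(false, pageValues, pageDistinct));

        // Whole column: 1M values, 32768 distinct.
        long total = 1_000_000;
        long distinct = 32_768;
        System.out.println("dictionary-encoded bytes = "
                + (indexBytes(total, distinct) + dictionaryBytes(distinct)));
        System.out.println("plain bytes = " + plainBytes(total));
    }
}
```

Under these assumptions the first page looks worse with a dictionary (197,500 bytes vs 160,000 raw), so the early check abandons it, while over the full column the dictionary form is about 2.1 MB against 8 MB plain. That is in the same range as the 2.2 MB and 8.4 MB file sizes reported above, though the real files also include overhead this sketch leaves out.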
