yadavay-amzn opened a new pull request, #3556:
URL: https://github.com/apache/parquet-java/pull/3556

   ## Problem
   
   `FallbackValuesWriter` calls `isCompressionSatisfying()` after the first 
page to decide whether dictionary encoding is worthwhile. With modern 
page-index defaults (~20k rows per page), this check fires too early for 
moderate-cardinality columns: dictionary encoding is abandoned before enough 
data has accumulated to show its benefit, and the resulting files are 
significantly larger.
   
   As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4MB 
with the premature fallback vs 2.2MB when dictionary encoding is preserved.
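
   The reported sizes can be roughly reproduced with back-of-the-envelope 
arithmetic. The sketch below is only an estimate under stated assumptions: 
"1M" is read as 2^20 values, page headers and compression are ignored, 
indices are bit-packed at the minimum width, and the first ~20k-row page is 
assumed to contain only distinct values (for example, sequential values taken 
mod 32768). The class and method names are illustrative and are not part of 
parquet-java.

```java
public class DictionarySizeEstimate {

    // Bits needed to address `distinct` dictionary entries (indices 0..distinct-1).
    static int bitWidth(int distinct) {
        return 32 - Integer.numberOfLeadingZeros(distinct - 1);
    }

    // PLAIN encoding of int64: 8 bytes per value.
    static long plainBytes(long values) {
        return values * Long.BYTES;
    }

    // Dictionary encoding: one dictionary of distinct int64 values
    // plus one bit-packed index per value (rounded up to whole bytes).
    static long dictionaryBytes(long values, int distinct) {
        long dictionaryPage = (long) distinct * Long.BYTES;
        long indices = (values * bitWidth(distinct) + 7) / 8;
        return dictionaryPage + indices;
    }

    public static void main(String[] args) {
        long values = 1L << 20;   // assumption: "1M" means 2^20
        int distinct = 32768;

        // Whole column.
        System.out.println(plainBytes(values));                 // 8388608  (~8.4 MB)
        System.out.println(dictionaryBytes(values, distinct));  // 2228224  (~2.2 MB)

        // First page only, assuming 20,000 rows that are all distinct.
        long firstPageRaw = plainBytes(20_000);                  // 160000
        long firstPageDict = dictionaryBytes(20_000, 20_000);    // 197500
        // The dictionary looks worse than PLAIN on the first page alone.
        System.out.println(firstPageDict < firstPageRaw);        // false
    }
}
```

   Under these assumptions the first page alone makes dictionary encoding look 
like a loss, even though over the whole column it is roughly 4x smaller.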
   
   ## Fix
   
   Add a configurable property 
`ParquetProperties.isDictionaryEarlyCheckEnabled()` (default: `true` for 
backward compatibility) that controls whether the first-page compression check 
is performed in `FallbackValuesWriter.getBytes()`.
   
   When disabled, dictionary encoding is only abandoned when the dictionary 
itself exceeds size limits (`shouldFallBack()`), not based on the first-page 
compression ratio.
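
   The decision described above can be modeled as a small, self-contained 
sketch. This is not the code in the PR: `keepDictionary` and its parameters 
are hypothetical names, and the comparison `encodedSize + dictionarySize < 
rawSize` is an assumption about what the first-page check amounts to. It only 
illustrates how the flag changes the outcome.

```java
public class EarlyCheckSketch {

    /**
     * Hypothetical model of the choice made when the first page is flushed.
     *
     * @param earlyCheckEnabled  stands in for isDictionaryEarlyCheckEnabled()
     * @param dictionaryTooLarge stands in for the size-limit condition (shouldFallBack())
     */
    static boolean keepDictionary(boolean earlyCheckEnabled,
                                  boolean dictionaryTooLarge,
                                  long rawSize,
                                  long encodedSize,
                                  long dictionarySize) {
        if (dictionaryTooLarge) {
            return false; // the size-limit fallback applies regardless of the flag
        }
        if (!earlyCheckEnabled) {
            return true;  // skip the first-page compression check entirely
        }
        // Assumed form of the first-page check: the dictionary plus the
        // encoded indices must be smaller than the raw data.
        return encodedSize + dictionarySize < rawSize;
    }

    public static void main(String[] args) {
        // Example first page: 160000 raw bytes, 37500 bytes of indices,
        // 160000 bytes of dictionary (illustrative numbers only).
        System.out.println(keepDictionary(true, false, 160_000, 37_500, 160_000));  // false
        System.out.println(keepDictionary(false, false, 160_000, 37_500, 160_000)); // true
        System.out.println(keepDictionary(false, true, 160_000, 37_500, 160_000));  // false
    }
}
```

   With the check enabled (the default), the example page triggers a fallback; 
with it disabled, the dictionary is kept unless the size limit is hit.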
   
   ## Changes
   
   - `ParquetProperties`: added `dictionaryEarlyCheckEnabled` field, getter, 
and builder method
   - `FallbackValuesWriter`: added overloaded `of()` factory and constructor 
accepting the flag; guarded the `isCompressionSatisfying` call
   - `DefaultValuesWriterFactory`: passes the config through to 
`FallbackValuesWriter.of()`
   - New test `TestFallbackValuesWriter`: verifies dictionary encoding is 
preserved when the check is disabled
   
   ## Testing
   
   - New unit tests pass (2/2)
   - Existing `parquet-column` tests unaffected (default `true` preserves 
existing behavior)
   
   ## Generative AI
   
   Generated-by: Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

