etseidl opened a new issue, #3479:
URL: https://github.com/apache/parquet-java/issues/3479

   ### Describe the enhancement requested
   
   An issue recently raised in arrow-rs 
(https://github.com/apache/arrow-rs/pull/9700) brought to my attention 
the existence of `isCompressionSatisfying` in the `RequiresFallback` interface. 
In short, after accumulating a page's worth of data, `isCompressionSatisfying` is 
called to check whether dictionary encoding is actually compressing the data, 
and if it is not, the writer immediately switches to the fallback encoder. As 
far as I could determine, this behavior was introduced very early on, before 
the advent of the page indexes, when (IIRC) pages would have been 
significantly larger. With page indexes, however, this function is now called 
after only 20000 rows have been processed. A column with moderate cardinality 
might not yet have produced enough repeated values for this function to 
conclude that dictionary encoding is worth keeping.
   
   For example, a dataframe with an int64 column consisting of one million 
values taken modulo 32768 ends up abandoning dictionary encoding completely 
and produces a column chunk of 8.4MB. If the page row count is raised to 
128k, dictionary encoding is used throughout and the resulting column 
chunk is only 2.2MB.
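   
   To make the arithmetic behind that example concrete, here is a rough, self-contained sketch of the first-page decision. It is not the parquet-java implementation: `DictionaryFallbackSketch` and `keepsDictionary` are hypothetical names, the column is assumed to hold the sequential values `i % 32768`, and the check is assumed to be "bit-packed indices plus dictionary bytes must be smaller than the plain-encoded bytes", ignoring RLE headers and general-purpose compression.

```java
import java.util.HashSet;
import java.util.Set;

public class DictionaryFallbackSketch {

    /**
     * Hypothetical stand-in for the first-page check (an assumption, not the
     * parquet-java API): keep the dictionary only if the bit-packed indices
     * plus the dictionary itself are smaller than the plain-encoded page.
     */
    static boolean keepsDictionary(int rowsInFirstPage, int modulus) {
        // Collect the distinct int64 values seen in the first page.
        Set<Long> dictionary = new HashSet<>();
        for (long i = 0; i < rowsInFirstPage; i++) {
            dictionary.add(i % modulus);
        }

        // Plain encoding stores 8 bytes per int64 value.
        long rawSize = 8L * rowsInFirstPage;
        // The dictionary page stores each distinct value once.
        long dictionaryByteSize = 8L * dictionary.size();
        // Bits needed to address the largest dictionary index.
        int bitWidth = Math.max(1, 64 - Long.numberOfLeadingZeros(dictionary.size() - 1));
        // Size of the bit-packed indices, rounded up to whole bytes.
        long encodedSize = ((long) rowsInFirstPage * bitWidth + 7) / 8;

        return encodedSize + dictionaryByteSize < rawSize;
    }

    public static void main(String[] args) {
        // 20000 rows: every value is still distinct, so 37500 + 160000 >= 160000.
        System.out.println(keepsDictionary(20_000, 32_768));    // prints false
        // 128k rows: values repeat four times, so 245760 + 262144 < 1048576.
        System.out.println(keepsDictionary(128 * 1024, 32_768)); // prints true
    }
}
```

   Under these assumptions the first 20000 rows contain no repeated values at all, so the dictionary can only add overhead and the check fails, even though the full column has just 32768 distinct values.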
   
   Sadly, this check does not appear to be configurable, so short of 
increasing the page row count, there is no way to change its behavior.
   
   I can see the need for this type of heuristic, but I think it needs to be 
revisited, because the current defaults yield far too few samples to 
determine whether dictionary encoding is beneficial. If collecting 
more samples before falling back is not practical, there should at least be a 
configuration setting to disable this check.
   
   ### Component(s)
   
   Core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

