JigaoLuo commented on PR #9826:
URL: https://github.com/apache/arrow-rs/pull/9826#issuecomment-4346531272

   Thanks for having me. This PR looks good to me.
   
   Outside this PR's scope, I am interested in the question of what a 
reasonable default value should be, and what value we should recommend to 
users. I may be too pedantic here, since I have a paper that touches on this 
threshold. Let me start a broader discussion: 
   - For an archival Parquet use case, `1.0` may be a reasonable threshold, 
since the file is not expected to be read frequently once written. 
   - For GPU-reading of Parquet, however, the preferred threshold may differ 
from the CPU-reading case, because decompression on GPU is much faster than 
on CPU. (Different CPUs also differ from one another, tbh.)
   - So the right threshold depends first on the use case of Parquet, and then 
on the trade-off between size reduction and decompression cost. That trade-off 
also varies across hardware platforms.
   - More generally, this threshold is not only a Parquet issue. It is a 
broader file format question: when is compression actually worth it? I mention 
this last point because criticism of Parquet compression does not stem from the 
Parquet specification itself, but from applying compression too blindly, 
without a suitable threshold such as the one this PR introduces :heart: .
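   To make the trade-off concrete, here is a minimal Rust sketch of the policy 
being discussed: keep the compressed bytes only when the size reduction meets 
the threshold, otherwise store the data uncompressed. The function and names 
are hypothetical illustrations, not the arrow-rs API; the toy compressor just 
stands in for a real codec.

   ```rust
   // Hypothetical sketch, not arrow-rs code: apply compression only when it
   // reduces the size by at least `threshold` (a fraction of the original
   // size, e.g. 0.1 for the 10% mentioned below).
   fn maybe_compress<F>(data: &[u8], threshold: f64, compress: F) -> (Vec<u8>, bool)
   where
       F: Fn(&[u8]) -> Vec<u8>,
   {
       if data.is_empty() {
           return (Vec::new(), false); // nothing to compress
       }
       let compressed = compress(data);
       // Fraction of the original size that compression saved.
       let reduction = 1.0 - compressed.len() as f64 / data.len() as f64;
       if reduction >= threshold {
           (compressed, true) // worth it: store compressed
       } else {
           (data.to_vec(), false) // not worth it: store uncompressed
       }
   }

   fn main() {
       // Toy stand-in "compressor" that halves highly repetitive input.
       let fake_compress = |d: &[u8]| d.iter().step_by(2).copied().collect::<Vec<u8>>();
       let data = vec![0u8; 100];

       // With a 10% threshold, the 50% reduction passes.
       let (out, used) = maybe_compress(&data, 0.10, &fake_compress);
       println!("threshold 0.10: used={} len={}", used, out.len());

       // With threshold 1.0, compression is effectively never applied,
       // since a 100% reduction is impossible for non-empty output.
       let (out2, used2) = maybe_compress(&data, 1.0, &fake_compress);
       println!("threshold 1.00: used={} len={}", used2, out2.len());
   }
   ```

   Under this sketch, the debate above is just the choice of `threshold`: 
`1.0` disables compression for the write-once archival case, while a small 
value accepts almost any reduction when decompression is cheap (e.g. on GPU).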
   
   One last thing: to share one review comment on my paper. My threshold is 
the size-reduction percentage, so it is a rather low value: 
   > ... recommendations look heuristic (the choice of 10% threshold to enable 
compression, why not 5% or 20%?)

