Hello people,

There has been discussion on the Apache Parquet mailing list about adding a 
new encoder for floating-point (FP) data.
The reason is that the compressors supported by Apache Parquet (zstd, gzip, 
etc.) do not compress raw FP data well.


My investigation shows that a very simple technique, named stream splitting, 
can improve the compression ratio, and even the speed, for some of the 
compressors.
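
As a rough illustration of the idea (a sketch, not the code from the patch): 
assuming stream splitting means that byte k of every value is written to 
stream k and the streams are then concatenated, it could look like the 
following. The names split_streams and merge_streams are hypothetical and 
chosen only for this example.

```python
import struct
import zlib


def split_streams(values, fmt="f"):
    """Scatter byte k of every value into stream k, then concatenate the streams.

    Assumption: little-endian IEEE 754 values; fmt is "f" (32-bit) or "d" (64-bit).
    """
    width = struct.calcsize("<" + fmt)
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # raw[k::width] picks byte k of each value, in value order.
    return b"".join(raw[k::width] for k in range(width))


def merge_streams(data, fmt="f"):
    """Inverse of split_streams: interleave the streams back into values."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        # Stream k occupies the k-th block of `count` bytes.
        raw[k::width] = data[k * count:(k + 1) * count]
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))


if __name__ == "__main__":
    values = [1.0, 2.0, 3.5]
    encoded = split_streams(values)
    # The split bytes are what gets handed to the compressor.
    compressed = zlib.compress(encoded)
    assert merge_streams(zlib.decompress(compressed)) == values
```

The transform itself does not shrink the data. It only groups bytes of the 
same significance together, so that the compressor which runs afterwards 
sees more regular input.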

You can read about the results here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view


I went through the developer guide for Apache Arrow and wrote a patch to add 
the new encoding and test coverage for it.

I will polish my patch and, in parallel, work on extending the Apache Parquet 
format with the new encoding.


If you have any concerns, please let me know.


Regards,

Martin
