Wes McKinney resolved PARQUET-1716.
-----------------------------------
    Fix Version/s: cpp-1.6.0
       Resolution: Fixed

Issue resolved by pull request 6005
[https://github.com/apache/arrow/pull/6005]

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> ------------------------------------------------
>
>                 Key: PARQUET-1716
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1716
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>            Reporter: Martin Radev
>            Assignee: Martin Radev
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>   Original Estimate: 72h
>          Time Spent: 14h
>  Remaining Estimate: 58h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*
> Apache Parquet does not have any encodings suitable for FP data, and the available text compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream splitting". One such transformation is "byte stream splitting", which creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the original data, and in some cases there is also an improvement in compression and decompression speed.
> You can read a more detailed report here:
> [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced storage requirements for FP parquet column data and the improvements in decompression speed.*

--
This message was sent by Atlassian Jira
(v8.3.4#803005)