[ https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1716.
-----------------------------------
    Fix Version/s: cpp-1.6.0
       Resolution: Fixed

Issue resolved by pull request 6005
[https://github.com/apache/arrow/pull/6005]

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> ------------------------------------------------
>
>                 Key: PARQUET-1716
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1716
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-cpp
>            Reporter: Martin Radev
>            Assignee: Martin Radev
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: cpp-1.6.0
>
>   Original Estimate: 72h
>          Time Spent: 14h
>  Remaining Estimate: 58h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*
> Apache Parquet does not have any encodings suited to floating-point data, and 
> general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation called "stream 
> splitting". One such transformation is "byte stream splitting", which creates 
> K streams of length N, where K is the number of bytes in the data type (4 for 
> floats, 8 for doubles) and N is the number of elements in the sequence; the 
> k-th stream holds the k-th byte of every value.
> On average, the transformed data compresses significantly better than the 
> original data, and in some cases compression and decompression speed improve 
> as well.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]
> *Apache Arrow can benefit from the reduced storage requirements for FP 
> Parquet column data and from the improved decompression speed.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
