[jira] [Created] (ARROW-7951) Expose BYTE_STREAM_SPLIT to pyarrow

2020-02-26 Thread Martin Radev (Jira)
Martin Radev created ARROW-7951:
---

 Summary: Expose BYTE_STREAM_SPLIT to pyarrow
 Key: ARROW-7951
 URL: https://issues.apache.org/jira/browse/ARROW-7951
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Martin Radev
Assignee: Martin Radev


The Parquet writer now supports the option of selecting the BYTE_STREAM_SPLIT 
encoding. It would be nice to have it exposed in pyarrow.
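As a rough sketch, the exposure could look like an extra keyword on the Parquet writer, similar to other encoding knobs. The parameter name `use_byte_stream_split` below is an assumption about the eventual API, not a committed interface at the time of this issue:

```python
# Hypothetical sketch: how BYTE_STREAM_SPLIT might be exposed in pyarrow.
# The keyword `use_byte_stream_split` is assumed, not the committed API.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([1.0, 2.5, -3.25], type=pa.float32())})

pq.write_table(
    table,
    "data.parquet",
    compression="zstd",
    use_dictionary=False,          # byte-stream-split replaces dictionary encoding
    use_byte_stream_split=["x"],   # apply the encoding to float column "x"
)
```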



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6216) Allow user to select the ZSTD compression level

2019-08-12 Thread Martin Radev (JIRA)
Martin Radev created ARROW-6216:
---

 Summary: Allow user to select the ZSTD compression level
 Key: ARROW-6216
 URL: https://issues.apache.org/jira/browse/ARROW-6216
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Martin Radev


The ZSTD compression level used in Arrow is fixed at 1, the minimum level the 
compressor supports. This yields very high compression speed at the expense of 
compression ratio.

The user should be allowed to select the compression level, since the best 
speed/ratio trade-off is data specific.

The proposed solution is to expose the knob via an environment variable such as 
ARROW_ZSTD_COMPRESSION_LEVEL.
Example:
export ARROW_ZSTD_COMPRESSION_LEVEL=10
./my_parquet_app
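The proposed knob could be read with a fallback to the current default, sketched here in Python (illustrative only; the actual implementation would live in Arrow's C++ codec layer, and the function name is made up):

```python
import os

def zstd_compression_level(default=1):
    """Read the proposed ARROW_ZSTD_COMPRESSION_LEVEL environment
    variable, falling back to the current default of 1 when the
    variable is unset or not a valid integer."""
    raw = os.environ.get("ARROW_ZSTD_COMPRESSION_LEVEL")
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```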





[jira] [Created] (ARROW-5913) Add support for Parquet's BYTE_STREAM_SPLIT encoding

2019-07-11 Thread Martin Radev (JIRA)
Martin Radev created ARROW-5913:
---

 Summary: Add support for Parquet's BYTE_STREAM_SPLIT encoding
 Key: ARROW-5913
 URL: https://issues.apache.org/jira/browse/ARROW-5913
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Martin Radev


*From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*

Apache Parquet does not have any encodings suitable for FP data, and the 
available general-purpose compressors (zstd, gzip, etc.) do not handle FP data 
very well.

It is possible to apply a simple data transformation named "stream splitting". 
One such transformation is "byte stream splitting", which creates K streams of 
length N, where K is the number of bytes in the data type (4 for floats, 8 for 
doubles) and N is the number of elements in the sequence.
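The K-streams transformation described above can be sketched in a few lines of Python (illustrative only, not the Parquet implementation; function names are made up):

```python
import struct

def byte_stream_split(values):
    """Encode a list of float32 values with byte-stream-split:
    byte i of every value goes into stream i, and the K streams
    (K = 4 for float32) are concatenated."""
    raw = struct.pack(f"<{len(values)}f", *values)
    k = 4  # bytes per float32
    return b"".join(raw[i::k] for i in range(k))

def byte_stream_unsplit(encoded, n):
    """Invert byte_stream_split for n float32 values."""
    k = 4
    streams = [encoded[i * n:(i + 1) * n] for i in range(k)]
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))
```

Grouping the bytes by significance this way places the (often similar) exponent bytes next to each other, which is what makes the output friendlier to a downstream compressor.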

The transformed data compresses significantly better on average than the 
original data and for some cases there is a performance improvement in 
compression and decompression speed.

You can read a more detailed report here:
[https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view]

*Apache Arrow can benefit from the reduced storage requirements for FP Parquet 
column data and from improved decompression speed.*


