Hi Nikhil,

> I would like to know if pyarrow has support for writing parquet files with
> run-length encoding? There is mention of this in the Python Docs under the
> compression section.


The C++ API might not have enough validation around this option to be
properly exposed to the higher-level APIs. The Parquet spec also limits
where RLE can be used [1]:

> Note that the RLE encoding method is only supported for the following
> types of data:
>
> Repetition and definition levels
> Dictionary indices
> Boolean values in data pages, as an alternative to PLAIN encoding


IIRC, the way writing works in both pyarrow and C++ is that the writer
tries to dictionary-encode the values, with RLE applied to the dictionary
indices, until the dictionary grows too large. You can use pyarrow to
check which encodings were actually used for a column [2].

The Arrow specification recently adopted run-end encoding, which is very
similar to RLE [3]. If you don't want to transfer Parquet files, this
might be a good fit for your use case.
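
To illustrate the idea behind the run-end encoded layout, here is a
minimal pure-Python sketch. It is not the pyarrow API; the function names
and sample readings are hypothetical. Each run is stored as one value plus
the (exclusive) logical index at which that run ends:

```python
def run_end_encode(values):
    """Encode a sequence as (run_ends, run_values).

    run_ends[i] is the exclusive logical end index of the i-th run,
    so run ends are cumulative rather than per-run lengths.
    """
    run_ends, run_values = [], []
    for index, value in enumerate(values):
        if run_values and run_values[-1] == value:
            # Same value as the current run: extend that run's end.
            run_ends[-1] = index + 1
        else:
            # Value changed: start a new run.
            run_values.append(value)
            run_ends.append(index + 1)
    return run_ends, run_values


def run_end_decode(run_ends, run_values):
    """Expand (run_ends, run_values) back into the original list."""
    decoded = []
    start = 0
    for end, value in zip(run_ends, run_values):
        decoded.extend([value] * (end - start))
        start = end
    return decoded


# Hypothetical sensor readings that rarely change.
readings = [20.5, 20.5, 20.5, 21.0, 21.0, 20.5]
run_ends, run_values = run_end_encode(readings)
print(run_ends, run_values)  # [3, 5, 6] [20.5, 21.0, 20.5]
```

For data where sensors rarely change value, the two encoded lists stay
short no matter how long each run is.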

Thanks,
Micah

[1] https://parquet.apache.org/docs/file-format/data-pages/encodings/
[2] https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
[3] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout



On Wed, Jan 11, 2023 at 4:04 PM Nikhil Makan <[email protected]>
wrote:

> Hi Team,
>
> Question 1:
> I would like to know if pyarrow has support for writing parquet files with
> run-length encoding? There is mention of this in the Python Docs under the
> compression section.
>
> 'can be compressed after the encoding passes (dictionary, RLE encoding)'
>
> https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
>
> However I am not seeing the option in the API reference:
>
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
>
> I do note it's covered off in the C++ documentation, anyway we can access
> this in python?
> https://arrow.apache.org/docs/cpp/parquet.html
>
> Question 2:
> In addition to the above, I am interested to know if there are any methods
> to apply this type of encoding to data in transit over a network. Our
> actual use case has a large amount of data and would GREATLY benefit
> from run-length encoding due to the repetition (sensors not changing values
> that often). We are trying to send this data from a warehouse (the
> warehouse has not been selected as yet) to an application back end, which
> ultimately gets sent onto an application front end to visualise.
>
> Kind regards
> Nikhil Makan
>
