Thanks Micah,

This was really helpful. I think I will give Arrow files a go and see
how that works.

Kind regards
Nikhil Makan

On Mon, Feb 13, 2023 at 8:45 PM Micah Kornfield <[email protected]>
wrote:

> Hi Nikhil,
>
>> I would like to know if pyarrow has support for writing parquet files
>> with run-length encoding? There is mention of this in the Python Docs under
>> the compression section.
>
>
> The C++ API might not have enough validation around it to be properly
> exposed to high-level APIs.  The Parquet spec clarifies this further [1]:
>
>> Note that the RLE encoding method is only supported for the following
>> types of data:
>
>
>> Repetition and definition levels
>> Dictionary indices
>> Boolean values in data pages, as an alternative to PLAIN encoding
>
>
> IIRC, the way writing works in pyarrow and the C++ library is that they
> will try to dictionary-encode values, with RLE-encoded dictionary indices,
> until the dictionary grows too large. You can use pyarrow to verify which
> encodings were used for a column [2].
>
> The Arrow specification recently adopted run-end encoding, which is very
> similar to RLE [3]. If you don't need to transfer Parquet files, this
> might be a good fit for your use case.
>
> Thanks,
> Micah
>
> [1] https://parquet.apache.org/docs/file-format/data-pages/encodings/
> [2]
> https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
> [3]
> https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
>
>
>
> On Wed, Jan 11, 2023 at 4:04 PM Nikhil Makan <[email protected]>
> wrote:
>
>> Hi Team,
>>
>> Question 1:
>> I would like to know if pyarrow has support for writing parquet files
>> with run-length encoding? There is mention of this in the Python Docs under
>> the compression section.
>>
>> 'can be compressed after the encoding passes (dictionary, RLE encoding)'
>>
>> https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
>>
>> However I am not seeing the option in the API reference:
>>
>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
>>
>> I do note it's covered in the C++ documentation; is there any way we can
>> access this in Python?
>> https://arrow.apache.org/docs/cpp/parquet.html
>>
>> Question 2:
>> In addition to the above, I am interested to know whether there are any
>> methods to apply this type of encoding to data in transit over a network.
>> Our actual use case involves a large amount of data and would GREATLY
>> benefit from run-length encoding due to the repetition (sensors not
>> changing values that often). We are trying to send this data from a
>> warehouse (the warehouse has not been selected as yet) to an application
>> back end, from which it is ultimately sent on to an application front end
>> to visualise.
>>
>> Kind regards
>> Nikhil Makan
>>
>
