Thanks Micah, this was really helpful. I think I will give Arrow files a go and see how that works.
Kind regards
Nikhil Makan

On Mon, Feb 13, 2023 at 8:45 PM Micah Kornfield <[email protected]> wrote:

> Hi Nikhil,
>
>> I would like to know if pyarrow has support for writing parquet files
>> with run-length encoding? There is mention of this in the Python Docs
>> under the compression section.
>
> The C++ API might not have enough validation around it to be properly
> exposed to high-level APIs. The parquet spec clarifies this further [1]:
>
>> Note that the RLE encoding method is only supported for the following
>> types of data:
>>
>> - Repetition and definition levels
>> - Dictionary indices
>> - Boolean values in data pages, as an alternative to PLAIN encoding
>
> IIRC, the way writing works for pyarrow and C++ is that they will try to
> dictionary-encode values and use RLE until the dictionary grows too large.
> You can use pyarrow to see what encodings were used for a column [2].
>
> The Arrow specification recently adopted run-end encoding, which is very
> similar to RLE encoding [3]. If you don't want to transfer parquet files,
> this might be a good fit for your use case.
>
> Thanks,
> Micah
>
> [1] https://parquet.apache.org/docs/file-format/data-pages/encodings/
> [2] https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
> [3] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
>
> On Wed, Jan 11, 2023 at 4:04 PM Nikhil Makan <[email protected]> wrote:
>
>> Hi Team,
>>
>> Question 1:
>> I would like to know if pyarrow has support for writing parquet files
>> with run-length encoding? There is mention of this in the Python Docs
>> under the compression section.
>>
>> 'can be compressed after the encoding passes (dictionary, RLE encoding)'
>> https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
>>
>> However, I am not seeing the option in the API reference:
>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
>>
>> I do note it's covered in the C++ documentation. Is there any way we can
>> access this in Python?
>> https://arrow.apache.org/docs/cpp/parquet.html
>>
>> Question 2:
>> In addition to the above, I am interested to know if there are any
>> methods to apply this type of encoding to data in transit over a network.
>> Our actual use case has a large amount of data and would GREATLY benefit
>> from run-length encoding due to the repetition (sensors not changing
>> values that often). We are trying to send this data from a warehouse (the
>> warehouse has not been selected as yet) to an application back end, which
>> ultimately gets sent on to an application front end to visualise.
>>
>> Kind regards
>> Nikhil Makan
