I believe byte_stream_split encoding is supported now in C++ (at least reading is; I would need to double-check on the writing).
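
For anyone unfamiliar with the encoding, here is a minimal pure-Python sketch of the byte layout as I understand it from the spec: byte k of every value goes into stream k, and the streams are concatenated. The helper names are my own (hypothetical, not part of the pyarrow or Parquet C++ API), and the sketch only handles little-endian float32 values. It just illustrates why a general-purpose compressor tends to do better afterwards: bytes that play the same role (sign/exponent vs. low mantissa bits) end up next to each other.

```python
import struct

WIDTH = 4  # bytes per float32 value (assumption: this sketch handles float32 only)


def byte_stream_split_encode(values):
    """Scatter the bytes of each float32 into WIDTH separate streams."""
    raw = struct.pack(f"<{len(values)}f", *values)
    # Stream k holds byte k of every value; the streams are then concatenated.
    return b"".join(raw[k::WIDTH] for k in range(WIDTH))


def byte_stream_split_decode(encoded):
    """Inverse of byte_stream_split_encode: re-interleave the streams."""
    n = len(encoded) // WIDTH
    streams = [encoded[k * n:(k + 1) * n] for k in range(WIDTH)]
    # Take one byte from each stream in turn to rebuild each value.
    raw = bytes(b for group in zip(*streams) for b in group)
    return list(struct.unpack(f"<{n}f", raw))


# Values chosen to be exactly representable in float32, so the round trip is exact.
values = [1.5, 1.25, -2.0, 100.0]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
```

The encoded buffer is the same size as the plain one; any saving comes from the compression codec applied afterwards, so how much it helps depends on the data.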

On Mon, Nov 16, 2020 at 3:05 PM Andrew Lamb <[email protected]> wrote:

> For what it is worth, when we were testing with timeseries data (that also
> has many sequential values that are very close in absolute value), the
> parquet BYTE_STREAM_SPLIT[1] encoding was also quite effective (20% better
> compression). However, this wasn't supported in C++ (and thus wasn't
> supported in Pandas) at that time.
>
> [1]
> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md#byte-stream-split-byte_stream_split--9
>
> On Mon, Nov 16, 2020 at 5:01 PM Jason Sachs <[email protected]> wrote:
>
>> ah .. got it.
>>
>> Thanks, I found
>> https://github.com/apache/parquet-format/blob/ee02ef8c8f33bd3d5ed0582ded7e20439e12d933/Encodings.md
>>
>> On 2020/11/16 20:33:38, Micah Kornfield <[email protected]> wrote:
>> > Delta encoding hasn't been implemented in the C++ code that pyarrow
>> > binds to. It is supported in the Parquet specification.
>> >
>> > On Mon, Nov 16, 2020 at 12:30 PM Jason Sachs <[email protected]> wrote:
>> >
>> > > Does Arrow / Parquet have any support for delta encoding?
>> > >
>> > > Some data series compress better when their differences are stored
>> > > rather than the values themselves.
>> > >
>> > > Here's an example where the differences are mostly equal to 7 but
>> > > occasionally more:
>> > >
>> > > import numpy as np
>> > > import pyarrow as pa
>> > > import pyarrow.parquet as pq
>> > >
>> > > N = 500000
>> > > delta_r = np.full(N,7)
>> > > np.random.seed(123)
>> > > for _ in range(10):
>> > >     delta_r[np.random.randint(N,size=N//100)] += 1
>> > > r = np.cumsum(delta_r)
>> > > drcheck = np.diff(r,prepend=0)
>> > > assert (delta_r == drcheck).all()
>> > >
>> > > a = pa.array(r)
>> > > adiff = pa.array(delta_r)
>> > > t = pa.Table.from_arrays([a],['r'])
>> > > tdiff = pa.Table.from_arrays([adiff],['delta_r'])
>> > > pq.write_table(t,'t.pq')
>> > > pq.write_table(tdiff,'tdiff.pq')
>> > >
>> > > =====
>> > >
>> > > and when I look at the resulting files:
>> > >
>> > > -rw-rw-rw- 1 user group 2591101 Nov 16 13:29 t.pq
>> > > -rw-rw-rw- 1 user group   81049 Nov 16 13:29 tdiff.pq
