Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

Ed Seidl Wed, 29 May 2024 12:29:02 -0700

Maybe this is putting the cart too far in front of the horse, but I'd be willing to implement an encoding like this to see if it is a better alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary fallback for byte arrays, at least for GPU decoding. We might want to change the name since it wouldn't be used exclusively for random access any longer. Maybe LENGTH_BYTE_ARRAY? Or PLAIN_BYTE_ARRAY?

I'll also raise my hand as interested in participating in all 5 of the tasks outlined, as time permits.


Cheers,
Ed

On 5/28/24 11:05 PM, Micah Kornfield wrote:

BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively
Arrow's representation) as part of footer improvements [1] to help allow for
O(1) access to particular column metadata, once a column is identified.

[1] https://github.com/apache/parquet-format/pull/250
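
For readers unfamiliar with the two layouts being compared, here is a minimal sketch of why an Arrow-style representation gives O(1) access while PLAIN does not. It assumes the standard Parquet PLAIN layout for BYTE_ARRAY (a 4-byte little-endian length before each value) and an Arrow-like layout (an offsets list plus one contiguous data buffer). The function names are hypothetical, and this is not necessarily the exact layout proposed in the PR above.

```python
import struct


def encode_plain(values):
    """PLAIN-style layout: each value is preceded by its 4-byte LE length."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def get_plain(buf, i):
    """Fetch value i from a PLAIN-style buffer; must walk i length prefixes."""
    pos = 0
    for _ in range(i):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + n]


def encode_offsets(values):
    """Arrow-like layout: n + 1 offsets plus one contiguous data buffer."""
    offsets = [0]
    data = bytearray()
    for v in values:
        data += v
        offsets.append(len(data))
    return offsets, bytes(data)


def get_offsets(offsets, data, i):
    """Fetch value i with two offset reads and one slice; no scan needed."""
    return data[offsets[i] : offsets[i + 1]]


values = [b"parquet", b"", b"arrow"]
plain_buf = encode_plain(values)
offsets, data = encode_offsets(values)
```

In the offsets layout, value i occupies the bytes from offsets[i] to offsets[i + 1], so a lookup does not depend on the lengths of earlier values. That independence is also what lets many values be decoded in parallel.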

On Mon, May 27, 2024 at 11:16 PM Micah Kornfield <[email protected]>
wrote:

As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
on improvements to encodings.

There are several areas to pursue here:
1.  Curating a standard set of benchmarks and criteria for determining if
a new encoding is worth adding.
2.  Developing new encodings
3.  Better implementations to select existing encodings.
4.  Better support for encodings with point/indexed lookups.
5.  Benchmarking frameworks that allow assessing trade-off of encodings on
storage systems with different latency/throughput.

Realistically, given my current commitments, I don't think I have
bandwidth to help with this track in the near term. If someone else would
like to help drive this and make concrete proposals in these areas it would
be greatly appreciated.

Thanks,
Micah


[1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
[2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
