Maybe this is putting the cart too far in front of the horse, but I'd be willing to implement an encoding like this to see if it is a better alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary fallback for byte arrays, at least for GPU decoding. We might want to change the name, since it would no longer be used exclusively for random access. Maybe LENGTH_BYTE_ARRAY? Or PLAIN_BYTE_ARRAY?
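
To make the layout difference concrete, here is a minimal Python sketch. It compares PLAIN for BYTE_ARRAY (each value prefixed by a 4-byte little-endian length) with an Arrow-style offsets-plus-data layout. The function names are hypothetical, and the offsets layout is only an assumption about what the proposed encoding would look like; the actual wire format is whatever the PR ends up specifying.

```python
import struct


def encode_plain(values):
    """Parquet PLAIN for BYTE_ARRAY: 4-byte little-endian length, then the bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def get_plain(buf, i):
    """Reaching value i requires walking the i length prefixes before it."""
    pos = 0
    for _ in range(i):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + n]


def encode_offsets(values):
    """Arrow-style layout (assumed here): n + 1 offsets plus one data buffer."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return offsets, b"".join(values)


def get_offsets(offsets, data, i):
    """Value i is a direct slice; no earlier value has to be inspected."""
    return data[offsets[i] : offsets[i + 1]]


values = [b"parquet", b"", b"arrow", b"gpu"]
plain = encode_plain(values)
offsets, data = encode_offsets(values)
```

With PLAIN, the start of value i depends on every length before it, so a decoder has to scan serially (or run a prefix sum over lengths it must first extract). With explicit offsets, each value can be located independently, which is the property that seems attractive for GPU decoding.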

I'll also raise my hand as interested in participating in all 5 of the tasks outlined, as time permits.

Cheers,
Ed

On 5/28/24 11:05 PM, Micah Kornfield wrote:
BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively
Arrow's representation) as part of the footer improvements [1] to help allow
for O(1) access to a particular column's metadata once the column is identified.

[1] https://github.com/apache/parquet-format/pull/250

On Mon, May 27, 2024 at 11:16 PM Micah Kornfield wrote:

As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
on improvements to encodings.

There are several areas to pursue here:
1.  Curating a standard set of benchmarks and criteria for determining if
a new encoding is worth adding.
2.  Developing new encodings
3.  Better implementations to select existing encodings.
4.  Better support for encodings with point/indexed lookups.
5.  Benchmarking frameworks that allow assessing the trade-offs of encodings on
storage systems with different latency/throughput.

Realistically, given my current commitments, I don't think I have
bandwidth to help with this track in the near term. If someone else would
like to help drive this and make concrete proposals in these areas it would
be greatly appreciated.

Thanks,
Micah


[1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
[2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit

