Maybe this is putting the cart too far in front of the horse, but I'd be
willing to implement an encoding like this to see if it is a better
alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary
fallback for byte arrays, at least for GPU decoding. We might want to
change the name since it wouldn't be used exclusively for random access
any longer. Maybe LENGTH_BYTE_ARRAY? Or PLAIN_BYTE_ARRAY?
I'll also raise my hand as interested in participating in all 5 of the
tasks outlined, as time permits.
Cheers,
Ed
On 5/28/24 11:05 PM, Micah Kornfield wrote:
BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively
Arrow's representation) as part of the footer improvements [1] to help allow for
O(1) access to particular column metadata, once a column is identified.
[1] https://github.com/apache/parquet-format/pull/250
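
For readers unfamiliar with the two layouts, here is a minimal Python sketch of why an offsets-based representation allows O(1) access while PLAIN does not. It is an illustration only: the function names are hypothetical, and the exact byte layout proposed in [1] may differ from the Arrow-style "offsets buffer plus data buffer" layout assumed here.

```python
import struct


def encode_plain(values):
    """Parquet PLAIN for BYTE_ARRAY: each value is a 4-byte
    little-endian length followed by the value's bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def get_plain(buf, i):
    """Reach value i by skipping the i values before it: O(i)."""
    pos = 0
    for _ in range(i):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4 : pos + 4 + n]


def encode_offsets(values):
    """Assumed Arrow-style layout: n + 1 int32 offsets, plus the
    concatenated value bytes in a separate buffer."""
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    offsets_buf = struct.pack(f"<{len(offsets)}i", *offsets)
    return offsets_buf, b"".join(values)


def get_offsets(offsets_buf, data, i):
    """Read offsets[i] and offsets[i + 1] directly: O(1)."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data[start:end]


values = [b"parquet", b"", b"arrow", b"gpu"]
plain_buf = encode_plain(values)
offsets_buf, data = encode_offsets(values)
print(get_plain(plain_buf, 2), get_offsets(offsets_buf, data, 2))
```

Because each lookup in the offsets layout depends only on two adjacent offsets, values can be located independently of one another, which is presumably what makes this kind of layout attractive for parallel decoders.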
On Mon, May 27, 2024 at 11:16 PM Micah Kornfield wrote:
As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
on improvements to encodings.
There are several areas to pursue here:
1. Curating a standard set of benchmarks and criteria for determining if
a new encoding is worth adding.
2. Developing new encodings.
3. Better implementations to select existing encodings.
4. Better support for encodings with point/indexed lookups.
5. Benchmarking frameworks that allow assessing the trade-offs of encodings
on storage systems with different latency/throughput.
Realistically, given my current commitments, I don't think I have
bandwidth to help with this track in the near term. If someone else would
like to help drive this and make concrete proposals in these areas it would
be greatly appreciated.
Thanks,
Micah
[1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
[2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit