Great. BTW, I removed the encoding I referenced above from the PR to avoid
putting too much into it at once.
I'm pasting the description below for posterity.
/** Encoding for variable-length binary data that allows random access of
* values.
*
* This encoding is designed for random access of BYTE_ARRAY values. It is
* mostly useful for non-nullable BYTE_ARRAY columns, where determining the
* exact offset of a value does not require parsing definition levels.
*
* The layout consists of the following elements:
* 1. byte_arrays: BYTE_ARRAY values laid out contiguously. The values are
*    immediately followed by the cumulative offsets.
* 2. offsets: A contiguous set of N-byte little-endian unsigned integers
*    representing the end byte offset (exclusive) of a BYTE_ARRAY value from
*    the beginning of the page. For simplicity of implementation the offset
*    at index 0 is always zero.
* 3. The last byte indicates the number of bytes used for each offset
*    (valid values are 1, 2, 3 and 4). Implementations SHOULD try to use the
*    smallest byte width that meets the length requirements.
*
* Note the order of lengths is reversed from DELTA_BINARY_PACKED so that
* byte array values can potentially be compressed incrementally in the case
* of Data Page V2 or other future data pages where values are compressed
* separately from nesting information.
*
* The beginning offset of the offsets can be determined using the final
* offset element.
*
* An individual byte array element can be found at an index using the
* following pseudo-code (real implementations SHOULD do bounds checking):
*
*   return byte_arrays[offsets[index] : offsets[index+1]]
*
*
* Example encoding of "f", "oo", "bar1" (square brackets delimit the
* components listed):
*   [foobar1][0,1,3,7][1]
*/
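
In case it helps anyone who wants to experiment, the layout described above
can be sketched roughly as follows in Python. This is only an illustration
and not code from the PR: the function names encode_random_access and
get_value are hypothetical, and I'm assuming the offsets region holds one
more entry than the number of values (the leading zero), as in the example.

```python
def encode_random_access(values: list[bytes]) -> bytes:
    """Hypothetical encoder: [byte_arrays][offsets][offset width byte]."""
    offsets = [0]  # the offset at index 0 is always zero
    for value in values:
        offsets.append(offsets[-1] + len(value))
    # Smallest width (in bytes) able to hold the final, largest offset.
    width = max(1, (offsets[-1].bit_length() + 7) // 8)
    if width > 4:
        raise ValueError("values too large for 4-byte offsets")
    packed = b"".join(o.to_bytes(width, "little") for o in offsets)
    return b"".join(values) + packed + bytes([width])


def get_value(page: bytes, index: int) -> bytes:
    """Hypothetical O(1) lookup of the value at `index`, with bounds checks."""
    if not page:
        raise ValueError("empty page")
    width = page[-1]
    if not 1 <= width <= 4:
        raise ValueError("invalid offset width")
    offsets_end = len(page) - 1
    if offsets_end < width:
        raise ValueError("page too short")
    # The final offset equals the total length of byte_arrays, which is
    # also where the offsets region begins.
    offsets_start = int.from_bytes(page[offsets_end - width:offsets_end], "little")
    region = offsets_end - offsets_start
    if region < width or region % width != 0:
        raise ValueError("corrupt offsets region")
    count = region // width - 1  # one extra entry for the leading zero
    if not 0 <= index < count:
        raise IndexError("index out of range")

    def offset_at(i: int) -> int:
        pos = offsets_start + i * width
        return int.from_bytes(page[pos:pos + width], "little")

    start, end = offset_at(index), offset_at(index + 1)
    if not start <= end <= offsets_start:
        raise ValueError("corrupt offsets")
    return page[start:end]
```

With the example from the description, encode_random_access([b"f", b"oo",
b"bar1"]) should give the bytes of "foobar1" followed by the offsets 0, 1, 3,
7 and a trailing width byte of 1, and get_value on that page reads a single
value without scanning the others.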
On Wed, May 29, 2024 at 10:57 PM Gang Wu <[email protected]> wrote:
> I'm interested in experimenting and implementing new encodings.
> Will follow up with concrete proposals or findings.
>
> Best,
> Gang
>
> On Thu, May 30, 2024 at 3:29 AM Ed Seidl <[email protected]> wrote:
>
> > Maybe this is putting the cart too far in front of the horse, but I'd be
> > willing to implement an encoding like this to see if it is a better
> > alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary
> > fallback for byte arrays, at least for GPU decoding. We might want to
> > change the name since it wouldn't be used exclusively for random access
> > any longer. Maybe LENGTH_BYTE_ARRAY? Or PLAIN_BYTE_ARRAY?
> >
> > I'll also raise my hand as interested in participating in all 5 of the
> > tasks outlined, as time permits.
> >
> > Cheers,
> > Ed
> >
> > On 5/28/24 11:05 PM, Micah Kornfield wrote:
> > > BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively
> > > Arrow's representation) as part of footer improvements [1] to help allow
> > > for O(1) access to particular column metadata, once a column is identified.
> > >
> > > [1] https://github.com/apache/parquet-format/pull/250
> > >
> > > On Mon, May 27, 2024 at 11:16 PM Micah Kornfield <[email protected]> wrote:
> > >
> > >> As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
> > >> on improvements to encodings.
> > >>
> > >> There are several areas to pursue here:
> > >> 1. Curating a standard set of benchmarks and criteria for determining if
> > >> a new encoding is worth adding.
> > >> 2. Developing new encodings.
> > >> 3. Better implementations to select existing encodings.
> > >> 4. Better support for encodings with point/indexed lookups.
> > >> 5. Benchmarking frameworks that allow assessing trade-offs of encodings on
> > >> storage systems with different latency/throughput.
> > >>
> > >> Realistically, given my current commitments, I don't think I have
> > >> bandwidth to help with this track in the near term. If someone else would
> > >> like to help drive this and make concrete proposals in these areas it would
> > >> be greatly appreciated.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >> [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > >> [2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > >>
> >
> >
>