Great. BTW, I removed the encoding I referenced above from the PR to avoid
putting too much into it at once.

I'm pasting the description below for posterity.

/** Encoding for variable-length binary data that allows random access of values.
    *
    * This encoding is designed for random access of BYTE_ARRAY values. It is
    * mostly useful for non-nullable BYTE_ARRAY columns, where determining the
    * exact offset of a value does not require parsing definition levels.
    *
    * The layout consists of the following elements:
    *   1.  byte_arrays: BYTE_ARRAY values laid out contiguously. The
    *       byte_arrays are immediately followed by the cumulative offsets.
    *   2.  offsets: A contiguous set of N-byte little-endian unsigned integers
    *       representing the end byte offset (exclusive) of each BYTE_ARRAY
    *       value from the beginning of the page. For simplicity of
    *       implementation, the element at index 0 is always zero.
    *   3.  The last byte indicates the number of bytes used for each offset
    *       (valid values are 1, 2, 3 and 4). Implementations SHOULD use the
    *       smallest byte width that meets the length requirements.
    *
    *   Note the order of lengths is reversed from DELTA_BINARY_PACKED so that
    *   byte array values can potentially be compressed incrementally in the
    *   case of Data Page V2 or other future data pages where values are
    *   compressed separately from nesting information.
    *
    *   The beginning offset of the offsets can be determined using the final
    *   offset element (its value equals the total length of byte_arrays).
    *
    *   An individual byte array element can be found at an index using the
    *   following pseudo-code (real implementations SHOULD do bounds checking):
    *
    *      return byte_arrays[offsets[index] : offsets[index+1]]
    *
    *
    * Example encoding of "f", "oo", "bar1" (square brackets delimit the
    * components listed):
    *   [foobar1][0,1,3,7][1]
    */
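
For anyone who wants to experiment, here is a rough Python sketch of the layout
described above. It is only my reading of the description, not code from the PR:
the function names `encode_page` and `get_value` are made up for illustration,
and I am assuming the offsets are unsigned and that the offsets section starts
right where byte_arrays ends.

```python
def encode_page(values):
    """Lay out values as [byte_arrays][offsets][offset width byte]."""
    data = b"".join(values)
    # Cumulative end offsets; index 0 is always zero.
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    # Smallest width (1 to 4 bytes) that can hold the final offset.
    width = max(1, (offsets[-1].bit_length() + 7) // 8)
    if width > 4:
        raise ValueError("page too large for 4-byte offsets")
    packed = b"".join(o.to_bytes(width, "little") for o in offsets)
    return data + packed + bytes([width])


def get_value(page, index):
    """Return the value at `index` without scanning earlier values."""
    width = page[-1]
    if width not in (1, 2, 3, 4):
        raise ValueError("invalid offset width")
    offsets_end = len(page) - 1
    # The final offset equals len(byte_arrays), i.e. where the offsets begin.
    offsets_start = int.from_bytes(page[offsets_end - width:offsets_end], "little")
    num_values = (offsets_end - offsets_start) // width - 1
    if not 0 <= index < num_values:
        raise IndexError("value index out of range")

    def read_offset(i):
        pos = offsets_start + i * width
        return int.from_bytes(page[pos:pos + width], "little")

    return page[read_offset(index):read_offset(index + 1)]


page = encode_page([b"f", b"oo", b"bar1"])
print(page)                # byte_arrays, then offsets 0,1,3,7, then width 1
print(get_value(page, 2))  # third value
```

A lookup reads only the trailing width byte, the final offset, and two adjacent
offsets, so the cost does not depend on the index.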

On Wed, May 29, 2024 at 10:57 PM Gang Wu <[email protected]> wrote:

> I'm interested in experimenting and implementing new encodings.
> Will follow up with concrete proposals or findings.
>
> Best,
> Gang
>
> On Thu, May 30, 2024 at 3:29 AM Ed Seidl <[email protected]> wrote:
>
> > Maybe this is putting the cart too far in front of the horse, but I'd be
> > willing to implement an encoding like this to see if it is a better
> > alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary
> > fallback for byte arrays, at least for GPU decoding. We might want to
> > change the name since it wouldn't be used exclusively for random access
> > any longer. Maybe LENGTH_BYTE_ARRAY? Or PLAIN_BYTE_ARRAY?
> >
> > I'll also raise my hand as interested in participating in all 5 of the
> > tasks outlined, as time permits.
> >
> > Cheers,
> > Ed
> >
> > On 5/28/24 11:05 PM, Micah Kornfield wrote:
> > > BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively
> > > Arrow's representation) as part of footer improvements [1] to help allow
> > > for O(1) access to particular column metadata, once a column is identified.
> > >
> > > [1] https://github.com/apache/parquet-format/pull/250
> > >
> > > On Mon, May 27, 2024 at 11:16 PM Micah Kornfield <[email protected]> wrote:
> > >
> > >> As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
> > >> on improvements to encodings.
> > >>
> > >> There are several areas to pursue here:
> > >> 1.  Curating a standard set of benchmarks and criteria for determining if
> > >> a new encoding is worth adding.
> > >> 2.  Developing new encodings
> > >> 3.  Better implementations to select existing encodings.
> > >> 4.  Better support for encodings with point/indexed lookups.
> > >> 5.  Benchmarking frameworks that allow assessing trade-off of encodings on
> > >> storage systems with different latency/throughput.
> > >>
> > >> Realistically, given my current commitments, I don't think I have
> > >> bandwidth to help with this track in the near term. If someone else would
> > >> like to help drive this and make concrete proposals in these areas it would
> > >> be greatly appreciated.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >>
> > >> [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > >> [2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > >>
> >
> >
>
