Hey Arnav,

Thank you for working on this and collecting the data, which looks quite
exciting.

Kind regards,
Fokko

On Tue, 9 Dec 2025 at 19:48, Arnav Balyan <[email protected]> wrote:

> Thanks Micah!
> Agreed, thanks for the review!
> Since this is a large proposal, we should be able to land FSST before
> landing composite encoding. Delta + RLE would be a good initial milestone
> (without a dependency on FSST) and newer encodings can be added in the
> future.
> The proposed design for composite encodings makes it simple to add newer
> encodings once the right plumbing is baked in. With stage-level
> encoding/decoding, adding a new encoding is a matter of adding a few lines
> of code in the validator and wiring it up to the actual encoding
> implementation. Newer encodings will still provide logic for the
> non-composite version, and optional code to support use as a composite
> dependency in the composite encoder/decoder.
> Would love to discuss more in sync.
>
> Regards,
> Arnav
>
> On Tue, Dec 9, 2025 at 11:14 PM Micah Kornfield <[email protected]>
> wrote:
>
> > I think cascaded encodings would be a good idea in the long run. I worry
> > a little that there are dependencies on in-flight encoding proposals,
> > and it would be nice to focus on landing those before moving to
> > something more complex.
> >
> > On Mon, Dec 8, 2025 at 11:31 PM Arnav Balyan <[email protected]>
> > wrote:
> >
> > > Hi Antoine,
> > > Thanks for the review, I'll add this data shortly.
> > >
> > > > On Mon, Dec 8, 2025 at 4:18 PM Antoine Pitrou <[email protected]> wrote:
> > >
> > > >
> > > > Hello Arnav,
> > > >
> > > > Was any additional compression applied? I could not find any
> > > > information in the document.
> > > >
> > > > Ideally, for numerical columns I think the following configurations
> > > > should be compared:
> > > >
> > > > - PLAIN
> > > > - PLAIN + ZSTD
> > > > - BYTE_STREAM_SPLIT + ZSTD
> > > > - DELTA + RLE
> > > > - DELTA + ZSTD
> > > >
> > > > For strings you might want to compare the following:
> > > >
> > > > - PLAIN
> > > > - PLAIN + ZSTD
> > > > - DELTA_BYTE_ARRAY
> > > > - DELTA_BYTE_ARRAY + ZSTD
> > > > - DICT
> > > > - DICT + FSST
> > > > - DICT + ZSTD
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Mon, 8 Dec 2025 15:14:20 +0530
> > > > Arnav Balyan <[email protected]>
> > > > wrote:
> > > > > > Hi team, thanks to the very valuable reviews and feedback from
> > > > > > Julien, Micah, Andrew and others, the FSST proposal is in the PoC
> > > > > > stage and will be worked on in the coming weeks.
> > > > >
> > > > > > I just wanted to start a discussion on composite encodings for
> > > > > > Parquet and get the community's thoughts, feedback and suggestions
> > > > > > on nested encodings.
> > > > >
> > > > > > Nested/Composite/Hierarchical encodings are supported in Vortex,
> > > > > > Fastlanes, etc., and partly supported in Parquet (with Dict + RLE).
> > > > > > This proposal discusses formalizing the same and paving the way for
> > > > > > future encodings like Dict + FSST, Delta + RLE and others.
> > > > >
> > > > > > Several benchmarks were run on some well-recognized nested
> > > > > > encodings and show significant compression gains (on the order of
> > > > > > 10x improvements), which are further detailed in the doc.
> > > > >
> > > > > Would love to get your thoughts and feedback!
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1Yi5JwpKEsRFw7D8-iETguRDPtjlyiKITCguYUrrzEVY
> > > > >
> > > > > Regards,
> > > > >  - Arnav
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
