Hey Arnav,

Thank you for working on this and collecting the data, which looks quite exciting.

Kind regards,
Fokko

On Tue, 9 Dec 2025 at 19:48, Arnav Balyan <[email protected]> wrote:

> Thanks Micah!
> Agreed, thanks for the review!
> Since this is a large proposal, we should be able to land FSST before
> landing composite encoding. Delta + RLE would be a good initial
> milestone (without a dependency on FSST), and newer encodings can be
> added in the future.
> The proposed design for composite encodings makes it simple to add
> newer encodings once the right plumbing is baked in. With stage-level
> encoding/decoding, adding a new encoding is a matter of adding a few
> lines of code in the validator and wiring it up to the actual encoding
> implementation. Newer encodings will still provide logic for the
> non-composite version, and optional code to support use as a composite
> dependency in the composite encoder/decoder.
> Would love to discuss more in sync.
>
> Regards,
> Arnav
>
> On Tue, Dec 9, 2025 at 11:14 PM Micah Kornfield <[email protected]> wrote:
>
> > I think cascaded encodings would be a good idea in the long run. I
> > worry a little bit that there are dependencies on in-flight encoding
> > proposals, and it would be nice to focus on landing those before
> > moving to something more complex.
> >
> > On Mon, Dec 8, 2025 at 11:31 PM Arnav Balyan <[email protected]> wrote:
> >
> > > Hi Antoine,
> > > Thanks for the review, I'll add this data shortly.
> > >
> > > On Mon, Dec 8, 2025 at 4:18 PM Antoine Pitrou <[email protected]> wrote:
> > >
> > > > Hello Arnav,
> > > >
> > > > Was any additional compression applied? I could not find any
> > > > information in the document.
> > > >
> > > > Ideally, for numerical columns I think the following
> > > > configurations should be compared:
> > > >
> > > > - PLAIN
> > > > - PLAIN + ZSTD
> > > > - BYTE_STREAM_SPLIT + ZSTD
> > > > - DELTA + RLE
> > > > - DELTA + ZSTD
> > > >
> > > > For strings you might want to compare the following:
> > > >
> > > > - PLAIN
> > > > - PLAIN + ZSTD
> > > > - DELTA_BYTE_ARRAY
> > > > - DELTA_BYTE_ARRAY + ZSTD
> > > > - DICT
> > > > - DICT + FSST
> > > > - DICT + ZSTD
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On Mon, 8 Dec 2025 15:14:20 +0530 Arnav Balyan <[email protected]> wrote:
> > > >
> > > > > Hi team, thanks to the very valuable reviews and feedback from
> > > > > Julien, Micah, Andrew and others, the FSST proposal is in the
> > > > > PoC stage and will be worked on in the coming weeks.
> > > > >
> > > > > I just wanted to start a discussion on composite encodings for
> > > > > Parquet and get the community's thoughts, feedback and
> > > > > suggestions on nested encodings.
> > > > >
> > > > > Nested/composite/hierarchical encodings are supported in
> > > > > Vortex, Fastlanes, etc., and partly supported in Parquet (with
> > > > > Dict + RLE). This proposal discusses formalizing the same and
> > > > > paving the way for future encodings like Dict + FSST, Delta +
> > > > > RLE and others.
> > > > >
> > > > > Several benchmarks were run on some well-recognized nested
> > > > > encodings, and show significant compression gains (on the
> > > > > order of 10x improvements), which are further detailed in the
> > > > > doc.
> > > > >
> > > > > Would love to get your thoughts and feedback!
> > > > >
> > > > > https://docs.google.com/document/d/1Yi5JwpKEsRFw7D8-iETguRDPtjlyiKITCguYUrrzEVY
> > > > >
> > > > > Regards,
> > > > > - Arnav
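
For illustration, the Delta + RLE cascade discussed in this thread can be sketched as two stages applied in sequence. This is a toy sketch under stated assumptions, not the proposal's design: it works on plain Python lists and does not reproduce Parquet's actual DELTA_BINARY_PACKED or RLE/bit-packing hybrid layouts. All function names (delta_encode, rle_encode, composite_encode, and their decoders) are hypothetical and appear neither in the proposal nor in any Parquet implementation.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Stage 1: keep the first value, then store successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Stage 2: collapse consecutive equal values into (value, run_length)."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Invert rle_encode by expanding each run."""
    return [v for v, n in runs for _ in range(n)]


def composite_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Hypothetical cascade: the output of delta feeds RLE."""
    return rle_encode(delta_encode(values))


def composite_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Decode stages in reverse order: RLE first, then delta."""
    return delta_decode(rle_decode(runs))


# Evenly spaced values (e.g. timestamps at a fixed interval).
timestamps = list(range(1000, 2000, 10))
encoded = composite_encode(timestamps)
assert composite_decode(encoded) == timestamps
```

On evenly spaced input, the delta stage turns 100 distinct values into one start value followed by a constant difference, which the RLE stage then collapses into two runs. Neither stage alone achieves this on such data, which is the intuition behind cascading; real gains depend on the data and on the actual encodings benchmarked in the doc.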
