Hi, I just skimmed the FSST paper and GitHub repo. While the algorithm is relatively straightforward compared to general compression codecs like ZSTD, achieving high quality and interoperability for a custom implementation will take significant effort.
Considering this, vendoring the existing FSST code and improving it based on our needs seems like a practical approach. However, the current FSST code uses hand-written AVX512 instructions, which is inconsistent with Arrow C++'s use of xsimd. This inconsistency will need to be resolved, likely by modifying the vendored FSST code to align with xsimd.

Best,
Gang

On Thu, Dec 11, 2025 at 8:12 PM Uwe L. Korn <[email protected]> wrote:

> Hi,
>
> Adding my input here with a distribution maintainer hat on: We dislike
> stuff that is vendored and prefer that things can be built independently.
> If a dependency doesn't have releases or has an unstable API, however,
> vendoring is the better alternative.
>
> Best
> Uwe
>
> On Thu, Dec 11, 2025, at 10:08 AM, Antoine Pitrou wrote:
> > The point of vendoring is to make maintenance easier. If we then have to
> > backport bugfixes on a regular basis, it will actually make maintenance
> > harder.
> >
> > In any case, it is a bit premature to discuss implementation details of
> > a feature that hasn't been formally decided on yet.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 11/12/2025 at 07:13, Arnav Balyan wrote:
> >> Thanks everyone for the detailed input and the pointers to prior
> >> experience on DuckDB, Arrow C++ and licensing.
> >>
> >> Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it
> >> seems reasonable to expect that FSST will continue to evolve. That’s a
> >> point in favor of keeping an upstream dependency rather than
> >> maintaining an internal fork.
> >>
> >> On the other hand, as Raul highlighted, the amount of code we need in
> >> Parquet is quite small, and the library does not currently ship official
> >> releases. Vendoring would avoid adding a direct dependency, since we may
> >> not have API guarantees due to the lack of a release process.
> >>
> >> It also looks like DuckDB was able to fix issues internally and upstream
> >> them later (roughly 1 month lag between internal fix and upstream merge).
> >> Vendoring may give a similar flexibility to apply fixes and contribute
> >> back without waiting on upstream.
> >>
> >> Given these considerations, my initial thought was to:
> >> 1. Vendor initially, to avoid direct dependency on FSST.
> >> 2. And re-evaluate once the project becomes more active or starts
> >> shipping releases, switching to a proper CMake dependency.
> >>
> >> Would love to discuss if there are other preferences.
> >>
> >> Thanks,
> >> Arnav
> >>
> >> On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:
> >>
> >>> touche
> >>>
> >>> One thing that may actually be relevant from the duckdb fork is to
> >>> examine what they have changed:
> >>> https://github.com/duckdb/duckdb/commits/main/third_party/fsst
> >>>
> >>> Specifically it looks like they have fixed bugs such as [1], and then
> >>> contributed the fixes back upstream[2].
> >>>
> >>> Andrew
> >>>
> >>> [1]: https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
> >>> [2]: https://github.com/cwida/fsst/pull/31
> >>>
> >>> On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> >>>
> >>>> Ok, but DuckDB vendors everything (*), so it's not very relevant as a
> >>>> reference :-)
> >>>>
> >>>> My opinion on this is that vendoring should be limited to libraries
> >>>> that are either very small or very stable. I haven't taken a look at
> >>>> FSST yet, so I don't have a specific opinion about it, but being a very
> >>>> recent compression/encoding algorithm my intuition is that it may
> >>>> receive quite a bit of maintenance (bug fixes, improvements) in the
> >>>> coming years.
> >>>>
> >>>> (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>> On 10/12/2025 at 14:52, Andrew Lamb wrote:
> >>>>> FWIW vendoring is the approach that DuckDB seems to have taken with
> >>>>> the fsst code[1]
> >>>>>
> >>>>> [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
> >>>>>
> >>>>> On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks Arnav for working on this!
> >>>>>>
> >>>>>> I've taken a look at the current draft PR on the Arrow repository [1].
> >>>>>>
> >>>>>> Given the small amount of code required to vendor it, I am +1 on
> >>>>>> vendoring it. In general the vendored third party dependencies on
> >>>>>> Arrow C++ require less maintenance. In this case where there are no
> >>>>>> official releases of FSST and it's not distributed via other channels
> >>>>>> it probably is a similar effort.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raúl
> >>>>>>
> >>>>>> [1] https://github.com/apache/arrow/pull/48232
> >>>>>>
> >>>>>> On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hey Arnav,
> >>>>>>>
> >>>>>>> Thanks for raising this. Could you add a link to the brief initial
> >>>>>>> discussion?
> >>>>>>>
> >>>>>>> If we don't need to make any modifications to the external source, I
> >>>>>>> would prefer to pull it in, as that would likely be the easiest
> >>>>>>> maintenance-wise.
> >>>>>>> Upon reviewing the repository, it appears to be MIT-licensed
> >>>>>>> <https://github.com/cwida/fsst/blob/master/LICENSE>. This is
> >>>>>>> compatible with the ASF license
> >>>>>>> <https://www.apache.org/legal/resolved.html#category-a>,
> >>>>>>> so we can ship it as part of the Parquet project. We must ensure
> >>>>>>> that we correctly mention the dependency in the license.
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Fokko
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Dec 10, 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>> We recently proposed the support for FSST for Parquet. There are
> >>>>>>>> two main options to take the FSST dependency:
> >>>>>>>>
> >>>>>>>> 1. CMake dependency on fsst GitHub:
> >>>>>>>>
> >>>>>>>>    - Pull FSST as an external dependency via CMake
> >>>>>>>>    - Adds an external dependency to the build
> >>>>>>>>
> >>>>>>>> 2. Vendor the code:
> >>>>>>>>
> >>>>>>>>    - Need to copy 3-4 required source files directly into the repo
> >>>>>>>>    - No external dependency
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> There was a brief initial discussion on the PR, and I just wanted
> >>>>>>>> to start a thread to discuss further.
> >>>>>>>> Overall this is a lightweight dependency, with a couple of commits
> >>>>>>>> upstream every few months, so vendoring looks like a safe option.
> >>>>>>>> It may avoid external dependency while keeping low maintenance
> >>>>>>>> overhead. However, we may have to pull any major changes in the
> >>>>>>>> future.
> >>>>>>>>
> >>>>>>>> Would love to know what folks think. Are there any concerns with
> >>>>>>>> either approach, or a preference on how we have handled similar
> >>>>>>>> situations in the past?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks and Regards,
> >>>>>>>> Arnav
