Thanks everyone for the detailed input and the pointers to prior experience on DuckDB, Arrow C++ and licensing.

Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it seems reasonable to expect that FSST will continue to evolve. That’s a point in favor of keeping an upstream dependency rather than maintaining an internal fork.

On the other hand, as Raúl highlighted, the amount of code we need in Parquet is quite small, and the library does not currently ship official releases. Vendoring would avoid adding a direct dependency on a library that may not offer API guarantees, given the lack of a release process.

It also looks like DuckDB was able to fix issues internally and upstream them later (roughly 1 month lag between internal fix and upstream merge). Vendoring may give us similar flexibility to apply fixes and contribute back without waiting on upstream.

Given these considerations, my initial thought was to:

1. Vendor initially, to avoid a direct dependency on FSST.
2. Re-evaluate once the project becomes more active or starts shipping releases, potentially switching to a proper CMake dependency.

Would love to discuss if there are other preferences.

Thanks,
Arnav

On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:

> touche
>
> One thing that may actually be relevant from the duckdb fork is to examine
> what they have changed:
> https://github.com/duckdb/duckdb/commits/main/third_party/fsst
>
> Specifically it looks like they have fixed bugs such as [1], and then
> contributed the fixes back upstream[2].
>
> Andrew
>
> [1]:
> https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
> [2]: https://github.com/cwida/fsst/pull/31
>
> On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
>
> > Ok, but DuckDB vendors everything (*), so it's not very relevant as a
> > reference :-)
> >
> > My opinion on this is that vendoring should be limited to libraries that
> > are either very small or very stable.
> > I haven't taken a look at FSST
> > yet, so I don't have a specific opinion about it, but being a very
> > recent compression/encoding algorithm my intuition is that it may
> > receive quite a bit of maintenance (bug fixes, improvements) in the
> > coming years.
> >
> > (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
> >
> > Regards
> >
> > Antoine.
> >
> > On 10/12/2025 at 14:52, Andrew Lamb wrote:
> > > FWIW vendoring is the approach that DuckDB seems to have taken with the
> > > fsst code[1]
> > >
> > > [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
> > >
> > > On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
> > >
> > >> Thanks Arnav for working on this!
> > >>
> > >> I've taken a look at the current draft PR on the Arrow repository [1].
> > >>
> > >> Given the small amount of code required to vendor it, I am +1 on
> > >> vendoring it. In general the vendored third party dependencies on
> > >> Arrow C++ require less maintenance. In this case where there are no
> > >> official releases of FSST and it's not distributed via other channels
> > >> it probably is a similar effort.
> > >>
> > >> Regards,
> > >> Raúl
> > >>
> > >> [1] https://github.com/apache/arrow/pull/48232
> > >>
> > >> On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong (<[email protected]>)
> > >> wrote:
> > >>>
> > >>> Hey Arnav,
> > >>>
> > >>> Thanks for raising this. Could you add a link to the brief initial
> > >>> discussion?
> > >>>
> > >>> If we don't need to make any modifications to the external source, I would
> > >>> prefer to pull it in, as that would likely be the easiest maintenance-wise.
> > >>> Upon reviewing the repository, it appears to be MIT-licensed
> > >>> <https://github.com/cwida/fsst/blob/master/LICENSE>. This is compatible
> > >>> with the ASF license <https://www.apache.org/legal/resolved.html#category-a>,
> > >>> so we can ship it as part of the Parquet project.
> > >>> We must ensure that we
> > >>> correctly mention the dependency in the license.
> > >>>
> > >>> Kind regards,
> > >>> Fokko
> > >>>
> > >>> On Wed, Dec 10, 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
> > >>>
> > >>>> Hi team,
> > >>>> We recently proposed the support for FSST for Parquet. There are two main
> > >>>> options to take the FSST dependency:
> > >>>>
> > >>>> 1. CMake dependency on fsst GitHub:
> > >>>>
> > >>>> - Pull FSST as an external dependency via CMake
> > >>>> - Adds an external dependency to the build
> > >>>>
> > >>>> 2. Vendor the code:
> > >>>>
> > >>>> - Need to copy 3-4 required source files directly into the repo
> > >>>> - No external dependency
> > >>>>
> > >>>> There was a brief initial discussion on pr, and I just wanted to start a
> > >>>> thread to discuss further.
> > >>>> Overall this is a lightweight dependency, with a couple of commits upstream
> > >>>> every few months, vendoring looks like a safe option. It may avoid external
> > >>>> dependency while keeping low maintenance overhead. However, we may have to
> > >>>> pull any major changes in the future.
> > >>>>
> > >>>> Would love to know what folks think. Are there any concerns with either
> > >>>> approach, or a preference on how we have handled similar situations in the
> > >>>> past?
> > >>>>
> > >>>> Thanks and Regards,
> > >>>> Arnav
