Touché. One thing from the DuckDB fork that may actually be relevant is what they have changed: https://github.com/duckdb/duckdb/commits/main/third_party/fsst
Specifically, it looks like they have fixed bugs such as [1] and then
contributed the fixes back upstream [2].

Andrew

[1]: https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
[2]: https://github.com/cwida/fsst/pull/31

On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou wrote:
>
> Ok, but DuckDB vendors everything (*), so it's not very relevant as a
> reference :-)
>
> My opinion on this is that vendoring should be limited to libraries that
> are either very small or very stable. I haven't taken a look at FSST
> yet, so I don't have a specific opinion about it, but being a very
> recent compression/encoding algorithm my intuition is that it may
> receive quite a bit of maintenance (bug fixes, improvements) in the
> coming years.
>
> (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
>
> Regards
>
> Antoine.
>
>
> On 10/12/2025 at 14:52, Andrew Lamb wrote:
> > FWIW vendoring is the approach that DuckDB seems to have taken with the
> > fsst code[1]
> >
> > [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
> >
> > On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido wrote:
> >
> >> Thanks Arnav for working on this!
> >>
> >> I've taken a look at the current draft PR on the Arrow repository [1].
> >>
> >> Given the small amount of code required to vendor it, I am +1 on
> >> vendoring it. In general the vendored third party dependencies on
> >> Arrow C++ require less maintenance. In this case where there are no
> >> official releases of FSST and it's not distributed via other channels
> >> it probably is a similar effort.
> >>
> >> Regards,
> >> Raúl
> >>
> >> [1] https://github.com/apache/arrow/pull/48232
> >>
> >> On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong wrote:
> >>>
> >>> Hey Arnav,
> >>>
> >>> Thanks for raising this. Could you add a link to the brief initial
> >>> discussion?
> >>>
> >>> If we don't need to make any modifications to the external source, I
> >>> would prefer to pull it in, as that would likely be the easiest
> >>> maintenance-wise. Upon reviewing the repository, it appears to be
> >>> MIT-licensed <https://github.com/cwida/fsst/blob/master/LICENSE>.
> >>> This is compatible with the ASF license
> >>> <https://www.apache.org/legal/resolved.html#category-a>, so we can
> >>> ship it as part of the Parquet project. We must ensure that we
> >>> correctly mention the dependency in the license.
> >>>
> >>> Kind regards,
> >>> Fokko
> >>>
> >>>
> >>> On Wed, Dec 10, 2025 at 09:26, Arnav Balyan wrote:
> >>>
> >>>> Hi team,
> >>>> We recently proposed the support for FSST for Parquet. There are two
> >>>> main options to take the FSST dependency:
> >>>>
> >>>> 1. CMake dependency on fsst GitHub:
> >>>>
> >>>>    - Pull FSST as an external dependency via CMake
> >>>>    - Adds an external dependency to the build
> >>>>
> >>>> 2. Vendor the code:
> >>>>
> >>>>    - Need to copy 3-4 required source files directly into the repo
> >>>>    - No external dependency
> >>>>
> >>>>
> >>>> There was a brief initial discussion on pr, and I just wanted to
> >>>> start a thread to discuss further.
> >>>> Overall this is a lightweight dependency, with a couple of commits
> >>>> upstream every few months, vendoring looks like a safe option. It
> >>>> may avoid external dependency while keeping low maintenance
> >>>> overhead. However, we may have to pull any major changes in the
> >>>> future.
> >>>>
> >>>> Would love to know what folks think. Are there any concerns with
> >>>> either approach, or a preference on how we have handled similar
> >>>> situations in the past?
> >>>>
> >>>>
> >>>> Thanks and Regards,
> >>>> Arnav
