Hi,

Adding my input here with a distribution maintainer hat on: We dislike stuff that is vendored and prefer that things can be built independently. If a dependency doesn't have releases or has an unstable API, however, vendoring is the better alternative.

Best
Uwe

On Thu, Dec 11, 2025, at 10:08 AM, Antoine Pitrou wrote:
> The point of vendoring is to make maintenance easier. If we then have to
> backport bugfixes on a regular basis, it will actually make maintenance
> harder.
>
> In any case, it is a bit premature to discuss implementation details of
> a feature that hasn't been formally decided on yet.
>
> Regards
>
> Antoine.
>
>
> On Dec 11, 2025 at 07:13, Arnav Balyan wrote:
>> Thanks everyone for the detailed input and the pointers to prior experience
>> on DuckDB, Arrow C++ and licensing.
>>
>> Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it
>> seems reasonable to expect that FSST will continue to evolve. That’s a
>> point in favor of keeping an upstream dependency rather than maintaining an
>> internal fork.
>>
>> On the other hand, as Raul highlighted, the amount of code we need in
>> Parquet is quite small, and the library does not currently ship official
>> releases. Vendoring would avoid adding a direct dependency, since we may
>> not have API guarantees due to the lack of a release process.
>>
>> It also looks like DuckDB was able to fix issues internally and upstream
>> them later (roughly 1 month lag between internal fix and upstream merge).
>> Vendoring may give a similar flexibility to apply fixes and contribute back
>> without waiting on upstream.
>>
>> Given these considerations, my initial thought was to:
>> 1. Vendor initially, to avoid direct dependency on FSST.
>> 2. Re-evaluate once the project becomes more active or starts shipping
>> releases, switching to a proper CMake dependency.
>>
>> Would love to discuss if there are other preferences.
>>
>> Thanks,
>> Arnav
>>
>> On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:
>>
>>> Touché
>>>
>>> One thing that may actually be relevant from the duckdb fork is to examine
>>> what they have changed:
>>> https://github.com/duckdb/duckdb/commits/main/third_party/fsst
>>>
>>> Specifically it looks like they have fixed bugs such as [1], and then
>>> contributed the fixes back upstream [2].
>>>
>>> Andrew
>>>
>>> [1]: https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
>>> [2]: https://github.com/cwida/fsst/pull/31
>>>
>>>
>>> On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Ok, but DuckDB vendors everything (*), so it's not very relevant as a
>>>> reference :-)
>>>>
>>>> My opinion on this is that vendoring should be limited to libraries that
>>>> are either very small or very stable. I haven't taken a look at FSST
>>>> yet, so I don't have a specific opinion about it, but being a very
>>>> recent compression/encoding algorithm my intuition is that it may
>>>> receive quite a bit of maintenance (bug fixes, improvements) in the
>>>> coming years.
>>>>
>>>> (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On Dec 10, 2025 at 14:52, Andrew Lamb wrote:
>>>>> FWIW vendoring is the approach that DuckDB seems to have taken with the
>>>>> fsst code[1]
>>>>>
>>>>> [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
>>>>>
>>>>> On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
>>>>>
>>>>>> Thanks Arnav for working on this!
>>>>>>
>>>>>> I've taken a look at the current draft PR on the Arrow repository [1].
>>>>>>
>>>>>> Given the small amount of code required to vendor it, I am +1 on
>>>>>> vendoring it. In general the vendored third party dependencies on
>>>>>> Arrow C++ require less maintenance.
>>>>>> In this case, where there are no
>>>>>> official releases of FSST and it's not distributed via other channels,
>>>>>> it probably is a similar effort.
>>>>>>
>>>>>> Regards,
>>>>>> Raúl
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/pull/48232
>>>>>>
>>>>>> On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong (<[email protected]>)
>>>>>> wrote:
>>>>>>>
>>>>>>> Hey Arnav,
>>>>>>>
>>>>>>> Thanks for raising this. Could you add a link to the brief initial
>>>>>>> discussion?
>>>>>>>
>>>>>>> If we don't need to make any modifications to the external source, I would
>>>>>>> prefer to pull it in, as that would likely be the easiest maintenance-wise.
>>>>>>> Upon reviewing the repository, it appears to be MIT-licensed
>>>>>>> <https://github.com/cwida/fsst/blob/master/LICENSE>. This is compatible
>>>>>>> with the ASF license <https://www.apache.org/legal/resolved.html#category-a>,
>>>>>>> so we can ship it as part of the Parquet project. We must ensure that we
>>>>>>> correctly mention the dependency in the license.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 10, 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi team,
>>>>>>>> We recently proposed support for FSST for Parquet. There are two main
>>>>>>>> options to take the FSST dependency:
>>>>>>>>
>>>>>>>> 1. CMake dependency on fsst GitHub:
>>>>>>>>
>>>>>>>> - Pull FSST as an external dependency via CMake
>>>>>>>> - Adds an external dependency to the build
>>>>>>>>
>>>>>>>> 2. Vendor the code:
>>>>>>>>
>>>>>>>> - Need to copy 3-4 required source files directly into the repo
>>>>>>>> - No external dependency
>>>>>>>>
>>>>>>>>
>>>>>>>> There was a brief initial discussion on the PR, and I just wanted to start a
>>>>>>>> thread to discuss further.
>>>>>>>> Overall this is a lightweight dependency, with a couple of commits upstream
>>>>>>>> every few months, so vendoring looks like a safe option. It may avoid an external
>>>>>>>> dependency while keeping maintenance overhead low. However, we may have to
>>>>>>>> pull in any major changes in the future.
>>>>>>>>
>>>>>>>> Would love to know what folks think. Are there any concerns with either
>>>>>>>> approach, or a preference on how we have handled similar situations in the
>>>>>>>> past?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and Regards,
>>>>>>>> Arnav
