Hi,

Adding my input here with a distribution maintainer hat on: We dislike stuff 
that is vendored and prefer that things can be built independently. If a 
dependency doesn't have releases or has an unstable API, however, vendoring is 
the better alternative.

Best
Uwe

On Thu, Dec 11, 2025, at 10:08 AM, Antoine Pitrou wrote:
> The point of vendoring is to make maintenance easier. If we then have to 
> backport bugfixes on a regular basis, it will actually make maintenance 
> harder.
>
> In any case, it is a bit premature to discuss implementation details of 
> a feature that hasn't been formally decided on yet.
>
> Regards
>
> Antoine.
>
>
> On 11 Dec 2025 at 07:13, Arnav Balyan wrote:
>> Thanks everyone for the detailed input and the pointers to prior experience
>> on DuckDB, Arrow C++ and licensing.
>> 
>> Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it
>> seems reasonable to expect that FSST will continue to evolve. That’s a
>> point in favor of keeping an upstream dependency rather than maintaining an
>> internal fork.
>> 
>> On the other hand, as Raul highlighted, the amount of code we need in
>> Parquet is quite small, and the library does not currently ship official
>> releases. Vendoring would avoid adding a direct dependency on a library
>> that may not offer API guarantees, given the lack of a release process.
>> 
>> It also looks like DuckDB was able to fix issues internally and upstream
>> them later (roughly 1 month lag between internal fix and upstream merge).
>> Vendoring may give a similar flexibility to apply fixes and contribute back
>> without waiting on upstream.
>> 
>> Given these considerations, my initial thought was to:
>> 1. Vendor initially, to avoid a direct dependency on FSST.
>> 2. Re-evaluate once the project becomes more active or starts shipping
>> releases, possibly switching to a proper CMake dependency.
>> 
>> Would love to discuss if there are other preferences.
>> 
>> Thanks,
>> Arnav
>> 
>> On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:
>> 
>>> Touché
>>>
>>> One thing that may actually be relevant from the DuckDB fork is to examine
>>> what they have changed:
>>> https://github.com/duckdb/duckdb/commits/main/third_party/fsst
>>>
>>> Specifically it looks like they have fixed bugs such as [1], and then
>>> contributed the fixes back upstream[2].
>>>
>>> Andrew
>>>
>>>
>>>
>>> [1]: https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
>>> [2]: https://github.com/cwida/fsst/pull/31
>>>
>>>
>>> On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Ok, but DuckDB vendors everything (*), so it's not very relevant as a
>>>> reference :-)
>>>>
>>>> My opinion on this is that vendoring should be limited to libraries that
>>>> are either very small or very stable. I haven't taken a look at FSST
>>>> yet, so I don't have a specific opinion about it, but being a very
>>>> recent compression/encoding algorithm my intuition is that it may
>>>> receive quite a bit of maintenance (bug fixes, improvements) in the
>>>> coming years.
>>>>
>>>> (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>
>>>> On 10 Dec 2025 at 14:52, Andrew Lamb wrote:
>>>>> FWIW vendoring is the approach that DuckDB seems to have taken with the
>>>>> fsst code[1]
>>>>>
>>>>> [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
>>>>>
>>>>> On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
>>>>>
>>>>>> Thanks Arnav for working on this!
>>>>>>
>>>>>> I've taken a look at the current draft PR on the Arrow repository [1].
>>>>>>
>>>>>> Given the small amount of code required to vendor it, I am +1 on
>>>>>> vendoring it. In general the vendored third party dependencies on
>>>>>> Arrow C++ require less maintenance. In this case where there are no
>>>>>> official releases of FSST and it's not distributed via other channels
>>>>>> it probably is a similar effort.
>>>>>>
>>>>>> Regards,
>>>>>> Raúl
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/pull/48232
>>>>>>
>>>>>> On Wed, 10 Dec 2025 at 14:14, Fokko Driesprong (<[email protected]>) wrote:
>>>>>>>
>>>>>>> Hey Arnav,
>>>>>>>
>>>>>>> Thanks for raising this. Could you add a link to the brief initial
>>>>>>> discussion?
>>>>>>>
>>>>>>> If we don't need to make any modifications to the external source, I would
>>>>>>> prefer to pull it in, as that would likely be the easiest maintenance-wise.
>>>>>>> Upon reviewing the repository, it appears to be MIT-licensed
>>>>>>> <https://github.com/cwida/fsst/blob/master/LICENSE>. This is compatible
>>>>>>> with the ASF license <https://www.apache.org/legal/resolved.html#category-a>,
>>>>>>> so we can ship it as part of the Parquet project. We must ensure that we
>>>>>>> correctly mention the dependency in the license.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 10 Dec 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi team,
>>>>>>>> We recently proposed FSST support for Parquet. There are two main
>>>>>>>> options for taking on the FSST dependency:
>>>>>>>>
>>>>>>>> 1. CMake dependency on fsst GitHub:
>>>>>>>>
>>>>>>>>      - Pull FSST as an external dependency via CMake
>>>>>>>>      - Adds an external dependency to the build
>>>>>>>>
>>>>>>>> 2. Vendor the code:
>>>>>>>>
>>>>>>>>      - Need to copy 3-4 required source files directly into the repo
>>>>>>>>      - No external dependency
>>>>>>>>
>>>>>>>>
>>>>>>>> There was a brief initial discussion on the PR, and I just wanted to start a
>>>>>>>> thread to discuss further.
>>>>>>>> Overall this is a lightweight dependency, with a couple of commits upstream
>>>>>>>> every few months, so vendoring looks like a safe option. It may avoid an
>>>>>>>> external dependency while keeping maintenance overhead low. However, we may
>>>>>>>> have to pull in any major changes in the future.
>>>>>>>>
>>>>>>>> Would love to know what folks think. Are there any concerns with either
>>>>>>>> approach, or a preference on how we have handled similar situations in the
>>>>>>>> past?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and Regards,
>>>>>>>> Arnav
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
