The point of vendoring is to make maintenance easier. If we then have to
backport bugfixes on a regular basis, it will actually make maintenance
harder.
In any case, it is a bit premature to discuss implementation details of
a feature that hasn't been formally decided on yet.
Regards
Antoine.
On 11/12/2025 at 07:13, Arnav Balyan wrote:
Thanks everyone for the detailed input and the pointers to prior experience
on DuckDB, Arrow C++ and licensing.
Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it
seems reasonable to expect that FSST will continue to evolve. That’s a
point in favor of keeping an upstream dependency rather than maintaining an
internal fork.
On the other hand, as Raúl highlighted, the amount of code we need in
Parquet is quite small, and the library does not currently ship official
releases. Vendoring would avoid adding a direct dependency on a library
for which we may not have API guarantees, given the lack of a release
process.
It also looks like DuckDB was able to fix issues internally and upstream
them later (roughly 1 month lag between internal fix and upstream merge).
Vendoring may give a similar flexibility to apply fixes and contribute back
without waiting on upstream.
Given these considerations, my initial thought was to:
1. Vendor initially, to avoid direct dependency on FSST.
2. Re-evaluate once the project becomes more active or starts shipping
releases, at which point we could switch to a proper CMake dependency.
Would love to discuss if there are other preferences.
Thanks,
Arnav
On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:
Touché.
One thing that may actually be relevant from the duckdb fork is to examine
what they have changed:
https://github.com/duckdb/duckdb/commits/main/third_party/fsst
Specifically, it looks like they have fixed bugs such as [1], and then
contributed the fixes back upstream [2].
Andrew
[1]:
https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
[2]: https://github.com/cwida/fsst/pull/31
On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
Ok, but DuckDB vendors everything (*), so it's not very relevant as a
reference :-)
My opinion on this is that vendoring should be limited to libraries that
are either very small or very stable. I haven't taken a look at FSST
yet, so I don't have a specific opinion about it, but since it is a very
recent compression/encoding algorithm, my intuition is that it may
receive quite a bit of maintenance (bug fixes, improvements) in the
coming years.
(*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
Regards
Antoine.
On 10/12/2025 at 14:52, Andrew Lamb wrote:
FWIW, vendoring is the approach that DuckDB seems to have taken with the
fsst code [1].
[1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
Thanks Arnav for working on this!
I've taken a look at the current draft PR on the Arrow repository [1].
Given the small amount of code required to vendor it, I am +1 on
vendoring it. In general, the vendored third-party dependencies in
Arrow C++ require less maintenance. In this case, where there are no
official releases of FSST and it is not distributed via other channels,
the effort is probably similar.
Regards,
Raúl
[1] https://github.com/apache/arrow/pull/48232
On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong <[email protected]> wrote:
Hey Arnav,
Thanks for raising this. Could you add a link to the brief initial
discussion?
If we don't need to make any modifications to the external source, I
would prefer to pull it in, as that would likely be the easiest
maintenance-wise.
The repository appears to be MIT-licensed
<https://github.com/cwida/fsst/blob/master/LICENSE>. This is compatible
with the ASF license
<https://www.apache.org/legal/resolved.html#category-a>, so we can ship
it as part of the Parquet project. We must ensure that we correctly
mention the dependency in the license.
Kind regards,
Fokko
On Wed, Dec 10, 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
Hi team,
We recently proposed adding FSST support to Parquet. There are two main
options for taking on the FSST dependency:
1. CMake dependency on fsst GitHub:
- Pull FSST as an external dependency via CMake
- Adds an external dependency to the build
2. Vendor the code:
- Need to copy 3-4 required source files directly into the repo
- No external dependency
There was a brief initial discussion on the PR, and I just wanted to
start a thread to discuss further.
Overall, this is a lightweight dependency with a couple of commits
upstream every few months, so vendoring looks like a safe option. It
would avoid an external dependency while keeping maintenance overhead
low. However, we may have to pull in any major changes in the future.
Would love to know what folks think. Are there any concerns with either
approach, or a preference based on how we have handled similar
situations in the past?
Thanks and Regards,
Arnav