Hi,

I just skimmed the FSST paper and GitHub repo. While the algorithm is
relatively straightforward compared to general compression codecs like
ZSTD, achieving high quality and interoperability for a custom
implementation will take significant effort.

Considering this, vendoring the existing FSST code and improving it based
on our needs seems like a practical approach.

However, the current FSST code uses hand-written AVX512 instructions, which
is inconsistent with Arrow C++'s use of xsimd. This inconsistency will need
to be resolved, likely by modifying the vendored FSST code to align with
xsimd.

Best,
Gang

On Thu, Dec 11, 2025 at 8:12 PM Uwe L. Korn <[email protected]> wrote:

> Hi,
>
> Adding my input here with a distribution maintainer hat on: We dislike
> stuff that is vendored and prefer that things can be built independently.
> If a dependency doesn't have releases or has an unstable API, however,
> vendoring is the better alternative.
>
> Best
> Uwe
>
> On Thu, Dec 11, 2025, at 10:08 AM, Antoine Pitrou wrote:
> > The point of vendoring is to make maintenance easier. If we then have to
> > backport bugfixes on a regular basis, it will actually make maintenance
> > harder.
> >
> > In any case, it is a bit premature to discuss implementation details of
> > a feature that hasn't been formally decided on yet.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Dec 11, 2025 at 07:13, Arnav Balyan wrote:
> >> Thanks everyone for the detailed input and the pointers to prior
> >> experience on DuckDB, Arrow C++ and licensing.
> >>
> >> Looking at DuckDB’s FSST history and the bug fixes they upstreamed, it
> >> seems reasonable to expect that FSST will continue to evolve. That’s a
> >> point in favor of keeping an upstream dependency rather than
> >> maintaining an internal fork.
> >>
> >> On the other hand, as Raul highlighted, the amount of code we need in
> >> Parquet is quite small, and the library does not currently ship official
> >> releases. Vendoring would avoid adding a direct dependency, which matters
> >> since we may not have API guarantees due to the lack of a release process.
> >>
> >> It also looks like DuckDB was able to fix issues internally and upstream
> >> them later (roughly 1 month lag between internal fix and upstream merge).
> >> Vendoring may give a similar flexibility to apply fixes and contribute
> >> back without waiting on upstream.
> >>
> >> Given these considerations, my initial thought was to:
> >> 1. Vendor initially, to avoid a direct dependency on FSST.
> >> 2. Re-evaluate once the project becomes more active or starts shipping
> >> releases, and then switch to a proper CMake dependency.
> >>
> >> Would love to discuss if there are other preferences.
> >>
> >> Thanks,
> >> Arnav
> >>
> >> On Wed, Dec 10, 2025 at 10:14 PM Andrew Lamb <[email protected]> wrote:
> >>
> >>> Touché
> >>>
> >>> One thing that may actually be relevant from the DuckDB fork is to
> >>> examine what they have changed:
> >>> https://github.com/duckdb/duckdb/commits/main/third_party/fsst
> >>>
> >>> Specifically it looks like they have fixed bugs such as [1], and then
> >>> contributed the fixes back upstream[2].
> >>>
> >>> Andrew
> >>>
> >>>
> >>>
> >>> [1]: https://github.com/duckdb/duckdb/commit/078e96ac75c77f03302dadd7b0488cdb9f04fc9a
> >>> [2]: https://github.com/cwida/fsst/pull/31
> >>>
> >>>
> >>> On Wed, Dec 10, 2025 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> >>>
> >>>>
> >>>> Ok, but DuckDB vendors everything (*), so it's not very relevant as a
> >>>> reference :-)
> >>>>
> >>>> My opinion on this is that vendoring should be limited to libraries
> >>>> that are either very small or very stable. I haven't taken a look at
> >>>> FSST yet, so I don't have a specific opinion about it, but since it is
> >>>> a very recent compression/encoding algorithm, my intuition is that it
> >>>> may receive quite a bit of maintenance (bug fixes, improvements) in
> >>>> the coming years.
> >>>>
> >>>> (*) Really: https://github.com/duckdb/duckdb/tree/main/third_party
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>>
> >>>> On Dec 10, 2025 at 14:52, Andrew Lamb wrote:
> >>>>> FWIW vendoring is the approach that DuckDB seems to have taken with
> >>>>> the fsst code[1]
> >>>>>
> >>>>> [1]: https://github.com/duckdb/duckdb/tree/main/third_party/fsst
> >>>>>
> >>>>> On Wed, Dec 10, 2025 at 8:47 AM Raúl Cumplido <[email protected]> wrote:
> >>>>>
> >>>>>> Thanks Arnav for working on this!
> >>>>>>
> >>>>>> I've taken a look at the current draft PR on the Arrow repository [1].
> >>>>>>
> >>>>>> Given the small amount of code required to vendor it, I am +1 on
> >>>>>> vendoring it. In general, the vendored third-party dependencies in
> >>>>>> Arrow C++ require less maintenance. In this case, where there are no
> >>>>>> official releases of FSST and it's not distributed via other
> >>>>>> channels, it is probably a similar effort.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Raúl
> >>>>>>
> >>>>>> [1] https://github.com/apache/arrow/pull/48232
> >>>>>>
> >>>>>> On Wed, Dec 10, 2025 at 14:14, Fokko Driesprong <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hey Arnav,
> >>>>>>>
> >>>>>>> Thanks for raising this. Could you add a link to the brief initial
> >>>>>>> discussion?
> >>>>>>>
> >>>>>>> If we don't need to make any modifications to the external source, I
> >>>>>>> would prefer to pull it in, as that would likely be the easiest
> >>>>>>> maintenance-wise. Upon reviewing the repository, it appears to be
> >>>>>>> MIT-licensed <https://github.com/cwida/fsst/blob/master/LICENSE>.
> >>>>>>> This is compatible with the ASF license
> >>>>>>> <https://www.apache.org/legal/resolved.html#category-a>, so we can
> >>>>>>> ship it as part of the Parquet project. We must ensure that we
> >>>>>>> correctly mention the dependency in the license.
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Fokko
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Dec 10, 2025 at 09:26, Arnav Balyan <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>> We recently proposed support for FSST in Parquet. There are two
> >>>>>>>> main options for taking the FSST dependency:
> >>>>>>>>
> >>>>>>>> 1. CMake dependency on fsst GitHub:
> >>>>>>>>
> >>>>>>>>      - Pull FSST as an external dependency via CMake
> >>>>>>>>      - Adds an external dependency to the build
> >>>>>>>>
> >>>>>>>> 2. Vendor the code:
> >>>>>>>>
> >>>>>>>>      - Need to copy 3-4 required source files directly into the repo
> >>>>>>>>      - No external dependency
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> There was a brief initial discussion on the PR, and I just wanted to
> >>>>>>>> start a thread to discuss further.
> >>>>>>>> Overall this is a lightweight dependency, with a couple of commits
> >>>>>>>> upstream every few months, so vendoring looks like a safe option. It
> >>>>>>>> may avoid an external dependency while keeping maintenance overhead
> >>>>>>>> low. However, we may have to pull in any major changes in the future.
> >>>>>>>>
> >>>>>>>> Would love to know what folks think. Are there any concerns with
> >>>>>>>> either approach, or a preference on how we have handled similar
> >>>>>>>> situations in the past?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks and Regards,
> >>>>>>>> Arnav
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
>
