Thanks for bringing this up, Anton.

I am not entirely certain whether your option 2 meant "project" in the
"Apache project" sense or the "Gradle project" sense -- it sounds like you
mean "Apache project".

If so, I'd propose Option 3:

Create a "spark-common" gradle project, which builds against the lowest
spark version we plan to support (3.0 for now, I guess) and also creates
interfaces for everything specific to different spark versions.  Also
create "spark-3.x" gradle projects, which only build against specific
gradle versions, and contain implementations for the interface in
"spark-common"

Pros:
* Can support as many Spark versions as needed, with each module taking
full advantage of the features of its Spark version
* Spark support stays integrated into the existing build & release process
(I guess this could also be a con)

Cons:
* Work to set up the builds
* Multiple binaries, so setup becomes more complicated for users
* Testing becomes tough as we increase the mix of supported versions



The "multiple binaries" could be solved with an "Option 4: put it all in
one binary and use reflection", though imo this is really painful.
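
For reference, the reflection route would mean compiling everything against
the lowest Spark version and invoking anything newer by name. Roughly (the
method name below is invented just to illustrate the pain):

    import java.lang.reflect.InvocationTargetException;
    import java.lang.reflect.Method;

    public class ReflectiveCalls {
      // Compiled against Spark 3.0; "reportDynamicFilters" stands in for
      // any API that only exists in a newer Spark version.
      static Object callIfAvailable(Object scan) {
        try {
          Method m = scan.getClass().getMethod("reportDynamicFilters");
          return m.invoke(scan);
        } catch (NoSuchMethodException e) {
          return null; // older Spark on the classpath, degrade gracefully
        } catch (IllegalAccessException | InvocationTargetException e) {
          throw new RuntimeException("Reflective call failed", e);
        }
      }
    }

Every call site like that loses compile-time checking, which is why I'd
rather take the multi-project setup.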

On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to support
> older versions but because we compile against 3.0, we cannot use any Spark
> features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important
> features such as dynamic filtering for v2 tables, required distribution and
> ordering for writes, etc. These features are too important to ignore.
>
> Apart from that, I have an end-to-end prototype for merge-on-read with
> Spark that actually leverages some of the 3.2 features. I’ll be
> implementing all new Spark DSv2 APIs for us internally and would love to
> share that with the rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
> minor versions with bug fixes.
>
> Pros: almost no changes to the build configuration, no extra work on our
> side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work
> with older Spark versions but all 0.12 releases will only contain bug
> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
> any new Spark or format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce branches
> for 3.0, 3.1 and 3.2.
>
> Pros: decouples the format version from Spark, we can support as many
> Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, will
> need a new release of the core format to consume any changes in the Spark
> integration.
>
> Overall, I think option 2 seems better for the user but my main worry is
> that we will have to release the format more frequently (which is a good
> thing but requires more work and time) and the overall Spark development
> may be slower.
>
> I’d love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton
