Thanks for bringing this up, Anton. I'm not entirely certain whether your option 2 meant "project" in the "Apache project" sense or the "Gradle project" sense -- it sounds like you mean "Apache project".
If so, I'd propose Option 3: create a "spark-common" Gradle project, which
builds against the lowest Spark version we plan to support (3.0 for now, I
guess) and also defines interfaces for everything specific to different
Spark versions. Also create "spark-3.x" Gradle projects, which each build
against only their specific Spark version and contain implementations of
the interfaces in "spark-common". (A rough sketch of the Gradle wiring is
at the bottom of this mail.)

Pros:
* Can support as many Spark versions as needed, with each version getting
as much as it can from its Spark version
* Spark support stays integrated into the existing build & release process
(I guess this could also be a con)

Cons:
* work to set up the builds
* multiple binaries, so setup becomes more complicated for users
* testing becomes tough as we increase the mix of supported versions

The "multiple binaries" con could be solved with an "Option 4: put it all
in one binary and use reflection" (also sketched below), though IMO this
is really painful.

On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to
> support older versions, but because we compile against 3.0, we cannot
> use any Spark features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important
> features such as dynamic filtering for v2 tables, required distribution
> and ordering for writes, etc. These features are too important to
> ignore.
>
> Apart from that, I have an end-to-end prototype for merge-on-read with
> Spark that actually leverages some of the 3.2 features. I'll be
> implementing all new Spark DSv2 APIs for us internally and would love
> to share that with the rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
> minor versions with bug fixes.
>
> Pros: almost no changes to the build configuration, no extra work on
> our side as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also
> work with older Spark versions, but all 0.12 releases will only contain
> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
> consume any new Spark or format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce
> branches for 3.0, 3.1 and 3.2.
>
> Pros: decouples the format version from Spark; we can support as many
> Spark versions as needed.
> Cons: more work initially to set everything up, more work to release,
> and we will need a new release of the core format to consume any
> changes in the Spark integration.
>
> Overall, I think option 2 seems better for the user, but my main worry
> is that we will have to release the format more frequently (which is a
> good thing but requires more work and time) and the overall Spark
> development may be slower.
>
> I'd love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton
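
To make Option 3 more concrete, here is the Gradle sketch I referred to
above. It's only meant to show the shape of the build -- the project
names are made up, and I'm assuming the usual spark-sql_2.12 artifact
coordinates:

    // settings.gradle -- hypothetical multi-project layout
    include 'spark-common'  // compiles against the lowest supported Spark (3.0)
    include 'spark-3.0'
    include 'spark-3.1'
    include 'spark-3.2'     // each implements the interfaces from spark-common

    // spark-3.2/build.gradle -- each version project pins its own Spark
    dependencies {
        implementation project(':spark-common')
        compileOnly "org.apache.spark:spark-sql_2.12:3.2.0"
    }

Each spark-3.x project would publish its own artifact, which is exactly
the "multiple binaries" con: users have to pick the jar that matches
their Spark version.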
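
For contrast, here is roughly what Option 4 would look like -- a minimal
sketch with hypothetical names (VersionedActions, org.example.*), not a
real implementation:

    // Would live in spark-common, compiled against Spark 3.0:
    interface VersionedActions { /* version-specific hooks */ }

    // Runtime dispatch; the caller would pass e.g. spark.version():
    class VersionedActionsLoader {
      static VersionedActions load(String sparkVersion) throws ReflectiveOperationException {
        String major = sparkVersion.substring(0, 3);  // "3.2.0" -> "3.2"
        // All implementations ship in one jar, but only the class matching
        // the running Spark version is ever loaded, so the other classes'
        // references to newer Spark APIs never have to resolve.
        String impl = "org.example.spark" + major.replace(".", "") + ".VersionedActionsImpl";
        return (VersionedActions) Class.forName(impl)
            .getDeclaredConstructor()
            .newInstance();
      }
    }

The dispatch itself is easy; IMO the painful part is all the
stringly-typed wiring and losing compile-time checks on the
version-specific code.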