Hey Imran, I don’t know why I forgot to mention this option. It is definitely a solution to consider. We used this approach to support Spark 2 and Spark 3. Right now, this would mean having iceberg-spark (common code for all versions), iceberg-spark2, iceberg-spark-3 (common code for all Spark 3 versions), plus iceberg-spark-3.1 and iceberg-spark-3.2. We would also need to move our extensions into each version-specific module, since they differ between versions.
The main reason to even consider Option 2 is the number of modules we would otherwise generate inside the main repo and the extra testing time for non-Spark PRs. If we decide in the future to support multiple versions for Hive/Flink/etc, this may get out of hand. But as we all agree, Option 2 has a substantial limitation: we would have to release the core before it can be consumed in engine integrations. That’s why the Option 3 you mention could be a way to go.

- Anton

> On 13 Sep 2021, at 21:04, Imran Rashid <iras...@cloudera.com.INVALID> wrote:
>
> Thanks for bringing this up, Anton.
>
> I am not entirely certain if your option 2 meant "project" in the "Apache
> project" sense or the "gradle project" sense -- it sounds like you mean
> "apache project".
>
> If so, I'd propose Option 3:
>
> Create a "spark-common" gradle project, which builds against the lowest spark
> version we plan to support (3.0 for now, I guess) and also creates interfaces
> for everything specific to different spark versions. Also create "spark-3.x"
> gradle projects, which only build against specific Spark versions, and
> contain implementations for the interfaces in "spark-common".
>
> Pros:
> * Can support as many Spark versions as needed, with each version getting as
> much as it can from its spark version
> * Spark support still integrated into the existing build & release process (I
> guess this could also be a con)
>
> Cons:
> * work to set up the builds
> * multiple binaries, setup becomes more complicated for users
> * testing becomes tough as we increase the mix of supported versions
>
> The "multiple binaries" could be solved with an "Option 4: put it all in one
> binary and use reflection", though imo this is really painful.
>
> On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to support
> older versions but because we compile against 3.0, we cannot use any Spark
> features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important features
> such as dynamic filtering for v2 tables, required distribution and ordering
> for writes, etc. These features are too important to ignore.
>
> Apart from that, I have an end-to-end prototype for merge-on-read with Spark
> that actually leverages some of the 3.2 features. I’ll be implementing all
> new Spark DSv2 APIs for us internally and would love to share that with the
> rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor
> versions with bug fixes.
>
> Pros: almost no changes to the build configuration, no extra work on our side
> as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work with
> older Spark versions but all 0.12 releases will only contain bug fixes.
> Therefore, users will be forced to migrate to Spark 3.2 to consume any new
> Spark or format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce branches for
> 3.0, 3.1 and 3.2.
>
> Pros: decouples the format version from Spark, we can support as many Spark
> versions as needed.
> Cons: more work initially to set everything up, more work to release, will
> need a new release of the core format to consume any changes in the Spark
> integration.
>
> Overall, I think option 2 seems better for the user but my main worry is that
> we will have to release the format more frequently (which is a good thing but
> requires more work and time) and the overall Spark development may be slower.
>
> I’d love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton
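
For anyone skimming the thread, here is a rough sketch of how the Option 3 split (interfaces in a common module, one implementation per Spark version) might look in code. Every name in it (SparkShim, SparkShims, Spark30Shim and so on) is made up for illustration and does not exist in Iceberg or Spark. It also collapses everything into one file, whereas in a real build each implementation would sit in its own version-specific module and would have to be discovered at runtime rather than referenced directly.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: none of these types exist in Iceberg or Spark.
class SparkShimDemo {
  public static void main(String[] args) {
    for (String version : new String[] {"3.0.3", "3.1.2", "3.2.0"}) {
      SparkShim shim = SparkShims.forVersion(version);
      System.out.println(version + " -> " + shim.name()
          + ", required distribution: " + shim.supportsRequiredDistribution());
    }
  }
}

// Would live in the common module, compiled against the lowest supported Spark version.
interface SparkShim {
  String name();

  // Required distribution and ordering for writes arrives with Spark 3.2 per the thread.
  boolean supportsRequiredDistribution();
}

// Each class below would live in its own version-specific module.
final class Spark30Shim implements SparkShim {
  public String name() { return "spark-3.0"; }
  public boolean supportsRequiredDistribution() { return false; }
}

final class Spark31Shim implements SparkShim {
  public String name() { return "spark-3.1"; }
  public boolean supportsRequiredDistribution() { return false; }
}

final class Spark32Shim implements SparkShim {
  public String name() { return "spark-3.2"; }
  public boolean supportsRequiredDistribution() { return true; }
}

// Picks an implementation from the "major.minor" prefix of the running Spark version.
final class SparkShims {
  private static final Map<String, SparkShim> BY_MINOR = new TreeMap<>();

  static {
    // Assumption: direct registration keeps the sketch in one file; separate
    // modules would need a lookup mechanism instead of compile-time references.
    BY_MINOR.put("3.0", new Spark30Shim());
    BY_MINOR.put("3.1", new Spark31Shim());
    BY_MINOR.put("3.2", new Spark32Shim());
  }

  private SparkShims() {}

  static SparkShim forVersion(String sparkVersion) {
    String[] parts = sparkVersion.split("\\.");
    if (parts.length < 2) {
      throw new IllegalArgumentException("Cannot parse Spark version: " + sparkVersion);
    }
    SparkShim shim = BY_MINOR.get(parts[0] + "." + parts[1]);
    if (shim == null) {
      throw new IllegalArgumentException("Unsupported Spark version: " + sparkVersion);
    }
    return shim;
  }
}
```

The point of the split is that shared code only ever talks to SparkShim, so newer-version features can be used where available without forcing the common code to compile against the newest Spark.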