Re: [DISCUSS] Spark version support strategy

Anton Okolnychyi Tue, 14 Sep 2021 09:43:55 -0700

Hey Imran,

I don’t know why I forgot to mention this option too. It is definitely a 
solution to consider. We used this approach to support Spark 2 and Spark 3.
Right now, this would mean having iceberg-spark (common code for all versions), 
iceberg-spark2, iceberg-spark-3 (common code for all Spark 3 versions), plus 
iceberg-spark-3.1 and iceberg-spark-3.2 for the version-specific code.
We would also need to move our extensions into each module respectively as they 
differ.


The main reason to even consider Option 2 is the number of modules we will 
generate inside the main repo and the extra testing time for non-Spark PRs. If 
we decide in the future to support multiple versions for Hive/Flink/etc, this 
may get out of hand.

But as we all agree, Option 2 has a substantial limitation: we would have to 
release the core before it can be consumed in engine integrations. That’s 
why the Option 3 you mention could be a way to go.
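
To make Option 3 a bit more concrete, here is a rough sketch of how the split 
could look: an interface in the common module and one implementation per Spark 
version module. All names below (SparkActions, Spark31Actions, Spark32Actions, 
describe) are hypothetical and are not taken from the Iceberg codebase. The 
classes are nested in a single file only to keep the example self-contained.

```java
import java.util.List;

public class SparkVersionSketch {

  // Hypothetical interface that would live in the common module. It is compiled
  // against the lowest supported Spark version and hides version-specific APIs.
  public interface SparkActions {
    String sparkVersion();

    boolean supportsDynamicFiltering();
  }

  // Hypothetical implementation that would live in a Spark 3.1 module.
  public static final class Spark31Actions implements SparkActions {
    @Override
    public String sparkVersion() {
      return "3.1";
    }

    @Override
    public boolean supportsDynamicFiltering() {
      return false; // dynamic filtering for v2 tables arrives with Spark 3.2
    }
  }

  // Hypothetical implementation that would live in a Spark 3.2 module.
  public static final class Spark32Actions implements SparkActions {
    @Override
    public String sparkVersion() {
      return "3.2";
    }

    @Override
    public boolean supportsDynamicFiltering() {
      return true;
    }
  }

  // Common code depends only on the interface, never on a concrete version.
  public static String describe(SparkActions actions) {
    String state = actions.supportsDynamicFiltering() ? "enabled" : "disabled";
    return "Spark " + actions.sparkVersion() + ": dynamic filtering " + state;
  }

  public static void main(String[] args) {
    for (SparkActions actions : List.of(new Spark31Actions(), new Spark32Actions())) {
      System.out.println(describe(actions));
    }
  }
}
```

In a real build each implementation would be compiled in its own module 
against its own Spark version, so only the interface has to stay compatible 
with the lowest version we support.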

- Anton


> On 13 Sep 2021, at 21:04, Imran Rashid <[email protected]> wrote:
> 
> Thanks for bringing this up, Anton.
> 
> I am not entirely certain if your option 2 meant "project" in the "Apache 
> project" sense or the "gradle project" sense -- it sounds like you mean 
> "apache project".
> 
> If so, I'd propose Option 3:
> 
> Create a "spark-common" gradle project, which builds against the lowest Spark 
> version we plan to support (3.0 for now, I guess) and also creates interfaces 
> for everything specific to different Spark versions.  Also create "spark-3.x" 
> gradle projects, which only build against specific Spark versions, and 
> contain implementations for the interfaces in "spark-common".
> 
> Pros:
> * Can support as many Spark versions as needed, with each version getting as 
> much as it can from its spark version
> * Spark support still integrated into the existing build & release process (I 
> guess this could also be a con)
> 
> Cons:
> * work to set up the builds
> * multiple binaries, setup becomes more complicated for users
> * testing becomes tough as we increase the mix of supported versions
> 
> 
> 
> The "multiple binaries" could be solved with an "Option 4: put it all in one 
> binary and use reflection", though imo this is really painful.
> 
> On Mon, Sep 13, 2021 at 9:39 PM Anton Okolnychyi 
> <[email protected]> wrote:
> Hey folks,
> 
> I want to discuss our Spark version support strategy.
> 
> So far, we have tried to support both 3.0 and 3.1. It is great to support 
> older versions but because we compile against 3.0, we cannot use any Spark 
> features that are offered in newer versions.
> Spark 3.2 is just around the corner and it brings a lot of important features 
> such as dynamic filtering for v2 tables, required distribution and ordering 
> for writes, etc. These features are too important to ignore.
> 
> Apart from that, I have an end-to-end prototype for merge-on-read with Spark 
> that actually leverages some of the 3.2 features. I’ll be implementing all 
> new Spark DSv2 APIs for us internally and would love to share that with the 
> rest of the community.
> 
> I see two options to move forward:
> 
> Option 1
> 
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing minor 
> versions with bug fixes.
> 
> Pros: almost no changes to the build configuration, no extra work on our side 
> as just a single Spark version is actively maintained.
> Cons: some new features that we will be adding to master could also work with 
> older Spark versions but all 0.12 releases will only contain bug fixes. 
> Therefore, users will be forced to migrate to Spark 3.2 to consume any new 
> Spark or format features.
> 
> Option 2
> 
> Move our Spark integration into a separate project and introduce branches for 
> 3.0, 3.1 and 3.2.
> 
> Pros: decouples the format version from Spark, we can support as many Spark 
> versions as needed.
> Cons: more work initially to set everything up, more work to release, will 
> need a new release of the core format to consume any changes in the Spark 
> integration.
> 
> Overall, I think option 2 seems better for the user but my main worry is that 
> we will have to release the format more frequently (which is a good thing but 
> requires more work and time) and the overall Spark development may be slower.
> 
> I’d love to hear what everybody thinks about this matter.
> 
> Thanks,
> Anton
