Hey folks,

I want to discuss our Spark version support strategy.
So far, we have tried to support both 3.0 and 3.1. Supporting older versions is great, but because we compile against 3.0, we cannot use any Spark features offered in newer versions. Spark 3.2 is just around the corner and brings a lot of important features, such as dynamic filtering for v2 tables and required distribution and ordering for writes (see the sketch at the end of this email). These features are too important to ignore. Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I'll be implementing all of the new Spark DSv2 APIs for us internally and would love to share that work with the rest of the community.

I see two options to move forward:

Option 1: Migrate to Spark 3.2 in master and maintain 0.12 for a while by releasing minor versions with bug fixes.

Pros: almost no changes to the build configuration, and no extra work on our side since only a single Spark version is actively maintained.
Cons: some new features that we add to master could also work with older Spark versions, but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.

Option 2: Move our Spark integration into a separate project and introduce branches for 3.0, 3.1, and 3.2.

Pros: decouples the format version from Spark, so we can support as many Spark versions as needed.
Cons: more work initially to set everything up, more work to release, and we will need a new release of the core format to consume any changes in the Spark integration.

Overall, I think option 2 seems better for users, but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and that overall Spark development may be slower.

I'd love to hear what everybody thinks about this matter.

Thanks,
Anton
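P.S. For anyone who hasn't looked at the new 3.2 write-side API yet, here is a minimal sketch of what the required distribution and ordering support looks like on the connector side. The class name and the "date" column are made up for illustration, and the actual BatchWrite plumbing is omitted:

```java
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.SortDirection;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

// Hypothetical write for a table partitioned by "date": Spark clusters the
// incoming rows by the partition column and sorts them within each task, so
// each writer sees its partition's rows together and in order.
class ExampleWrite implements RequiresDistributionAndOrdering {

  @Override
  public Distribution requiredDistribution() {
    // cluster rows by the partition column before they reach the writers
    return Distributions.clustered(new Expression[] { Expressions.column("date") });
  }

  @Override
  public SortOrder[] requiredOrdering() {
    // sort rows within each task by the partition column
    return new SortOrder[] {
        Expressions.sort(Expressions.column("date"), SortDirection.ASCENDING)
    };
  }

  // toBatch()/toStreaming() come from the Write interface and are omitted here.
}
```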