Hey folks,

I want to discuss our Spark version support strategy.

So far, we have tried to support both 3.0 and 3.1. Supporting older versions 
is great, but because we compile against 3.0, we cannot use any Spark features 
that are only offered in newer versions.
Spark 3.2 is just around the corner and brings a lot of important features 
such as dynamic filtering for v2 tables, required distribution and ordering 
for writes, etc. These features are too important to ignore.
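
To make that concrete, here is a rough, heavily simplified sketch of what 
plugging into the new required distribution and ordering API for writes could 
look like. This is not our actual implementation, just an illustration based 
on my reading of the 3.2 interfaces (names like RequiresDistributionAndOrdering, 
Distributions and Expressions are how I recall them and may not match the 
final API exactly; the "dt" column is made up):

// Minimal sketch only, assuming the Spark 3.2 write interfaces look roughly
// like this; not the actual Iceberg implementation.
import org.apache.spark.sql.connector.distributions.Distribution;
import org.apache.spark.sql.connector.distributions.Distributions;
import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.SortDirection;
import org.apache.spark.sql.connector.expressions.SortOrder;
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

// Other Write methods (e.g. toBatch) are omitted in this sketch.
abstract class ExampleWrite implements RequiresDistributionAndOrdering {

  // Ask Spark to cluster incoming rows by a hypothetical partition column "dt"
  // so that each write task handles a small number of partitions.
  @Override
  public Distribution requiredDistribution() {
    return Distributions.clustered(new Expression[] { Expressions.column("dt") });
  }

  // Ask Spark to sort rows within each task by the same column before writing.
  @Override
  public SortOrder[] requiredOrdering() {
    return new SortOrder[] {
      Expressions.sort(Expressions.column("dt"), SortDirection.ASCENDING)
    };
  }
}

The idea is that Spark itself injects the required repartition and sort before 
the write instead of every user having to remember to do it manually. That is 
the kind of feature we simply cannot expose while compiling against 3.0.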

Apart from that, I have an end-to-end prototype for merge-on-read with Spark 
that actually leverages some of the 3.2 features. I’ll be implementing all of 
the new Spark DSv2 APIs for us internally and would love to share that work 
with the rest of the community.

I see two options to move forward:

Option 1

Migrate to Spark 3.2 in master and maintain 0.12 for a while by releasing 
minor versions with bug fixes.

Pros: almost no changes to the build configuration and no extra work on our 
side, as only a single Spark version is actively maintained.
Cons: some new features we will be adding to master could also work with 
older Spark versions, but 0.12 releases will only contain bug fixes. 
Therefore, users will be forced to migrate to Spark 3.2 to consume any new 
Spark or format features.

Option 2

Move our Spark integration into a separate project and introduce branches for 
3.0, 3.1 and 3.2.

Pros: decouples the format version from Spark, so we can support as many 
Spark versions as needed.
Cons: more work initially to set everything up, more work to release, and any 
Spark integration change that depends on core changes will require a new 
release of the core format.

Overall, I think option 2 seems better for users, but my main worry is that we 
will have to release the format more frequently (which is a good thing but 
requires more work and time) and that overall Spark development may be slower.

I’d love to hear what everybody thinks about this matter.

Thanks,
Anton
