First of all, is option 2 actually viable? We discussed moving the Python
module out of the project a few weeks ago and decided against it, because
keeping everything in the same repository makes cross-referencing code
easier and is more intuitive for new developers. I would expect the same
argument to hold here as well.

Overall, I would personally prefer that we not support every minor version,
but instead support only the 2-3 latest minor versions within a major
version. This avoids the situation where some users are unwilling to move
to a newer version and we keep patching old Spark version branches
indefinitely. If some feature requires a newer Spark version, it makes
sense to move master to that newer version.

In addition, because Spark is currently considered the most
feature-complete reference implementation among all the engines, I think we
should not add artificial barriers that would slow down its development.

So my thinking is closer to option 1.

Best,
Jack Ye


On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Hey folks,
>
> I want to discuss our Spark version support strategy.
>
> So far, we have tried to support both 3.0 and 3.1. It is great to support
> older versions, but because we compile against 3.0, we cannot use any Spark
> features that are offered only in newer versions.
> Spark 3.2 is just around the corner, and it brings a lot of important
> features such as dynamic filtering for v2 tables, required distribution and
> ordering for writes, etc. These features are too important to ignore.
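>
> To make this concrete, here is a rough sketch of the kind of 3.2-only API
> I mean. It uses the new DSv2 write interfaces (the "dept" column is just a
> made-up example), and it cannot even compile against 3.0 or 3.1 because
> these interfaces do not exist there:
>
>   import org.apache.spark.sql.connector.distributions.Distribution;
>   import org.apache.spark.sql.connector.distributions.Distributions;
>   import org.apache.spark.sql.connector.expressions.Expression;
>   import org.apache.spark.sql.connector.expressions.Expressions;
>   import org.apache.spark.sql.connector.expressions.SortOrder;
>   import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;
>
>   // A write that asks Spark to cluster incoming rows by the "dept" column
>   // before they reach the writers (RequiresDistributionAndOrdering
>   // extends Write, both new in Spark 3.2).
>   class ExampleWrite implements RequiresDistributionAndOrdering {
>
>     @Override
>     public Distribution requiredDistribution() {
>       // Route all rows with the same "dept" value to the same task.
>       return Distributions.clustered(
>           new Expression[] { Expressions.identity("dept") });
>     }
>
>     @Override
>     public SortOrder[] requiredOrdering() {
>       // No within-task sort needed for this sketch.
>       return new SortOrder[0];
>     }
>   }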
>
> Apart from that, I have an end-to-end prototype for merge-on-read with
> Spark that actually leverages some of the 3.2 features. I’ll be
> implementing all new Spark DSv2 APIs for us internally and would love to
> share that with the rest of the community.
>
> I see two options to move forward:
>
> Option 1
>
> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
> minor versions with bug fixes.
>
> Pros: almost no changes to the build configuration, and no extra work on
> our side since just a single Spark version is actively maintained.
> Cons: some new features we add to master could also work with older Spark
> versions, but all 0.12 releases will contain only bug fixes. Users will
> therefore be forced to migrate to Spark 3.2 to consume any new Spark or
> format features.
>
> Option 2
>
> Move our Spark integration into a separate project and introduce branches
> for 3.0, 3.1 and 3.2.
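>
> For illustration only (the branch names are hypothetical), the layout
> could look like this:
>
>   iceberg-spark (separate project)
>     branch spark-3.0  ->  compiles against Spark 3.0.x
>     branch spark-3.1  ->  compiles against Spark 3.1.x
>     branch spark-3.2  ->  compiles against Spark 3.2.x, gets new DSv2 work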
>
> Pros: decouples the format version from Spark, so we can support as many
> Spark versions as needed.
> Cons: more work initially to set everything up, more work to release, and
> the Spark integration will need a new release of the core format to
> consume any core changes.
>
> Overall, I think option 2 seems better for users, but my main worry is
> that we will have to release the format more frequently (which is a good
> thing but requires more work and time) and that overall Spark development
> may be slower.
>
> I’d love to hear what everybody thinks about this matter.
>
> Thanks,
> Anton
