I think we should go for option 1. I'm already not a big fan of having runtime errors for unsupported features based on versions, and I don't think minor version upgrades are a big issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a future where we support multiple Spark versions.
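
To make that concern concrete, here is a rough, hypothetical sketch (not code we have today; the helper name and the way the version string is obtained are assumptions) of the kind of runtime gate that creeps in when one module has to serve multiple Spark versions:

    // Hypothetical sketch only: gate a Spark 3.2-only code path at runtime.
    // The version string could come from SparkSession#version(); that wiring is assumed here.
    public class SparkVersionGate {
      public static void requireSpark32(String sparkVersion, String feature) {
        String[] parts = sparkVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        if (major < 3 || (major == 3 && minor < 2)) {
          // the user only learns about the limitation here, at runtime
          throw new UnsupportedOperationException(
              feature + " requires Spark 3.2 or later, but the session runs " + sparkVersion);
        }
      }
    }

Every call site that touches a 3.2-only API would need a guard like this, and users would only discover the gap at runtime rather than when they pick their versions.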
> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided not to
>> do that because it's beneficial for code cross-reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>
> That’s exactly the concern I have about Option 2 at this moment.
>
>> Overall I would personally prefer us not to support all the minor versions,
>> but instead support maybe just the 2-3 latest versions in a major version.
>
> This is where it gets a bit complicated. If we want to support both Spark 3.1
> and Spark 3.2 with a single module, it means we have to compile against 3.1.
> The problem is that we rely on DSv2, which is being actively developed. 3.2
> and 3.1 have substantial differences. On top of that, we have our extensions
> that are extremely low-level and may break not only between minor versions
> but also between patch releases.
>
>> If there are some features requiring a newer version, it makes sense to
>> move to that newer version in master.
>
> Internally, we don’t deliver new features to older Spark versions as it
> requires a lot of effort to port things. Personally, I don’t think it is too
> bad to require users to upgrade if they want new features. At the same time,
> there are valid concerns with this approach too that we mentioned during the
> sync. For example, certain new features would also work fine with older
> Spark versions. I generally agree with that and that not supporting recent
> versions is not ideal. However, I want to find a balance between the
> complexity on our side and ease of use for the users. Ideally, supporting a
> few recent versions would be sufficient, but our Spark integration is too
> low-level to do that with a single module.
>
>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided not to
>> do that because it's beneficial for code cross-reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>>
>> Overall I would personally prefer us not to support all the minor versions,
>> but instead support maybe just the 2-3 latest versions in a major version.
>> This avoids the problem of some users being unwilling to move to a newer
>> version while we keep patching old Spark version branches. If there are
>> some features requiring a newer version, it makes sense to move to that
>> newer version in master.
>>
>> In addition, because currently Spark is considered the most feature-complete
>> reference implementation compared to all other engines, I think we should
>> not add artificial barriers that would slow down its development speed.
>>
>> So my thinking is closer to option 1.
>>
>> Best,
>> Jack Ye
>>
>>
>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi
>> <aokolnyc...@apple.com.invalid> wrote:
>>
>> Hey folks,
>>
>> I want to discuss our Spark version support strategy.
>>
>> So far, we have tried to support both 3.0 and 3.1. It is great to support
>> older versions, but because we compile against 3.0, we cannot use any Spark
>> features that are offered in newer versions.
>>
>> Spark 3.2 is just around the corner and it brings a lot of important
>> features such as dynamic filtering for v2 tables, required distribution and
>> ordering for writes, etc. These features are too important to ignore.
>>
>> Apart from that, I have an end-to-end prototype for merge-on-read with
>> Spark that actually leverages some of the 3.2 features. I’ll be
>> implementing all new Spark DSv2 APIs for us internally and would love to
>> share that with the rest of the community.
>>
>> I see two options to move forward:
>>
>> Option 1
>>
>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>> minor versions with bug fixes.
>>
>> Pros: almost no changes to the build configuration, no extra work on our
>> side since only a single Spark version is actively maintained.
>> Cons: some new features that we will be adding to master could also work
>> with older Spark versions, but all 0.12 releases will only contain bug
>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>> any new Spark or format features.
>>
>> Option 2
>>
>> Move our Spark integration into a separate project and introduce branches
>> for 3.0, 3.1 and 3.2.
>>
>> Pros: decouples the format version from Spark, so we can support as many
>> Spark versions as needed.
>> Cons: more work initially to set everything up, more work to release, and
>> we will need a new release of the core format to consume any changes in the
>> Spark integration.
>>
>> Overall, I think option 2 seems better for the user, but my main worry is
>> that we will have to release the format more frequently (which is a good
>> thing but requires more work and time) and the overall Spark development
>> may be slower.
>>
>> I’d love to hear what everybody thinks about this matter.
>>
>> Thanks,
>> Anton
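
One bit of context below the quote: the "required distribution and ordering for writes" Anton mentions is a DSv2 interface that was added in Spark 3.2 and is not present in 3.0/3.1, which is why compiling against an older Spark rules it out. A rough sketch of how a write could opt in (the class name and the "dept" clustering column are made up for illustration):

    import org.apache.spark.sql.connector.distributions.Distribution;
    import org.apache.spark.sql.connector.distributions.Distributions;
    import org.apache.spark.sql.connector.expressions.Expression;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.SortOrder;
    import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

    // Sketch only: a DSv2 write that asks Spark 3.2 to cluster incoming rows
    // by a partition column before they reach the writers.
    class ClusteredWrite implements RequiresDistributionAndOrdering {

      @Override
      public Distribution requiredDistribution() {
        // "dept" is a made-up partition column
        return Distributions.clustered(new Expression[] { Expressions.identity("dept") });
      }

      @Override
      public SortOrder[] requiredOrdering() {
        // no within-task sort in this sketch; a real writer would return its sort order
        return new SortOrder[0];
      }
    }

The interface and the distributions API it relies on do not exist in Spark 3.0/3.1, so a module compiled against those versions cannot reference them at all, regardless of what is available at runtime.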