I think we should go for option 1. I'm already not a big fan of having runtime errors for unsupported features based on versions, and I don't think minor version upgrades are a big issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a future where we support multiple Spark versions.
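
To make that concern concrete, here is a rough, hypothetical sketch (not code we have today; the helper name and the way the version string is obtained are assumptions) of the kind of runtime gate that creeps in when one module has to serve multiple Spark versions:

    // Hypothetical sketch only: gate a Spark 3.2-only code path at runtime.
    // The version string could come from SparkSession#version(); that wiring is assumed here.
    public class SparkVersionGate {
      public static void requireSpark32(String sparkVersion, String feature) {
        String[] parts = sparkVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        if (major < 3 || (major == 3 && minor < 2)) {
          // the user only learns about the limitation here, at runtime
          throw new UnsupportedOperationException(
              feature + " requires Spark 3.2 or later, but the session runs " + sparkVersion);
        }
      }
    }

Every call site that touches a 3.2-only API would need a guard like this, and users would only discover the gap at runtime rather than when they pick their versions.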
> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided not to
>> do that because it's beneficial for code cross-reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>
> That’s exactly the concern I have about Option 2 at this moment.
>
>> Overall I would personally prefer us not to support all the minor versions,
>> but instead support maybe just the 2-3 latest versions in a major version.
>
> This is where it gets a bit complicated. If we want to support both Spark 3.1
> and Spark 3.2 with a single module, it means we have to compile against 3.1.
> The problem is that we rely on DSv2, which is being actively developed. 3.2
> and 3.1 have substantial differences. On top of that, we have our extensions
> that are extremely low-level and may break not only between minor versions
> but also between patch releases.
>
>> If there are some features requiring a newer version, it makes sense to
>> move to that newer version in master.
>
> Internally, we don’t deliver new features to older Spark versions as it
> requires a lot of effort to port things. Personally, I don’t think it is too
> bad to require users to upgrade if they want new features. At the same time,
> there are valid concerns with this approach too that we mentioned during the
> sync. For example, certain new features would also work fine with older
> Spark versions. I generally agree with that and that not supporting recent
> versions is not ideal. However, I want to find a balance between the
> complexity on our side and ease of use for the users. Ideally, supporting a
> few recent versions would be sufficient, but our Spark integration is too
> low-level to do that with a single module.
>
>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>
>> First of all, is option 2 a viable option? We discussed separating the
>> python module outside of the project a few weeks ago, and decided not to
>> do that because it's beneficial for code cross-reference and more intuitive
>> for new developers to see everything in the same repository. I would expect
>> the same argument to also hold here.
>>
>> Overall I would personally prefer us not to support all the minor versions,
>> but instead support maybe just the 2-3 latest versions in a major version.
>> This avoids the problem of some users being unwilling to move to a newer
>> version while we keep patching old Spark version branches. If there are
>> some features requiring a newer version, it makes sense to move to that
>> newer version in master.
>>
>> In addition, because currently Spark is considered the most feature-complete
>> reference implementation compared to all other engines, I think we should
>> not add artificial barriers that would slow down its development speed.
>>
>> So my thinking is closer to option 1.
>>
>> Best,
>> Jack Ye
>>
>>
>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi
>> <aokolnyc...@apple.com.invalid> wrote:
>>
>> Hey folks,
>>
>> I want to discuss our Spark version support strategy.
>>
>> So far, we have tried to support both 3.0 and 3.1. It is great to support
>> older versions, but because we compile against 3.0, we cannot use any Spark
>> features that are offered in newer versions.
>>
>> Spark 3.2 is just around the corner and it brings a lot of important
>> features such as dynamic filtering for v2 tables, required distribution and
>> ordering for writes, etc. These features are too important to ignore.
>>
>> Apart from that, I have an end-to-end prototype for merge-on-read with
>> Spark that actually leverages some of the 3.2 features. I’ll be
>> implementing all new Spark DSv2 APIs for us internally and would love to
>> share that with the rest of the community.
>>
>> I see two options to move forward:
>>
>> Option 1
>>
>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by releasing
>> minor versions with bug fixes.
>>
>> Pros: almost no changes to the build configuration, no extra work on our
>> side since only a single Spark version is actively maintained.
>> Cons: some new features that we will be adding to master could also work
>> with older Spark versions, but all 0.12 releases will only contain bug
>> fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume
>> any new Spark or format features.
>>
>> Option 2
>>
>> Move our Spark integration into a separate project and introduce branches
>> for 3.0, 3.1 and 3.2.
>>
>> Pros: decouples the format version from Spark, so we can support as many
>> Spark versions as needed.
>> Cons: more work initially to set everything up, more work to release, and
>> we will need a new release of the core format to consume any changes in the
>> Spark integration.
>>
>> Overall, I think option 2 seems better for the user, but my main worry is
>> that we will have to release the format more frequently (which is a good
>> thing but requires more work and time) and the overall Spark development
>> may be slower.
>>
>> I’d love to hear what everybody thinks about this matter.
>>
>> Thanks,
>> Anton
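
One bit of context below the quote: the "required distribution and ordering for writes" Anton mentions is a DSv2 interface that was added in Spark 3.2 and is not present in 3.0/3.1, which is why compiling against an older Spark rules it out. A rough sketch of how a write could opt in (the class name and the "dept" clustering column are made up for illustration):

    import org.apache.spark.sql.connector.distributions.Distribution;
    import org.apache.spark.sql.connector.distributions.Distributions;
    import org.apache.spark.sql.connector.expressions.Expression;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.SortOrder;
    import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

    // Sketch only: a DSv2 write that asks Spark 3.2 to cluster incoming rows
    // by a partition column before they reach the writers.
    class ClusteredWrite implements RequiresDistributionAndOrdering {

      @Override
      public Distribution requiredDistribution() {
        // "dept" is a made-up partition column
        return Distributions.clustered(new Expression[] { Expressions.identity("dept") });
      }

      @Override
      public SortOrder[] requiredOrdering() {
        // no within-task sort in this sketch; a real writer would return its sort order
        return new SortOrder[0];
      }
    }

The interface and the distributions API it relies on do not exist in Spark 3.0/3.1, so a module compiled against those versions cannot reference them at all, regardless of what is available at runtime.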