Thanks for pointing this out, Dongjoon. To clarify, I’m not suggesting that we can break compatibility. I’m suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.
These APIs are marked unstable, so we could make changes to them if we needed — as we have done in the 2.x line — but I don’t see a reason why we would break compatibility in the 3.x line.

On Fri, Sep 20, 2019 at 8:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Do you mean you want to have a breaking API change between 3.0 and 3.1?
> I believe we follow Semantic Versioning
> ( https://spark.apache.org/versioning-policy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> I don’t think we need to gate a 3.0 release on making a more stable
>> version of InternalRow
>>
>> Sounds like we agree, then. We will use it for 3.0, but there are known
>> problems with it.
>>
>> Thinking we’d have dsv2 working in both 3.x (which will change and
>> progress towards more stable, but will have to break certain APIs) and 2.x
>> seems like a false premise.
>>
>> Why do you think we will need to break certain APIs before 3.0?
>>
>> I’m only suggesting that we release the same support in a 2.5 release
>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>> seems like we can certainly do that. We just won’t add any breaking changes
>> before 3.1.
>>
>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> I don't think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>> (which will change and progress towards more stable, but will have to break
>>> certain APIs) and 2.x seems like a false premise.
>>>
>>> To point out some problems with InternalRow that you think are already
>>> pragmatic and stable:
>>>
>>> The class is in catalyst, which states:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>
>>> /**
>>>  * Catalyst is a library for manipulating relational query plans. All
>>>  * classes in catalyst are considered an internal API to Spark SQL and
>>>  * are subject to change between minor releases.
>>>  */
>>>
>>> There is not even an annotation on the interface.
>>>
>>> The entire dependency chain was created to be private and is tightly
>>> coupled with internal implementations. For example:
>>>
>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>
>>> /**
>>>  * A UTF-8 String for internal Spark use.
>>>  * <p>
>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>  * comparison and search; see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>  * <p>
>>>  * Note: This is not designed for general use cases and should not be
>>>  * used outside SQL.
>>>  */
>>>
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>
>>> (which, again, is in the catalyst package)
>>>
>>> If you want to argue this way, you might as well argue that we should
>>> make the entire catalyst package public to be pragmatic and not allow
>>> any changes.
>>>
>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2, and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow;
>>>> it was to get the majority of SQL working on top of the interface added
>>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>>> reasonable.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> To push back, while I agree we should not drastically change
>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>> the Catalyst package or the unsafe package. External implementations should
>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>> back and forth.
>>>>
>>>> When you created the PR to make InternalRow public, the understanding
>>>> was to work towards making it stable in the future, assuming we would start
>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>> APIs that are tightly coupled with other internal pieces public and stable
>>>> and call it a day, just because it happens to satisfy some use cases
>>>> temporarily, assuming the rest of Spark doesn't change.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > DSv2 is far from stable right?
>>>>
>>>> No, I think it is reasonably stable and very close to being ready for a
>>>> release.
>>>>
>>>> > All the actual data types are unstable and you guys have completely
>>>> > ignored that.
>>>>
>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>> stable API and there has been no work to avoid using it.
>>>> In any case, I don't think that anyone is suggesting that we delay 3.0
>>>> until a replacement for `InternalRow` is added, right?
>>>>
>>>> While I understand the motivation for a better solution here, I think
>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>
>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>> > invasive of a change to backport once you consider the parts needed to
>>>> > make dsv2 stable.
>>>>
>>>> I believe that those of us working on DSv2 are confident about the
>>>> current stability. We set goals for what to get into the 3.0 release months
>>>> ago and have very nearly reached the point where we are ready for that
>>>> release.
>>>>
>>>> I don't think instability would be a problem in maintaining
>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>> we need to make API changes (other than additions), we can make those in
>>>> the 3.1 release. The goals we set for the 3.0 release have been reached
>>>> with the current API, so if we are ready to release 3.0, we can release a
>>>> 2.5 with the same API.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> DSv2 is far from stable, right? All the actual data types are unstable,
>>>> and you guys have completely ignored that. We'd need to work on that, and
>>>> that will be a breaking change. If the goal is to make DSv2 work across 3.x
>>>> and 2.x, that seems too invasive of a change to backport once you consider
>>>> the parts needed to make dsv2 stable.
>>>>
>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>> added.
>>>>
>>>> A Spark 2.5 release with these two additions will help people migrate
>>>> to Spark 3.0 when it is released, because they will be able to use a single
>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>> upgrading to 3.0 won't also require updating to Java 11: users could move
>>>> to Java 11 with the 2.5 release and so face fewer major changes at once.
>>>>
>>>> Another reason to consider a 2.5 release is that many people are
>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>> it makes sense to share this work with the community.
>>>>
>>>> This release line would consist only of backports, like DSv2 and Java 11,
>>>> that assist compatibility, to keep the scope of the release small. The
>>>> purpose is to help people move to 3.0, not to distract from the 3.0
>>>> release.
>>>>
>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>> about this plan?
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
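
[Editor's note] The decoupling Reynold describes in the thread (publicly exposed interfaces kept out of the catalyst/unsafe packages, with external implementations separated from internal ones and cheap conversion back and forth) can be sketched in plain Scala. This is a hypothetical illustration only: the names `PublicRow`, `InternalArrayRow`, and `RowConverters` are invented for the example and are not Spark's actual classes.

```scala
// Hypothetical sketch of the decoupling pattern discussed above: a public
// row interface separated from an internal implementation, with cheap
// conversion in both directions. Not Spark's actual API.

// The public, stable-facing interface that external sources would implement.
trait PublicRow {
  def numFields: Int
  def get(i: Int): Any
}

// An internal implementation, free to change between minor releases
// without affecting external implementors of PublicRow.
final class InternalArrayRow(private val values: Array[Any]) extends PublicRow {
  def numFields: Int = values.length
  def get(i: Int): Any = values(i)
}

object RowConverters {
  // Cheap conversion inward: if the row is already the internal type,
  // reuse it; otherwise copy the fields once into the internal form.
  def toInternal(row: PublicRow): InternalArrayRow = row match {
    case r: InternalArrayRow => r
    case other => new InternalArrayRow(Array.tabulate(other.numFields)(other.get))
  }

  // Conversion outward is free, because the internal type already
  // implements the public interface.
  def toPublic(row: InternalArrayRow): PublicRow = row
}
```

Under this pattern, external code depends only on `PublicRow`, so the internal representation can evolve between minor releases without a breaking change, which is the stability property the thread is debating.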