Small correction: confusion -> conflict, so I had to go through and understand parts of the changes.
On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim <kabh...@gmail.com> wrote:

> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
> deal with this as the change made confusion on my PRs...), but my bet is
> that DSv2 has already changed in an incompatible way, at least for those
> who work on custom DataSources. Making downstream projects diverge their
> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't
> be a good experience, especially since we have not completely closed off
> the chance to further modify DSv2, and such a change could be backward
> incompatible.
>
> If we really want to bring the DSv2 change to the 2.x version line so that
> end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2,
> I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is
> officially released, honestly even later than that, say, after getting
> some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope
> we don't make Spark 2.5 a kind of "tech preview" that Spark 2.4 users may
> be frustrated to upgrade to as the next minor version.
>
> Btw, do we have any specific target users for this? Personally, I think
> the DSv2 change would be the major backward incompatibility that makes
> Spark 2.x users hesitate to upgrade, so they might already be prepared to
> migrate to Spark 3.0 if they are prepared to migrate to the new DSv2.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apache.org/versioning-policy.html ).
>>
>> > We just won’t add any breaking changes before 3.1.
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I don’t think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow
>>>
>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>> problems with it.
>>>
>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>> progress towards more stable, but will have to break certain APIs) and
>>> 2.x seems like a false premise.
>>>
>>> Why do you think we will need to break certain APIs before 3.0?
>>>
>>> I’m only suggesting that we release the same support in a 2.5 release
>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>> seems like we can certainly do that. We just won’t add any breaking
>>> changes before 3.1.
>>>
>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to
>>>> break certain APIs) and 2.x seems like a false premise.
>>>>
>>>> To point out some problems with InternalRow that you think are already
>>>> pragmatic and stable:
>>>>
>>>> The class is in catalyst, which states:
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>
>>>> /**
>>>>  * Catalyst is a library for manipulating relational query plans. All
>>>>  * classes in catalyst are considered an internal API to Spark SQL and
>>>>  * are subject to change between minor releases.
>>>>  */
>>>>
>>>> There is not even any annotation on the interface.
>>>>
>>>> The entire dependency chain was created to be private, and tightly
>>>> coupled with internal implementations. For example,
>>>>
>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>
>>>> /**
>>>>  * A UTF-8 String for internal Spark use.
>>>>  * <p>
>>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>>  * comparison, search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>  * <p>
>>>>  * Note: This is not designed for general use cases, should not be used
>>>>  * outside SQL.
>>>>  */
>>>>
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>
>>>> (which again is in catalyst package)
>>>>
>>>> If you want to argue this way, you might as well argue we should make
>>>> the entire catalyst package public to be pragmatic and not allow any
>>>> changes.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> When you created the PR to make InternalRow public
>>>>>
>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>> while both you and I did some work to avoid this, we are still in the
>>>>> phase of starting with that API.
>>>>>
>>>>> Note that any change to InternalRow would be very costly to implement
>>>>> because this interface is widely used. That is why I think we can
>>>>> certainly consider it stable enough to use here, and that’s probably
>>>>> why UnsafeRow was part of the original proposal.
>>>>>
>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5
>>>>> release with it is also reasonable.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> To push back, while I agree we should not drastically change
>>>>> "InternalRow", there are a lot of changes that need to happen to make
>>>>> it stable. For example, none of the publicly exposed interfaces should
>>>>> be in the Catalyst package or the unsafe package.
>>>>> External implementations should be decoupled from the internal
>>>>> implementations, with cheap ways to convert back and forth.
>>>>>
>>>>> When you created the PR to make InternalRow public, the understanding
>>>>> was to work towards making it stable in the future, assuming we will
>>>>> start with an unstable API temporarily. You can't just make a bunch of
>>>>> internal APIs tightly coupled with other internal pieces public and
>>>>> stable and call it a day, just because it happens to satisfy some use
>>>>> cases temporarily assuming the rest of Spark doesn't change.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> > DSv2 is far from stable right?
>>>>>
>>>>> No, I think it is reasonably stable and very close to being ready for
>>>>> a release.
>>>>>
>>>>> > All the actual data types are unstable and you guys have completely
>>>>> > ignored that.
>>>>>
>>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>>> stable API and there has been no work to avoid using it. In any case, I
>>>>> don't think that anyone is suggesting that we delay 3.0 until a
>>>>> replacement for `InternalRow` is added, right?
>>>>>
>>>>> While I understand the motivation for a better solution here, I think
>>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>>
>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>>> > invasive of a change to backport once you consider the parts needed
>>>>> > to make dsv2 stable.
>>>>>
>>>>> I believe that those of us working on DSv2 are confident about the
>>>>> current stability. We set goals for what to get into the 3.0 release
>>>>> months ago and have very nearly reached the point where we are ready
>>>>> for that release.
>>>>>
>>>>> I don't think instability would be a problem in maintaining
>>>>> compatibility between the 2.5 version and the 3.0 version.
>>>>> If we find that we need to make API changes (other than additions) then
>>>>> we can make those in the 3.1 release. Because the goals we set for the
>>>>> 3.0 release have been reached with the current API, if we are ready to
>>>>> release 3.0, we can release a 2.5 with the same API.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>>> and you guys have completely ignored that. We'd need to work on that
>>>>> and that will be a breaking change. If the goal is to make DSv2 work
>>>>> across 3.x and 2.x, that seems too invasive of a change to backport
>>>>> once you consider the parts needed to make dsv2 stable.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11
>>>>> support added.
>>>>>
>>>>> A Spark 2.5 release with these two additions will help people migrate
>>>>> to Spark 3.0 when it is released because they will be able to use a
>>>>> single implementation for DSv2 sources that works in both 2.5 and 3.0.
>>>>> Similarly, upgrading to 3.0 won't also require updating to Java 11
>>>>> because users could update to Java 11 with the 2.5 release and have
>>>>> fewer major changes.
>>>>>
>>>>> Another reason to consider a 2.5 release is that many people are
>>>>> interested in a release with the latest DSv2 API and support for DSv2
>>>>> SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>> line, so it makes sense to share this work with the community.
>>>>>
>>>>> This release line would just consist of backports like DSv2 and Java
>>>>> 11 that assist compatibility, to keep the scope of the release small.
>>>>> The purpose is to assist people moving to 3.0 and not distract from the
>>>>> 3.0 release.
>>>>>
>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>>> about this plan?
>>>>>
>>>>> rb
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior

--
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior