Re: [DISCUSS] Spark 2.5 release

Ryan Blue Mon, 23 Sep 2019 09:17:15 -0700

My understanding is that 3.0-preview is not going to be a production-ready
release. For those of us that have been using backports of DSv2 in
production, that doesn't help.


It also doesn't help as a stepping stone because users would need to handle
all of the incompatible changes in 3.0. Using 3.0-preview would mean adopting
an unstable release with breaking changes instead of a stable release without
the breaking changes.

I'm offering to help build a stable release without breaking changes. But
if there is no community interest in it, I'm happy to drop this.

On Sun, Sep 22, 2019 at 6:39 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> +1 for Matei's as well.
>
> On Sun, 22 Sep 2019, 14:59 Marco Gaido, <marcogaid...@gmail.com> wrote:
>
>> I agree with Matei too.
>>
>> Thanks,
>> Marco
>>
>> On Sun, 22 Sep 2019 at 03:44, Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> +1 for Matei's suggestion!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>
>>>> If the goal is to get people to try the DSv2 API and build DSv2 data
>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>> maintaining two major versions. There’s not that much else changing in 3.0
>>>> if you already want to update your Java version.
>>>>
>>>> On Sep 21, 2019, at 2:45 PM, Ryan Blue <rb...@netflix.com.INVALID>
>>>> wrote:
>>>>
>>>> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
>>>>
>>>> Not what I'm saying at all. I said we should carefully consider whether
>>>> a breaking change is the right decision in the 3.x line.
>>>>
>>>> All I'm suggesting is that we can make a 2.5 release with the feature
>>>> and an API that is the same as the one in 3.0.
>>>>
>>>> > I also don't get this backporting a giant feature to 2.x line
>>>>
>>>> I am planning to do this so we can use DSv2 before 3.0 is released.
>>>> Then we can have a source implementation that works in both 2.x and 3.0 to
>>>> make the transition easier. Since I'm already doing the work, I'm offering
>>>> to share it with the community.
>>>>
>>>>
>>>> On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Because for example we'd need to move the location of InternalRow,
>>>>> breaking the package name. If you insist we shouldn't change the unstable
>>>>> temporary API in 3.x to maintain compatibility with 3.0, which is totally
>>>>> different from my understanding of the situation when you exposed it, then
>>>>> I'd say we should gate 3.0 on having a stable row interface.
>>>>>
>>>>> I also don't get this backporting a giant feature to 2.x line ... as
>>>>> suggested by others in the thread, DSv2 would be one of the main reasons
>>>>> people upgrade to 3.0. What's so special about DSv2 that we are doing
>>>>> this? Why not abandon 3.0 entirely and backport all the features to 2.x?
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> Why would that require an incompatible change?
>>>>>>
>>>>>> We *could* make an incompatible change and remove support for
>>>>>> InternalRow, but I think we would want to carefully consider whether that
>>>>>> is the right decision. And in any case, we would be able to keep 2.5 and
>>>>>> 3.0 compatible, which is the main goal.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> How would you not make incompatible changes in 3.x? As discussed the
>>>>>> InternalRow API is not stable and needs to change.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> > Making downstream to diverge their implementation heavily between
>>>>>> minor versions (say, 2.4 vs 2.5) wouldn't be a good experience
>>>>>>
>>>>>> You're right that the API has been evolving in the 2.x line. But, it
>>>>>> is now reasonably stable with respect to the current feature set and we
>>>>>> should not need to break compatibility in the 3.x line. Because we have
>>>>>> reached our goals for the 3.0 release, we can backport at least those
>>>>>> features to 2.x and confidently have an API that works in a 2.x release
>>>>>> and is compatible with 3.0, if not 3.1 and later releases as well.
>>>>>>
>>>>>> > I'd rather say preparation of Spark 2.5 should be started after
>>>>>> Spark 3.0 is officially released
>>>>>>
>>>>>> The reason I'm suggesting this is that I'm already going to do the
>>>>>> work to backport the 3.0 release features to 2.4. I've been asked by
>>>>>> several people when DSv2 will be released, so I know there is a lot of
>>>>>> interest in making this available sooner than 3.0. If I'm already doing
>>>>>> the work, then I'd be happy to share that with the community.
>>>>>>
>>>>>> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on
>>>>>> 2.5 while preparing the 3.0 preview and fixing bugs. For DSv2, the work
>>>>>> is about complete so we can easily release the same set of features and
>>>>>> API in 2.5 and 3.0.
>>>>>>
>>>>>> If we decide for some reason to wait until after 3.0 is released, I
>>>>>> don't know that there is much value in a 2.5. The purpose is to be a step
>>>>>> toward 3.0, and releasing that step after 3.0 doesn't seem helpful to me.
>>>>>> It also wouldn't get these features out any sooner than 3.0, as a 2.5
>>>>>> release probably would, given the work needed to validate the
>>>>>> incompatible changes in 3.0.
>>>>>>
>>>>>> > DSv2 change would be the major backward incompatibility which Spark
>>>>>> 2.x users may hesitate to upgrade
>>>>>>
>>>>>> As I pointed out, DSv2 has been changing in the 2.x line, so this is
>>>>>> expected. I don't think it will need incompatible changes in the 3.x
>>>>>> line.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim <kabh...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Just my 2 cents: I haven't tracked the changes to DSv2 (though I needed
>>>>>> to deal with them, as the changes caused confusion on my PRs...), but my
>>>>>> bet is that DSv2 has already changed in an incompatible way, at least for
>>>>>> those who work on custom DataSources. Making downstream to diverge their
>>>>>> implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't
>>>>>> be a good experience - especially since we have not completely ruled out
>>>>>> further modifying DSv2, and such a change could be backward incompatible.
>>>>>>
>>>>>> If we really want to bring the DSv2 change to the 2.x version line so
>>>>>> that end users are not forced to upgrade to Spark 3.x to enjoy the new
>>>>>> DSv2, I'd rather say preparation of Spark 2.5 should be started after
>>>>>> Spark 3.0 is officially released, honestly even later than that, say,
>>>>>> after getting some reports from Spark 3.0 about DSv2 so that we feel DSv2
>>>>>> is OK. I hope we don't make Spark 2.5 a kind of "tech-preview" that
>>>>>> frustrates Spark 2.4 users who upgrade to the next minor version.
>>>>>>
>>>>>> Btw, do we have any specific target users for this? Personally DSv2
>>>>>> change would be the major backward incompatibility which Spark 2.x users
>>>>>> may hesitate to upgrade, so they might be already prepared to migrate to
>>>>>> Spark 3.0 if they are prepared to migrate to new DSv2.
>>>>>>
>>>>>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>> Do you mean you want to have a breaking API change between 3.0 and
>>>>>> 3.1?
>>>>>> I believe we follow Semantic Versioning (
>>>>>> https://spark.apache.org/versioning-policy.html ).
>>>>>>
>>>>>> > We just won’t add any breaking changes before 3.1.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>> I don’t think we need to gate a 3.0 release on making a more stable
>>>>>> version of InternalRow
>>>>>>
>>>>>> Sounds like we agree, then. We will use it for 3.0, but there are
>>>>>> known problems with it.
>>>>>>
>>>>>> Thinking we’d have dsv2 working in both 3.x (which will change and
>>>>>> progress towards more stable, but will have to break certain APIs) and
>>>>>> 2.x seems like a false premise.
>>>>>>
>>>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>>>
>>>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>>>> seems like we can certainly do that. We just won’t add any breaking
>>>>>> changes before 3.1.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>>>> (which will change and progress towards more stable, but will have to
>>>>>> break certain APIs) and 2.x seems like a false premise.
>>>>>>
>>>>>> To point out some problems with InternalRow that you think are
>>>>>> already pragmatic and stable:
>>>>>>
>>>>>> The class is in catalyst, which states:
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>>
>>>>>> /**
>>>>>> * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
>>>>>> * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>>>> */
>>>>>>
>>>>>> There is not even any annotation on the interface.
>>>>>>
>>>>>> The entire dependency chain was created to be private, and tightly
>>>>>> coupled with internal implementations. For example,
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>>
>>>>>> /**
>>>>>> * A UTF-8 String for internal Spark use.
>>>>>> * <p>
>>>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>> * <p>
>>>>>> * Note: This is not designed for general use cases, should not be used outside SQL.
>>>>>> */
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>>
>>>>>> (which again is in catalyst package)
>>>>>>
>>>>>>
>>>>>> If you want to argue this way, you might as well argue we should make
>>>>>> the entire catalyst package public to be pragmatic and not allow any
>>>>>> changes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>> When you created the PR to make InternalRow public
>>>>>>
>>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>>> while both you and I did some work to avoid this, we are still in the
>>>>>> phase of starting with that API.
>>>>>>
>>>>>> Note that any change to InternalRow would be very costly to
>>>>>> implement because this interface is widely used. That is why I think we
>>>>>> can certainly consider it stable enough to use here, and that’s probably
>>>>>> why UnsafeRow was part of the original proposal.
>>>>>>
>>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5
>>>>>> release with it is also reasonable.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> To push back, while I agree we should not drastically change
>>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>>> the Catalyst package or the unsafe package. External implementations
>>>>>> should be decoupled from the internal implementations, with cheap ways to
>>>>>> convert back and forth.
>>>>>>
>>>>>> When you created the PR to make InternalRow public, the understanding
>>>>>> was to work towards making it stable in the future, assuming we will
>>>>>> start with an unstable API temporarily. You can't just make a bunch of
>>>>>> internal APIs tightly coupled with other internal pieces public and stable
>>>>>> and call it a day, just because it happens to satisfy some use cases
>>>>>> temporarily assuming the rest of Spark doesn't change.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>> > DSv2 is far from stable right?
>>>>>>
>>>>>> No, I think it is reasonably stable and very close to being ready for
>>>>>> a release.
>>>>>>
>>>>>> > All the actual data types are unstable and you guys have completely
>>>>>> ignored that.
>>>>>>
>>>>>> I think what you're referring to is the use of `InternalRow`. That's
>>>>>> a stable API and there has been no work to avoid using it. In any case, I
>>>>>> don't think that anyone is suggesting that we delay 3.0 until a
>>>>>> replacement for `InternalRow` is added, right?
>>>>>>
>>>>>> While I understand the motivation for a better solution here, I think
>>>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>>>
>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>>>> invasive of a change to backport once you consider the parts needed to
>>>>>> make dsv2 stable.
>>>>>>
>>>>>> I believe that those of us working on DSv2 are confident about the
>>>>>> current stability. We set goals for what to get into the 3.0 release
>>>>>> months ago and have very nearly reached the point where we are ready for
>>>>>> that release.
>>>>>>
>>>>>> I don't think instability would be a problem in maintaining
>>>>>> compatibility between the 2.5 version and the 3.0 version. If we find
>>>>>> that we need to make API changes (other than additions) then we can make
>>>>>> those in the 3.1 release. Because the goals we set for the 3.0 release
>>>>>> have been reached with the current API, if we are ready to release 3.0,
>>>>>> we can release a 2.5 with the same API.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>>>> and you guys have completely ignored that. We'd need to work on that and
>>>>>> that will be a breaking change. If the goal is to make DSv2 work across
>>>>>> 3.x and 2.x, that seems too invasive of a change to backport once you
>>>>>> consider the parts needed to make dsv2 stable.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <
>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>>>> added.
>>>>>>
>>>>>> A Spark 2.5 release with these two additions will help people migrate
>>>>>> to Spark 3.0 when it is released because they will be able to use a
>>>>>> single implementation for DSv2 sources that works in both 2.5 and 3.0.
>>>>>> Similarly, upgrading to 3.0 won't also require updating to Java 11
>>>>>> because users could update to Java 11 with the 2.5 release and have fewer
>>>>>> major changes.
>>>>>>
>>>>>> Another reason to consider a 2.5 release is that many people are
>>>>>> interested in a release with the latest DSv2 API and support for DSv2
>>>>>> SQL. I'm already going to be backporting DSv2 support to the Spark 2.4
>>>>>> line, so it makes sense to share this work with the community.
>>>>>>
>>>>>> This release line would just consist of backports like DSv2 and Java
>>>>>> 11 that assist compatibility, to keep the scope of the release small. The
>>>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>>>> release.
>>>>>>
>>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>>>> about this plan?
>>>>>>
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Name : Jungtaek Lim
>>>>>> Blog : http://medium.com/@heartsavior
>>>>>> Twitter : http://twitter.com/heartsavior
>>>>>> LinkedIn : http://www.linkedin.com/in/heartsavior

-- 
Ryan Blue
Software Engineer
Netflix
