Re: [DISCUSS] Spark 2.5 release

Jungtaek Lim Fri, 20 Sep 2019 21:27:52 -0700

small correction: confusion -> conflict, so I had to go through and
understand parts of the changes


On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim <[email protected]> wrote:

> Just my 2 cents: I haven't tracked the changes to DSv2 (though I needed to
> deal with them, as the changes caused confusion on my PRs...), but my bet is
> that DSv2 has already changed in incompatible ways, at least for those who
> work on custom DataSources. Making downstream projects diverge their
> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
> a good experience, especially since we have not completely ruled out further
> modifications to DSv2, and those changes could be backward incompatible.
>
> If we really want to bring the DSv2 changes to the 2.x version line so that
> end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd
> rather say preparation of Spark 2.5 should start after Spark 3.0 is officially
> released, honestly even later than that, say, after getting some reports from
> Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we don't make Spark
> 2.5 a kind of "tech preview" that Spark 2.4 users would be frustrated to
> upgrade to as their next minor version.
>
> Btw, do we have any specific target users for this? Personally, I think the
> DSv2 change would be the major backward incompatibility that makes Spark 2.x
> users hesitate to upgrade, so users who are prepared to migrate to the new
> DSv2 might already be prepared to migrate to Spark 3.0.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <[email protected]>
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apache.org/versioning-policy.html ).
>>
>> > We just won’t add any breaking changes before 3.1.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <[email protected]>
>> wrote:
>>
>>> > I don’t think we need to gate a 3.0 release on making a more stable
>>> > version of InternalRow
>>>
>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>> problems with it.
>>>
>>> > Thinking we’d have dsv2 working in both 3.x (which will change and
>>> > progress towards more stable, but will have to break certain APIs) and 2.x
>>> > seems like a false premise.
>>>
>>> Why do you think we will need to break certain APIs before 3.0?
>>>
>>> I’m only suggesting that we release the same support in a 2.5 release
>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>> seems like we can certainly do that. We just won’t add any breaking changes
>>> before 3.1.
>>>
>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <[email protected]>
>>> wrote:
>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to break
>>>> certain APIs) and 2.x seems like a false premise.
>>>>
>>>> To point out some problems with InternalRow, which you think is already
>>>> pragmatic and stable:
>>>>
>>>> The class is in catalyst, which states:
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>
>>>> /**
>>>> * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
>>>> * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>> */
>>>>
>>>> There is not even any annotation on the interface.
>>>>
>>>> The entire dependency chain was created to be private, and tightly
>>>> coupled with internal implementations. For example,
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>
>>>> /**
>>>> * A UTF-8 String for internal Spark use.
>>>> * <p>
>>>> * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>> * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>> * <p>
>>>> * Note: This is not designed for general use cases, should not be used outside SQL.
>>>> */
>>>>
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>
>>>> (which again is in catalyst package)
>>>>
>>>>
>>>> If you want to argue this way, you might as well argue we should make
>>>> the entire catalyst package public to be pragmatic and not allow any
>>>> changes.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <[email protected]> wrote:
>>>>
>>>>> > When you created the PR to make InternalRow public
>>>>>
>>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>>> instead of UnsafeRow, which is a specific implementation of
>>>>> InternalRow. Exposing this API has always been a part of DSv2 and
>>>>> while both you and I did some work to avoid this, we are still in the
>>>>> phase of starting with that API.
>>>>>
>>>>> Note that any change to InternalRow would be very costly to implement
>>>>> because this interface is widely used. That is why I think we can certainly
>>>>> consider it stable enough to use here, and that’s probably why
>>>>> UnsafeRow was part of the original proposal.
>>>>>
>>>>> In any case, the goal for 3.0 was not to replace the use of
>>>>> InternalRow, it was to get the majority of SQL working on top of the
>>>>> interface added after 2.4. That’s done and stable, so I think a 2.5
>>>>> release with it is also reasonable.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <[email protected]>
>>>>> wrote:
>>>>>
>>>>> To push back, while I agree we should not drastically change
>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>> back and forth.
>>>>>
>>>>> When you created the PR to make InternalRow public, the understanding
>>>>> was to work towards making it stable in the future, assuming we will start
>>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>>> APIs tightly coupled with other internal pieces public and stable and call
>>>>> it a day, just because it happens to satisfy some use cases temporarily,
>>>>> assuming the rest of Spark doesn't change.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <[email protected]> wrote:
>>>>>
>>>>> > DSv2 is far from stable right?
>>>>>
>>>>> No, I think it is reasonably stable and very close to being ready for
>>>>> a release.
>>>>>
>>>>> > All the actual data types are unstable and you guys have completely
>>>>> ignored that.
>>>>>
>>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>>> stable API and there has been no work to avoid using it. In any case, I
>>>>> don't think that anyone is suggesting that we delay 3.0 until a
>>>>> replacement for `InternalRow` is added, right?
>>>>>
>>>>> While I understand the motivation for a better solution here, I think
>>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>>
>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>>> > invasive of a change to backport once you consider the parts needed to
>>>>> > make dsv2 stable.
>>>>>
>>>>> I believe that those of us working on DSv2 are confident about the
>>>>> current stability. We set goals for what to get into the 3.0 release
>>>>> months ago and have very nearly reached the point where we are ready for
>>>>> that release.
>>>>>
>>>>> I don't think instability would be a problem in maintaining
>>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>>> we need to make API changes (other than additions) then we can make those
>>>>> in the 3.1 release. Because the goals we set for the 3.0 release have been
>>>>> reached with the current API, if we are ready to release 3.0, we can
>>>>> release a 2.5 with the same API.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <[email protected]>
>>>>> wrote:
>>>>>
>>>>> DSv2 is far from stable right? All the actual data types are unstable
>>>>> and you guys have completely ignored that. We'd need to work on that and
>>>>> that will be a breaking change. If the goal is to make DSv2 work across
>>>>> 3.x and 2.x, that seems too invasive of a change to backport once you
>>>>> consider the parts needed to make dsv2 stable.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <[email protected]> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>>> added.
>>>>>
>>>>> A Spark 2.5 release with these two additions will help people migrate
>>>>> to Spark 3.0 when it is released because they will be able to use a single
>>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>>>>
>>>>> Another reason to consider a 2.5 release is that many people are
>>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>>> it makes sense to share this work with the community.
>>>>>
>>>>> This release line would just consist of backports like DSv2 and Java
>>>>> 11 that assist compatibility, to keep the scope of the release small. The
>>>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>>>> release.
>>>>>
>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>>> about this plan?
>>>>>
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior
