+1 on Jungtaek's point. Shall we revisit this when we release Spark 3.1? After the release of 3.0, I believe we will get more feedback about DSv2 from the community. The current design was made by just a small group of contributors, and the DSv2 + catalog APIs are still evolving. It is very likely we will make more changes after the 3.0 release.
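For anyone skimming the thread below, the API surface in question shows up most clearly in the DSv2 read path, where a source hands rows to Spark as InternalRow. Here is a minimal sketch of a partition reader; the source and its (id, name) schema are made up, and the class names are from the DSv2 read API as it is shaping up for 3.0 (these packages have moved during development, so treat them as illustrative):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical reader for a source with schema (id LONG, name STRING).
// Everything it hands back is an internal type: InternalRow and
// GenericInternalRow live in the catalyst package, and string values
// must be UTF8String from the unsafe package. That coupling is exactly
// what the discussion below is about.
class ExamplePartitionReader(rows: Iterator[(Long, String)])
    extends PartitionReader[InternalRow] {

  private var current: InternalRow = _

  // Advance to the next record; return false when the partition is exhausted.
  override def next(): Boolean = {
    if (rows.hasNext) {
      val (id, name) = rows.next()
      current = new GenericInternalRow(Array[Any](id, UTF8String.fromString(name)))
      true
    } else {
      false
    }
  }

  // Return the record produced by the last successful next().
  override def get(): InternalRow = current

  override def close(): Unit = ()
}

Note that with UnsafeRow, as in the original proposal, a source would have had to produce Spark's binary row format directly; InternalRow at least allows a plain generic row like the one above.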
On Fri, Sep 20, 2019 at 9:27 PM Jungtaek Lim <kabh...@gmail.com> wrote:

> Small correction: confusion -> conflict; that is, I had to go through and understand parts of the changes.
>
> On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Just 2 cents: I haven't tracked the changes to DSv2 (though I've had to deal with them, as they caused conflicts on my PRs...), but my bet is that DSv2 has already changed in incompatible ways, at least for anyone who works on a custom DataSource. Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs. 2.5) wouldn't be a good experience, especially since we haven't completely closed off the chance of further modifying DSv2, and such changes could be backward incompatible.
>>
>> If we really want to bring the DSv2 changes to the 2.x line so that end users aren't forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd rather say preparation of Spark 2.5 should start after Spark 3.0 is officially released; honestly, even later than that, say, after getting some reports about DSv2 from Spark 3.0 so that we feel DSv2 is OK. I hope we don't make Spark 2.5 a kind of "tech preview" that frustrates Spark 2.4 users who upgrade to the next minor version.
>>
>> Btw, do we have any specific target users for this? Personally, I see the DSv2 change as the major backward incompatibility that makes Spark 2.x users hesitate to upgrade, so if they are prepared to migrate to the new DSv2, they are probably already prepared to migrate to Spark 3.0.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1? I believe we follow Semantic Versioning (https://spark.apache.org/versioning-policy.html).
>>>
>>> > We just won't add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> > I don't think we need to gate a 3.0 release on making a more stable version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known problems with it.
>>>>
>>>> > Thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I'm only suggesting that we release the same support in a 2.5 release that we do in 3.0. Since we are nearly finished with the 3.0 goals, it seems we can certainly do that. We just won't add any breaking changes before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise.
>>>>>
>>>>> To point out some problems with InternalRow, which you claim is already pragmatic and stable:
>>>>>
>>>>> The class is in catalyst, which states:
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>>
>>>>> /**
>>>>>  * Catalyst is a library for manipulating relational query plans. All classes in catalyst are
>>>>>  * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>>>  */
>>>>>
>>>>> There is not even an annotation on the interface.
>>>>>
>>>>> The entire dependency chain was created to be private and is tightly coupled with internal implementations. For example:
>>>>>
>>>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>>>
>>>>> /**
>>>>>  * A UTF-8 String for internal Spark use.
>>>>>  * <p>
>>>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>>>  * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>>>  * <p>
>>>>>  * Note: This is not designed for general use cases, should not be used outside SQL.
>>>>>  */
>>>>>
>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>>>
>>>>> (which, again, is in the catalyst package)
>>>>>
>>>>> If you want to argue this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> > When you created the PR to make InternalRow public
>>>>>>
>>>>>> This isn't quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2, and while both you and I did some work to avoid this, we are still in the phase of starting with that API.
>>>>>>
>>>>>> Note that any change to InternalRow would be very costly to implement because this interface is widely used. That is why I think we can certainly consider it stable enough to use here, and that's probably why UnsafeRow was part of the original proposal.
>>>>>>
>>>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow; it was to get the majority of SQL working on top of the interface added after 2.4. That's done and stable, so I think a 2.5 release with it is also reasonable.
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com> wrote:
>>>>>>
>>>>>>> To push back: while I agree we should not drastically change InternalRow, there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the catalyst package or the unsafe package. External implementations should be decoupled from the internal implementations, with cheap ways to convert back and forth.
>>>>>>>
>>>>>>> When you created the PR to make InternalRow public, the understanding was that we would work towards making it stable in the future, starting with an unstable API temporarily. You can't just take a bunch of internal APIs that are tightly coupled with other internal pieces, declare them public and stable, and call it a day, just because they happen to satisfy some use cases for now, assuming the rest of Spark doesn't change.
>>>>>>>
>>>>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> > DSv2 is far from stable right?
>>>>>>>>
>>>>>>>> No, I think it is reasonably stable and very close to being ready for a release.
>>>>>>>>
>>>>>>>> > All the actual data types are unstable and you guys have completely ignored that.
>>>>>>>>
>>>>>>>> I think what you're referring to is the use of InternalRow. That's a stable API, and there has been no work to avoid using it. In any case, I don't think anyone is suggesting that we delay 3.0 until a replacement for InternalRow is added, right?
>>>>>>>>
>>>>>>>> While I understand the motivation for a better solution here, I think the pragmatic solution is to continue using InternalRow.
>>>>>>>>
>>>>>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>
>>>>>>>> I believe that those of us working on DSv2 are confident about its current stability. We set goals for what to get into the 3.0 release months ago and have very nearly reached the point where we are ready for that release.
>>>>>>>>
>>>>>>>> I don't think instability would be a problem in maintaining compatibility between the 2.5 version and the 3.0 version. If we find that we need to make API changes (other than additions), we can make those in the 3.1 release. The goals we set for the 3.0 release have been reached with the current API, so if we are ready to release 3.0, we can release a 2.5 with the same API.
>>>>>>>>
>>>>>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> DSv2 is far from stable, right? All the actual data types are unstable, and you guys have completely ignored that. We'd need to work on that, and it will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider the parts needed to make dsv2 stable.
>>>>>>>>>
>>>>>>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>>>>>>>>>
>>>>>>>>>> A Spark 2.5 release with these two additions would help people migrate to Spark 3.0 when it is released, because they would be able to use a single implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly, upgrading to 3.0 wouldn't also require updating to Java 11, because users could move to Java 11 with the 2.5 release and face fewer major changes at once.
>>>>>>>>>>
>>>>>>>>>> Another reason to consider a 2.5 release is that many people are interested in a release with the latest DSv2 API and support for DSv2 SQL. I'm already going to be backporting DSv2 support to the Spark 2.4 line, so it makes sense to share this work with the community.
>>>>>>>>>>
>>>>>>>>>> This release line would consist only of backports, like DSv2 and Java 11, that assist compatibility, to keep the scope of the release small.
>>>>>>>>>> The purpose is to assist people moving to 3.0, not to distract from the 3.0 release.
>>>>>>>>>>
>>>>>>>>>> Would a Spark 2.5 release help anyone else? Are there any concerns about this plan?
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior