Thanks for pointing this out, Dongjoon. To clarify, I’m not suggesting that we can break compatibility. I’m suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.
These APIs are marked unstable, so we could make changes to them if we needed — as we have done in the 2.x line — but I don’t see a reason why we would break compatibility in the 3.x line.

On Fri, Sep 20, 2019 at 8:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Do you mean you want to have a breaking API change between 3.0 and 3.1?
> I believe we follow Semantic Versioning
> ( https://spark.apache.org/versioning-policy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> I don’t think we need to gate a 3.0 release on making a more stable
>> version of InternalRow
>>
>> Sounds like we agree, then. We will use it for 3.0, but there are known
>> problems with it.
>>
>> Thinking we’d have dsv2 working in both 3.x (which will change and
>> progress towards more stable, but will have to break certain APIs) and 2.x
>> seems like a false premise.
>>
>> Why do you think we will need to break certain APIs before 3.0?
>>
>> I’m only suggesting that we release the same support in a 2.5 release
>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>> seems like we can certainly do that. We just won’t add any breaking changes
>> before 3.1.
>>
>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> I don't think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>> (which will change and progress towards more stable, but will have to break
>>> certain APIs) and 2.x seems like a false premise.
>>>
>>> To point out some problems with InternalRow that you think are already
>>> pragmatic and stable:
>>>
>>> The class is in catalyst, which states:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>
>>> /**
>>>  * Catalyst is a library for manipulating relational query plans. All
>>>  * classes in catalyst are considered an internal API to Spark SQL and
>>>  * are subject to change between minor releases.
>>>  */
>>>
>>> There is not even an annotation on the interface.
>>>
>>> The entire dependency chain was created to be private and is tightly
>>> coupled with internal implementations. For example:
>>>
>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>
>>> /**
>>>  * A UTF-8 String for internal Spark use.
>>>  * <p>
>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for
>>>  * comparison and search; see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>  * <p>
>>>  * Note: This is not designed for general use cases and should not be
>>>  * used outside SQL.
>>>  */
>>>
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>
>>> (which, again, is in the catalyst package)
>>>
>>> If you want to argue this way, you might as well argue that we should
>>> make the entire catalyst package public to be pragmatic and not allow
>>> any changes.
>>>
>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2, and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow;
>>>> it was to get the majority of SQL working on top of the interface added
>>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>>> reasonable.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> To push back, while I agree we should not drastically change
>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>> the Catalyst package or the unsafe package. External implementations should
>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>> back and forth.
>>>>
>>>> When you created the PR to make InternalRow public, the understanding
>>>> was to work towards making it stable in the future, assuming we would start
>>>> with an unstable API temporarily. You can't just make a bunch of internal
>>>> APIs that are tightly coupled with other internal pieces public and stable
>>>> and call it a day, just because it happens to satisfy some use cases
>>>> temporarily, assuming the rest of Spark doesn't change.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> > DSv2 is far from stable right?
>>>>
>>>> No, I think it is reasonably stable and very close to being ready for a
>>>> release.
>>>>
>>>> > All the actual data types are unstable and you guys have completely
>>>> > ignored that.
>>>>
>>>> I think what you're referring to is the use of `InternalRow`. That's a
>>>> stable API and there has been no work to avoid using it.
>>>> In any case, I don't think that anyone is suggesting that we delay 3.0
>>>> until a replacement for `InternalRow` is added, right?
>>>>
>>>> While I understand the motivation for a better solution here, I think
>>>> the pragmatic solution is to continue using `InternalRow`.
>>>>
>>>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>>>> > invasive of a change to backport once you consider the parts needed to
>>>> > make dsv2 stable.
>>>>
>>>> I believe that those of us working on DSv2 are confident about the
>>>> current stability. We set goals for what to get into the 3.0 release months
>>>> ago and have very nearly reached the point where we are ready for that
>>>> release.
>>>>
>>>> I don't think instability would be a problem in maintaining
>>>> compatibility between the 2.5 version and the 3.0 version. If we find that
>>>> we need to make API changes (other than additions), we can make those in
>>>> the 3.1 release. The goals we set for the 3.0 release have been reached
>>>> with the current API, so if we are ready to release 3.0, we can release a
>>>> 2.5 with the same API.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> DSv2 is far from stable, right? All the actual data types are unstable,
>>>> and you guys have completely ignored that. We'd need to work on that, and
>>>> that will be a breaking change. If the goal is to make DSv2 work across 3.x
>>>> and 2.x, that seems too invasive of a change to backport once you consider
>>>> the parts needed to make dsv2 stable.
>>>>
>>>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> In the DSv2 sync this week, we talked about a possible Spark 2.5
>>>> release based on the latest Spark 2.4, but with DSv2 and Java 11 support
>>>> added.
>>>>
>>>> A Spark 2.5 release with these two additions will help people migrate
>>>> to Spark 3.0 when it is released, because they will be able to use a single
>>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>>> upgrading to 3.0 won't also require updating to Java 11: users could move
>>>> to Java 11 with the 2.5 release and so face fewer major changes at once.
>>>>
>>>> Another reason to consider a 2.5 release is that many people are
>>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>>> it makes sense to share this work with the community.
>>>>
>>>> This release line would consist only of backports, like DSv2 and Java 11,
>>>> that assist compatibility, to keep the scope of the release small. The
>>>> purpose is to help people move to 3.0, not to distract from the 3.0
>>>> release.
>>>>
>>>> Would a Spark 2.5 release help anyone else? Are there any concerns
>>>> about this plan?
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
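
[Editor's note] The decoupling Reynold describes in the thread (publicly exposed interfaces kept out of the catalyst/unsafe packages, with external implementations separated from internal ones and cheap conversion back and forth) can be sketched in plain Scala. This is a hypothetical illustration only: the names `PublicRow`, `InternalArrayRow`, and `RowConverters` are invented for the example and are not Spark's actual classes.

```scala
// Hypothetical sketch of the decoupling pattern discussed above: a public
// row interface separated from an internal implementation, with cheap
// conversion in both directions. Not Spark's actual API.

// The public, stable-facing interface that external sources would implement.
trait PublicRow {
  def numFields: Int
  def get(i: Int): Any
}

// An internal implementation, free to change between minor releases
// without affecting external implementors of PublicRow.
final class InternalArrayRow(private val values: Array[Any]) extends PublicRow {
  def numFields: Int = values.length
  def get(i: Int): Any = values(i)
}

object RowConverters {
  // Cheap conversion inward: if the row is already the internal type,
  // reuse it; otherwise copy the fields once into the internal form.
  def toInternal(row: PublicRow): InternalArrayRow = row match {
    case r: InternalArrayRow => r
    case other => new InternalArrayRow(Array.tabulate(other.numFields)(other.get))
  }

  // Conversion outward is free, because the internal type already
  // implements the public interface.
  def toPublic(row: InternalArrayRow): PublicRow = row
}
```

Under this pattern, external code depends only on `PublicRow`, so the internal representation can evolve between minor releases without a breaking change, which is the stability property the thread is debating.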