> When you created the PR to make InternalRow public

This isn’t quite accurate. The change I made was to use InternalRow instead
of UnsafeRow, which is one specific implementation of InternalRow. Exposing
this API has always been part of DSv2, and while both you and I did some
work to avoid it, we are still at the stage of starting with that API.
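
To make the distinction concrete, here is a minimal sketch of a DSv2
partition reader written against the InternalRow interface rather than the
UnsafeRow implementation. The reader is hypothetical, and the
PartitionReader package name is from the 3.0-era connector API (2.4 used a
different package):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.unsafe.types.UTF8String

// A hypothetical reader that produces a single row. It is typed against
// InternalRow, the interface, so UnsafeRow, GenericInternalRow, or any
// other implementation can flow through it unchanged.
class SingleRowReader extends PartitionReader[InternalRow] {
  private var consumed = false

  // Advance to the next record; report whether one is available.
  override def next(): Boolean = {
    if (consumed) false else { consumed = true; true }
  }

  // InternalRow carries Spark's internal representations, e.g. UTF8String
  // for strings rather than java.lang.String.
  override def get(): InternalRow =
    InternalRow(1L, UTF8String.fromString("example"))

  override def close(): Unit = ()
}
```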

Note that any change to InternalRow would be very costly to implement
because this interface is widely used. That is why I think we can certainly
consider it stable enough to use here, and that’s probably why UnsafeRow
was part of the original proposal.

In any case, the goal for 3.0 was not to replace the use of InternalRow, it
was to get the majority of SQL working on top of the interface added after
2.4. That’s done and stable, so I think a 2.5 release with it is also
reasonable.

On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin <r...@databricks.com> wrote:

> To push back, while I agree we should not drastically change
> "InternalRow", there are a lot of changes that need to happen to make it
> stable. For example, none of the publicly exposed interfaces should be in
> the Catalyst package or the unsafe package. External implementations should
> be decoupled from the internal implementations, with cheap ways to convert
> back and forth.
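
For concreteness on the conversion point above: the round-trip conversions
that exist today live in the Catalyst package. A minimal sketch using the
2.4-era encoder API (3.0 reorganized these methods) and a hypothetical
two-column schema shows the kind of conversion that would need a stable,
public home:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Hypothetical schema for the rows being converted.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// RowEncoder lives in the catalyst package, i.e. an internal API.
val encoder = RowEncoder(schema).resolveAndBind()

// External Row -> internal representation.
val internal: InternalRow = encoder.toRow(Row(1L, "a"))

// Internal representation -> external Row.
val external: Row = encoder.fromRow(internal)
```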
>
> When you created the PR to make InternalRow public, the understanding was
> to work towards making it stable in the future, with the assumption that
> we would start with an unstable API temporarily. You can't just make a
> bunch of internal APIs that are tightly coupled with other internal pieces
> public and stable and call it a day, just because they happen to satisfy
> some use cases temporarily, assuming the rest of Spark doesn't change.
>
>
>
> On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue <rb...@netflix.com> wrote:
>
>> > DSv2 is far from stable right?
>>
>> No, I think it is reasonably stable and very close to being ready for a
>> release.
>>
>> > All the actual data types are unstable and you guys have completely
>> ignored that.
>>
>> I think what you're referring to is the use of `InternalRow`. That's a
>> stable API and there has been no work to avoid using it. In any case, I
>> don't think that anyone is suggesting that we delay 3.0 until a replacement
>> for `InternalRow` is added, right?
>>
>> While I understand the motivation for a better solution here, I think the
>> pragmatic solution is to continue using `InternalRow`.
>>
>> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
>> invasive of a change to backport once you consider the parts needed to make
>> DSv2 stable.
>>
>> I believe that those of us working on DSv2 are confident about the
>> current stability. We set goals for what to get into the 3.0 release months
>> ago and have very nearly reached the point where we are ready for that
>> release.
>>
>> I don't think instability would be a problem in maintaining compatibility
>> between the 2.5 version and the 3.0 version. If we find that we need to
>> make API changes (other than additions), we can make those in the 3.1
>> release. The goals we set for the 3.0 release have been reached with the
>> current API, so if we are ready to release 3.0, we can release a 2.5 with
>> the same API.
>>
>> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin <r...@databricks.com> wrote:
>>
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make DSv2 stable.
>>
>>
>>
>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> Hi everyone,
>>
>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>
>> A Spark 2.5 release with these two additions will help people migrate to
>> Spark 3.0 when it is released, because they will be able to use a single
>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>> upgrading to 3.0 won't also require updating to Java 11, because users
>> could update to Java 11 with the 2.5 release and face fewer major changes
>> at once.
>>
>> Another reason to consider a 2.5 release is that many people are
>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>> it makes sense to share this work with the community.
>>
>> This release line would consist only of backports, like DSv2 and Java 11,
>> that ease compatibility, keeping the scope of the release small. The
>> purpose is to help people move to 3.0, not to distract from the 3.0
>> release.
>>
>> Would a Spark 2.5 release help anyone else? Are there any concerns about
>> this plan?
>>
>>
>> rb
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
