Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Dongjoon Hyun
+1 for Matei's suggestion!

Bests,
Dongjoon.

On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia wrote:

> If the goal is to get people to try the DSv2 API and build DSv2 data
> sources, can we recommend the 3.0-preview release for this? That would get
> people shifting to 3.0 faster, which is probably better overall compared to
> maintaining two major versions. There’s not that much else changing in 3.0
> if you already want to update your Java version.

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Matei Zaharia
If the goal is to get people to try the DSv2 API and build DSv2 data sources, 
can we recommend the 3.0-preview release for this? That would get people 
shifting to 3.0 faster, which is probably better overall compared to 
maintaining two major versions. There’s not that much else changing in 3.0 if 
you already want to update your Java version.

> On Sep 21, 2019, at 2:45 PM, Ryan Blue wrote:
> 
> > If you insist we shouldn't change the unstable temporary API in 3.x . . .
> 
> Not what I'm saying at all. I said we should carefully consider whether a 
> breaking change is the right decision in the 3.x line.
> 
> All I'm suggesting is that we can make a 2.5 release with the feature and an 
> API that is the same as the one in 3.0.
> 
> > I also don't get this backporting a giant feature to 2.x line
> 
> I am planning to do this so we can use DSv2 before 3.0 is released. Then we 
> can have a source implementation that works in both 2.x and 3.0 to make the 
> transition easier. Since I'm already doing the work, I'm offering to share it 
> with the community.

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> If you insist we shouldn't change the unstable temporary API in 3.x . . .

Not what I'm saying at all. I said we should carefully consider whether a
breaking change is the right decision in the 3.x line.

All I'm suggesting is that we can make a 2.5 release with the feature and
an API that is the same as the one in 3.0.

> I also don't get this backporting a giant feature to 2.x line

I am planning to do this so we can use DSv2 before 3.0 is released. Then we
can have a source implementation that works in both 2.x and 3.0 to make the
transition easier. Since I'm already doing the work, I'm offering to share
it with the community.


On Sat, Sep 21, 2019 at 2:36 PM Reynold Xin wrote:

> Because for example we'd need to move the location of InternalRow,
> breaking the package name. If you insist we shouldn't change the unstable
> temporary API in 3.x to maintain compatibility with 3.0, which is totally
> different from my understanding of the situation when you exposed it, then
> I'd say we should gate 3.0 on having a stable row interface.
>
> I also don't get this backporting a giant feature to 2.x line ... as
> suggested by others in the thread, DSv2 would be one of the main reasons
> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
> Why not abandon 3.0 entirely and backport all the features to 2.x?

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
Because for example we'd need to move the location of InternalRow, breaking the 
package name. If you insist we shouldn't change the unstable temporary API in 
3.x to maintain compatibility with 3.0, which is totally different from my 
understanding of the situation when you exposed it, then I'd say we should gate 
3.0 on having a stable row interface.

I also don't get this backporting a giant feature to 2.x line ... as suggested 
by others in the thread, DSv2 would be one of the main reasons people upgrade 
to 3.0. What's so special about DSv2 that we are doing this? Why not abandon
3.0 entirely and backport all the features to 2.x?

On Sat, Sep 21, 2019 at 2:31 PM, Ryan Blue < rb...@netflix.com > wrote:

> Why would that require an incompatible change?
>
> We *could* make an incompatible change and remove support for InternalRow,
> but I think we would want to carefully consider whether that is the right
> decision. And in any case, we would be able to keep 2.5 and 3.0
> compatible, which is the main goal.

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Why would that require an incompatible change?

We *could* make an incompatible change and remove support for InternalRow,
but I think we would want to carefully consider whether that is the right
decision. And in any case, we would be able to keep 2.5 and 3.0 compatible,
which is the main goal.

On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin wrote:

> How would you not make incompatible changes in 3.x? As discussed, the
> InternalRow API is not stable and needs to change.

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
How would you not make incompatible changes in 3.x? As discussed, the
InternalRow API is not stable and needs to change.

On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote:

> > Making downstream to diverge their implementation heavily between minor
> > versions (say, 2.4 vs 2.5) wouldn't be a good experience
>
> You're right that the API has been evolving in the 2.x line. But, it is
> now reasonably stable with respect to the current feature set and we should
> not need to break compatibility in the 3.x line. Because we have reached
> our goals for the 3.0 release, we can backport at least those features to
> 2.x and confidently have an API that works in both a 2.x release and is
> compatible with 3.0, if not 3.1 and later releases as well.
>
> > I'd rather say preparation of Spark 2.5 should be started after Spark
> > 3.0 is officially released
>
> The reason I'm suggesting this is that I'm already going to do the work to
> backport the 3.0 release features to 2.4. I've been asked by several people
> when DSv2 will be released, so I know there is a lot of interest in making
> this available sooner than 3.0. If I'm already doing the work, then I'd be
> happy to share that with the community.
>
> I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
> while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
> about complete so we can easily release the same set of features and API in
> 2.5 and 3.0.
>
> If we decide for some reason to wait until after 3.0 is released, I don't
> know that there is much value in a 2.5. The purpose is to be a step toward
> 3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
> wouldn't get these features out any sooner than 3.0, as a 2.5 release
> probably would, given the work needed to validate the incompatible changes
> in 3.0.
>
> > DSv2 change would be the major backward incompatibility which Spark 2.x
> > users may hesitate to upgrade
>
> As I pointed out, DSv2 has been changing in the 2.x line, so this is
> expected. I don't think it will need incompatible changes in the 3.x line.
>
> On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim wrote:
>
>> Just my 2 cents: I haven't tracked the changes to DSv2 (though I needed to
>> deal with them, as the changes caused confusion on my PRs...), but my bet is
>> that DSv2 has already changed in incompatible ways, at least for those who
>> work on a custom DataSource. Making downstream to diverge their
>> implementation heavily between minor versions (say, 2.4 vs 2.5) wouldn't be
>> a good experience - especially since we have not completely closed off the
>> chance to further modify DSv2, and such a change could be backward
>> incompatible.
>>
>> If we really want to bring the DSv2 change to the 2.x version line so that
>> end users are not forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd
>> rather say preparation of Spark 2.5 should be started after Spark 3.0 is
>> officially released, honestly even later than that, say, after getting some
>> reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope we
>> don't make Spark 2.5 a kind of "tech-preview" that frustrates Spark 2.4
>> users who upgrade to the next minor version.
>>
>> Btw, do we have any specific target users for this? Personally DSv2
>> change would be the major backward incompatibility which Spark 2.x users
>> may hesitate to upgrade, so they might already be prepared to migrate to
>> Spark 3.0 if they are prepared to migrate to the new DSv2.
>>
>> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun wrote:
>>
>>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>>> I believe we follow Semantic Versioning (
>>> https://spark.apache.org/versioning-policy.html ).
>>>
>>> > We just won’t add any breaking changes before 3.1.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue wrote:
>>>
>>>> > I don’t think we need to gate a 3.0 release on making a more stable
>>>> > version of InternalRow
>>>>
>>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>>> problems with it.
>>>>
>>>> > Thinking we’d have dsv2 working in both 3.x (which will change and
>>>> > progress towards more stable, but will have to break certain APIs) and 2.x
>>>> > seems like a false premise.
>>>>
>>>> Why do you think we will need to break certain APIs before 3.0?
>>>>
>>>> I’m only suggesting that we release the same support in a 2.5 release
>>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>>> seems like we can certainly do that. We just won’t add any breaking changes
>>>> before 3.1.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin wrote:
>>>>
>>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>>> (which will change and progress towards more stable, but will have to
>>>>> break certain APIs) and 2.x seems like a false premise.
>>>>>
>>>>> To point out some problems

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> Making downstream to diverge their implementation heavily between minor
> versions (say, 2.4 vs 2.5) wouldn't be a good experience

You're right that the API has been evolving in the 2.x line. But, it is now
reasonably stable with respect to the current feature set and we should not
need to break compatibility in the 3.x line. Because we have reached our
goals for the 3.0 release, we can backport at least those features to 2.x
and confidently have an API that works in both a 2.x release and is
compatible with 3.0, if not 3.1 and later releases as well.

> I'd rather say preparation of Spark 2.5 should be started after Spark 3.0
> is officially released

The reason I'm suggesting this is that I'm already going to do the work to
backport the 3.0 release features to 2.4. I've been asked by several people
when DSv2 will be released, so I know there is a lot of interest in making
this available sooner than 3.0. If I'm already doing the work, then I'd be
happy to share that with the community.

I don't see why 2.5 and 3.0 are mutually exclusive. We can work on 2.5
while preparing the 3.0 preview and fixing bugs. For DSv2, the work is
about complete so we can easily release the same set of features and API in
2.5 and 3.0.

If we decide for some reason to wait until after 3.0 is released, I don't
know that there is much value in a 2.5. The purpose is to be a step toward
3.0, and releasing that step after 3.0 doesn't seem helpful to me. It also
wouldn't get these features out any sooner than 3.0, as a 2.5 release
probably would, given the work needed to validate the incompatible changes
in 3.0.

> DSv2 change would be the major backward incompatibility which Spark 2.x
> users may hesitate to upgrade

As I pointed out, DSv2 has been changing in the 2.x line, so this is
expected. I don't think it will need incompatible changes in the 3.x line.

On Fri, Sep 20, 2019 at 9:25 PM Jungtaek Lim  wrote:

> Just my 2 cents: I haven't tracked the changes to DSv2 (though I had to
> deal with them, as the changes caused confusion on my PRs...), but my bet
> is that DSv2 has already changed in incompatible ways, at least for those
> who work on custom DataSources. Making downstream projects diverge their
> implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't
> be a good experience, especially as we have not completely ruled out
> further modifications to DSv2, and those changes could be backward
> incompatible.
>
> If we really want to bring the DSv2 change to the 2.x line so that end
> users are not forced to upgrade to Spark 3.x to enjoy the new DSv2, I'd
> rather say preparation of Spark 2.5 should be started after Spark 3.0 is
> officially released, honestly even later than that, say, after getting
> some reports from Spark 3.0 about DSv2 so that we feel DSv2 is OK. I hope
> we don't make Spark 2.5 a kind of "tech-preview" that Spark 2.4 users would
> be frustrated to upgrade to as their next minor version.
>
> Btw, do we have any specific target users for this? Personally, I think
> the DSv2 change would be the major backward incompatibility that may make
> Spark 2.x users hesitate to upgrade, so they might already be prepared to
> migrate to Spark 3.0 if they are prepared to migrate to the new DSv2.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun 
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apache.org/versioning-policy.html ).
>>
>> > We just won’t add any breaking changes before 3.1.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
>> wrote:
>>
>>> > I don’t think we need to gate a 3.0 release on making a more stable
>>> > version of InternalRow
>>>
>>> Sounds like we agree, then. We will use it for 3.0, but there are known
>>> problems with it.
>>>
>>> > Thinking we’d have dsv2 working in both 3.x (which will change and
>>> > progress towards more stable, but will have to break certain APIs) and 2.x
>>> > seems like a false premise.
>>>
>>> Why do you think we will need to break certain APIs before 3.0?
>>>
>>> I’m only suggesting that we release the same support in a 2.5 release
>>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>>> seems like we can certainly do that. We just won’t add any breaking changes
>>> before 3.1.
>>>
>>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin 
>>> wrote:
>>>
>>>> I don't think we need to gate a 3.0 release on making a more stable
>>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>>> (which will change and progress towards more stable, but will have to break
>>>> certain APIs) and 2.x seems like a false premise.
>>>>
>>>> To point out some problems with InternalRow that you think are already
>>>> pragmatic and stable:
>>>>
>>>> The class is in catalyst, which states:
>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>>
>>>> /**
>>>>  * Catalyst is a library for manipulating relational

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Thanks for pointing this out, Dongjoon.

To clarify, I’m not suggesting that we can break compatibility. I’m
suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0.

These APIs are marked unstable, so we could make changes to them if we
needed — as we have done in the 2.x line — but I don’t see a reason why we
would break compatibility in the 3.x line.

On Fri, Sep 20, 2019 at 8:46 PM Dongjoon Hyun 
wrote:

> Do you mean you want to have a breaking API change between 3.0 and 3.1?
> I believe we follow Semantic Versioning (
> https://spark.apache.org/versioning-policy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue 
> wrote:
>
>> > I don’t think we need to gate a 3.0 release on making a more stable
>> > version of InternalRow
>>
>> Sounds like we agree, then. We will use it for 3.0, but there are known
>> problems with it.
>>
>> > Thinking we’d have dsv2 working in both 3.x (which will change and
>> > progress towards more stable, but will have to break certain APIs) and 2.x
>> > seems like a false premise.
>>
>> Why do you think we will need to break certain APIs before 3.0?
>>
>> I’m only suggesting that we release the same support in a 2.5 release
>> that we do in 3.0. Since we are nearly finished with the 3.0 goals, it
>> seems like we can certainly do that. We just won’t add any breaking changes
>> before 3.1.
>>
>> On Fri, Sep 20, 2019 at 11:39 AM Reynold Xin  wrote:
>>
>>> I don't think we need to gate a 3.0 release on making a more stable
>>> version of InternalRow, but thinking we'd have dsv2 working in both 3.x
>>> (which will change and progress towards more stable, but will have to break
>>> certain APIs) and 2.x seems like a false premise.
>>>
>>> To point out some problems with InternalRow that you think are already
>>> pragmatic and stable:
>>>
>>> The class is in catalyst, which states:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala
>>>
>>> /**
>>>  * Catalyst is a library for manipulating relational query plans.  All classes in catalyst are
>>>  * considered an internal API to Spark SQL and are subject to change between minor releases.
>>>  */
>>>
>>> There is not even any annotation on the interface.
>>>
>>> The entire dependency chain was created to be private, and is tightly
>>> coupled with internal implementations. For example,
>>>
>>>
>>> https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
>>>
>>> /**
>>>  * A UTF-8 String for internal Spark use.
>>>  *
>>>  * A String encoded in UTF-8 as an Array[Byte], which can be used for comparison,
>>>  * search, see http://en.wikipedia.org/wiki/UTF-8 for details.
>>>  *
>>>  * Note: This is not designed for general use cases, should not be used outside SQL.
>>>  */
>>>
>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala
>>>
>>> (which again is in catalyst package)
>>>
>>>
>>> If you want to argue this way, you might as well argue we should make
>>> the entire catalyst package public to be pragmatic and not allow any
>>> changes.
>>>
>>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue  wrote:
>>>
>>>> > When you created the PR to make InternalRow public
>>>>
>>>> This isn’t quite accurate. The change I made was to use InternalRow
>>>> instead of UnsafeRow, which is a specific implementation of InternalRow.
>>>> Exposing this API has always been a part of DSv2 and while both you and I
>>>> did some work to avoid this, we are still in the phase of starting with
>>>> that API.
>>>>
>>>> Note that any change to InternalRow would be very costly to implement
>>>> because this interface is widely used. That is why I think we can certainly
>>>> consider it stable enough to use here, and that’s probably why
>>>> UnsafeRow was part of the original proposal.
>>>>
>>>> In any case, the goal for 3.0 was not to replace the use of InternalRow,
>>>> it was to get the majority of SQL working on top of the interface added
>>>> after 2.4. That’s done and stable, so I think a 2.5 release with it is also
>>>> reasonable.
>>>>
>>>> On Fri, Sep 20, 2019 at 11:23 AM Reynold Xin
>>>> wrote:

>>>>> To push back, while I agree we should not drastically change
>>>>> "InternalRow", there are a lot of changes that need to happen to make it
>>>>> stable. For example, none of the publicly exposed interfaces should be in
>>>>> the Catalyst package or the unsafe package. External implementations should
>>>>> be decoupled from the internal implementations, with cheap ways to convert
>>>>> back and forth.
>>>>>
>>>>> When you created the PR to make InternalRow public, the understanding
>>>>> was to work towards making it stable in the future, assuming we will start
>>>>> with an unstable API temporarily. You can't just make a bunch of internal APIs
>>>>> tightly