Re: [DISCUSS] Spark 2.5 release

2019-09-25 Thread JOAQUIN GUANTER GONZALBEZ
> That's not a new requirement, that's an "implicit" requirement via semantic versioning. The expectation is that the DSv2 API will change in minor versions in the 2.x line. T

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
> That's not a new requirement, that's an "implicit" requirement via semantic versioning. The expectation is that the DSv2 API will change in minor versions in the 2.x line. The API is marked with the Experimental API annotation to signal that it can change, and it has been changing. A
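For context, a minimal sketch of what being "marked with the Experimental API annotation" looks like, assuming Spark's org.apache.spark.annotation.Experimental marker (Spark also uses related markers such as Evolving); the trait below is hypothetical and not an actual DSv2 interface:

    import org.apache.spark.annotation.Experimental

    // The annotation documents that this interface may change or be removed in
    // minor releases, so semantic versioning alone does not freeze its shape.
    @Experimental
    trait ExampleEvolvingSource {
      def shortName(): String
    }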

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Jungtaek Lim
>> Apache Spark 2.4.x and 2.5.x DSv2 should be compatible. > This has not been a requirement for DSv2 development so far. If this is a new requirement, then we should not do a 2.5 release. My 2 cents: the target version of the new DSv2 has only been 3.0, so we never had a chance to think about such

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
From those questions, I can see that there is significant confusion about what I'm proposing, so let me try to clear it up. > 1. Is DSv2 stable in `master`? DSv2 has reached a stable API that is capable of supporting all of the features we intend to deliver for Spark 3.0. The proposal is to

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Dongjoon Hyun
Hi, Ryan. This thread has many replies, as you can see. That is evidence that the community is very interested in your suggestion. > I'm offering to help build a stable release without breaking changes. But if there is no community interest in it, I'm happy to drop this. In this thread, the

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Holden Karau
I would personally love to see us provide a gentle migration path to Spark 3, especially if much of the work is already going to happen anyway. Maybe giving it a different name (e.g. something like Spark-2-to-3-transitional) would make its intended purpose clearer and encourage folks to

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Ryan Blue
My understanding is that 3.0-preview is not going to be a production-ready release. For those of us that have been using backports of DSv2 in production, that doesn't help. It also doesn't help as a stepping stone because users would need to handle all of the incompatible changes in 3.0. Using

Re: [DISCUSS] Spark 2.5 release

2019-09-22 Thread Hyukjin Kwon
+1 for Matei's as well. On Sun, 22 Sep 2019, 14:59 Marco Gaido wrote: > I agree with Matei too. > Thanks, > Marco > On Sun, Sep 22, 2019 at 03:44 Dongjoon Hyun <dongjoon.h...@gmail.com> wrote: >> +1 for Matei's suggestion! >> Bests, >> Dongjoon. >> On Sat, Sep

Re: [DISCUSS] Spark 2.5 release

2019-09-22 Thread Marco Gaido
I agree with Matei too. Thanks, Marco On Sun, Sep 22, 2019 at 03:44 Dongjoon Hyun <dongjoon.h...@gmail.com> wrote: > +1 for Matei's suggestion! > Bests, > Dongjoon. > On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia wrote: >> If the goal is to get people to try the DSv2

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Dongjoon Hyun
+1 for Matei's suggestion! Bests, Dongjoon. On Sat, Sep 21, 2019 at 5:44 PM Matei Zaharia wrote: > If the goal is to get people to try the DSv2 API and build DSv2 data sources, can we recommend the 3.0-preview release for this? That would get people shifting to 3.0 faster, which is

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Matei Zaharia
If the goal is to get people to try the DSv2 API and build DSv2 data sources, can we recommend the 3.0-preview release for this? That would get people shifting to 3.0 faster, which is probably better overall compared to maintaining two major versions. There’s not that much else changing in 3.0

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> If you insist we shouldn't change the unstable temporary API in 3.x . . . Not what I'm saying at all. I said we should carefully consider whether a breaking change is the right decision in the 3.x line. All I'm suggesting is that we can make a 2.5 release with the feature and an API that is

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
Because, for example, we'd need to move the location of InternalRow, breaking the package name. If you insist we shouldn't change the unstable temporary API in 3.x to maintain compatibility with 3.0, which is totally different from my understanding of the situation when you exposed it, then I'd

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Why would that require an incompatible change? We *could* make an incompatible change and remove support for InternalRow, but I think we would want to carefully consider whether that is the right decision. And in any case, we would be able to keep 2.5 and 3.0 compatible, which is the main goal.

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Reynold Xin
How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change. On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote: > > Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
> Making downstream projects diverge their implementations heavily between minor versions (say, 2.4 vs 2.5) wouldn't be a good experience. You're right that the API has been evolving in the 2.x line. But it is now reasonably stable with respect to the current feature set, and we should not need to break

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
Thanks for pointing this out, Dongjoon. To clarify, I’m not suggesting that we can break compatibility. I’m suggesting that we make a 2.5 release that uses the same DSv2 API as 3.0. These APIs are marked unstable, so we could make changes to them if we needed — as we have done in the 2.x line —

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Xiao Li
+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1? After the release of 3.0, I believe we will get more feedback about DSv2 from the community. The current design was made by just a small group of contributors. DSv2 + catalog APIs are still evolving. It is very likely we will

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Jungtaek Lim
Just 2 cents: I haven't tracked the change of DSv2 (though I needed to deal with this as the change made confusion on my PRs...), but my bet is that DSv2 has already changed in an incompatible way, at least for those who work on a custom DataSource. Making downstream projects diverge their implementations

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Jungtaek Lim
Small correction: confusion -> conflict, so I had to go through and understand parts of the changes. On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim wrote: > Just 2 cents: I haven't tracked the change of DSv2 (though I needed to deal with this as the change made confusion on my PRs...), but my

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Dongjoon Hyun
Do you mean you want to have a breaking API change between 3.0 and 3.1? I believe we follow Semantic Versioning (https://spark.apache.org/versioning-policy.html). > We just won’t add any breaking changes before 3.1. Bests, Dongjoon. On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue wrote: > I

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow. Sounds like we agree, then. We will use it for 3.0, but there are known problems with it. > Thinking we’d have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
I don't think we need to gate a 3.0 release on making a more stable version of InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change and progress towards more stable, but will have to break certain APIs) and 2.x seems like a false premise. To point out some problems

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> When you created the PR to make InternalRow public ... This isn’t quite accurate. The change I made was to use InternalRow instead of UnsafeRow, which is a specific implementation of InternalRow. Exposing this API has always been a part of DSv2, and while both you and I did some work to avoid this, we
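To make the InternalRow-versus-UnsafeRow point concrete, here is a rough sketch of a DSv2 partition reader that hands Spark the InternalRow interface, using the Spark 3.0-era connector package names as an assumption (the exact packages were still moving while this thread was active); the reader class and its data are hypothetical:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.read.PartitionReader
    import org.apache.spark.unsafe.types.UTF8String

    // Hypothetical reader over (id: Int, name: String) pairs. It returns the
    // InternalRow interface; UnsafeRow is just one internal implementation and
    // never appears in the source's API surface.
    class ExampleRowReader(values: Seq[(Int, String)])
        extends PartitionReader[InternalRow] {
      private val iter = values.iterator
      private var current: InternalRow = _

      override def next(): Boolean = {
        if (iter.hasNext) {
          val (id, name) = iter.next()
          // InternalRow.apply builds a generic row; string columns are UTF8String.
          current = InternalRow(id, UTF8String.fromString(name))
          true
        } else {
          false
        }
      }

      override def get(): InternalRow = current

      override def close(): Unit = ()
    }

Whether such rows end up backed by UnsafeRow at runtime is an engine detail the implementer never has to see.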

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Sean Owen
I don't know enough about DSv2 to comment on this part, but any theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it as much as possible help? I say that because, re: Java 11, the main breaking change is probably the Hive 2 / Hadoop 3 dependency, JPMML (minor), as well

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
To push back, while I agree we should not drastically change "InternalRow", there are a lot of changes that need to happen to make it stable. For example, none of the publicly exposed interfaces should be in the Catalyst package or the unsafe package. External implementations should be
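As a rough illustration of the packaging concern (paths as of the 2.4/3.0-era code, so treat them as an assumption), an external DSv2 implementation currently has to import types from nominally internal namespaces:

    // Neither of these lives under a public connector/sources namespace:
    import org.apache.spark.sql.catalyst.InternalRow   // catalyst package
    import org.apache.spark.unsafe.types.UTF8String    // unsafe package

Relocating them under a public package would make the contract clearer for implementers, but, as noted elsewhere in the thread, moving InternalRow means breaking the package name, which is exactly the tension being debated.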

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
I didn't realize that Java 11 would require breaking changes. What breaking changes are required? On Fri, Sep 20, 2019 at 11:18 AM Sean Owen wrote: > Narrowly on Java 11: the problem is that it'll take some breaking changes, more than would usually be appropriate in a minor release, I

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
> DSv2 is far from stable right? No, I think it is reasonably stable and very close to being ready for a release. > All the actual data types are unstable and you guys have completely ignored that. I think what you're referring to is the use of `InternalRow`. That's a stable API and there has

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Sean Owen
Narrowly on Java 11: the problem is that it'll take some breaking changes, more than would usually be appropriate in a minor release, I think. I'm still not convinced there is a burning need to use Java 11 while staying on 2.4 after 3.0 is out, and at least the wheels are in motion there. Java 8 is

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
DSv2 is far from stable right? All the actual data types are unstable and you guys have completely ignored that. We'd need to work on that and that will be a breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that seems too invasive of a change to backport once you consider

[DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
Hi everyone, In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added. A Spark 2.5 release with these two additions will help people migrate to Spark 3.0 when it is released because they will be able to use a