I’m a bit concerned about what Arun is summarizing.

We are building on DSv2 and already have to rewrite for a bunch of changes in 
master/2.4, which increases the cost of dev work and release management.

If we are saying more changes are coming in 3.0, do we have more info on what 
value the current changes in 2.4 are adding now?



________________________________
From: Wenchen Fan <cloud0...@gmail.com>
Sent: Monday, September 10, 2018 12:35 AM
To: ar...@apache.org
Cc: Ryan Blue; skn...@berkeley.edu; Dongjoon Hyun; joshro...@databricks.com; 
Sean Owen; Spark dev list
Subject: Re: Branch 2.4 is cut

We made a lot of "breaking" changes in 2.4 for data source v2, though I agree 
SPARK-24882 is the most "breaking" one.

I don't agree SPARK-24882 is half-baked. But I'm willing to revert it if we 
have a bunch of data source v2 users and they are not willing to update their 
implementations intensively before the data source v2 API is stabilized.

On Mon, Sep 10, 2018 at 2:55 PM Arun Mahadevan <ar...@apache.org> wrote:
Ryan's proposal makes a lot of sense. It's better not to release half-baked 
changes in 2.4, which not only break a lot of the APIs released in 2.3 but 
are also expected to change further due to redesigns before 3.0, so I don't 
see much value in releasing them in 2.4.

On Sun, 9 Sep 2018 at 22:42, Wenchen Fan <cloud0...@gmail.com> wrote:
Strictly speaking, data source v2 is always half-finished until we mark it as 
stable. We need some small milestones to move forward step by step.

The redesign also happens in an incremental way. SPARK-24882 mostly focuses on 
the "RDD" part of the API: the separation of reader factory and input 
partitions, the introduction of ScanConfig, etc. Then we focus on the 
high-level abstraction and want to change the "table" part of the API.

In my understanding, each PR should be self-contained. If we are OK with having 
SPARK-24882 in master as an individual commit, I think it's also OK to have it 
in branch 2.4.

I've created https://issues.apache.org/jira/browse/SPARK-25390 to track the new 
abstraction. It doesn't change the API a lot, but updates the streaming 
execution engine quite a bit.

Thanks,
Wenchen

On Mon, Sep 10, 2018 at 4:20 AM Ryan Blue <rb...@netflix.com> wrote:
Wenchen, can you hold off on the first RC?

The half-finished changes from the redesign of the DataSourceV2 API are in 
master, added in SPARK-24882<https://github.com/apache/spark/pull/22009>, and 
are now in the 2.4 branch. We've had a lot of good discussion since that PR was 
merged to update and fix the design, plus only one of the follow-ups on 
SPARK-25186<https://issues.apache.org/jira/browse/SPARK-25186> is done. 
Clearly, the redesign was too large to get into 2.4 in so little time -- it was 
proposed about 10 days before the original branch date -- and I don't think it 
is a good idea to release half-finished major changes.

The easiest solution is to revert SPARK-24882 in the release branch. That way 
we have minor changes in 2.4 and major changes in the next release, instead of 
major changes in both. What does everyone think?

rb

On Fri, Sep 7, 2018 at 10:37 AM shane knapp <skn...@berkeley.edu> wrote:
++joshrosen  (thanks for the help w/deploying the jenkins configs)

the basic 2.4 builds are deployed and building!

i haven't created (a) build(s) yet for scala 2.12...  i'll be coordinating this 
w/the databricks folks next week.

On Fri, Sep 7, 2018 at 9:53 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Thank you, Shane! :D

Bests,
Dongjoon.

On Fri, Sep 7, 2018 at 9:51 AM shane knapp <skn...@berkeley.edu> wrote:
i'll try and get to the 2.4 branch stuff today...




--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Ryan Blue
Software Engineer
Netflix
