Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matei Zaharia
To add to this, we can add a stable interface anytime if the original one was marked as unstable; we wouldn’t have to wait until 4.0. We had a lot of APIs that were experimental in 2.0 and then got stabilized in later 2.x releases for example. Matei > On Feb 26, 2019, at 5:12 PM, Reynold Xin

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Reynold Xin
We will have to fix that before we declare DSv2 stable, because InternalRow is not a stable API. We don’t necessarily need to do it in 3.0. On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah wrote: > Will that then require an API break down the line? Do we save that for > Spark 4? > -Matt

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Will that then require an API break down the line? Do we save that for Spark 4? -Matt Cheah From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Tuesday, February 26, 2019 at 4:53 PM To: Matt Cheah Cc: Sean Owen , Wenchen Fan , Xiao Li , Matei Zaharia , Spark Dev List Subject: Re:

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
That's a good question. While I'd love to have a solution for that, I don't think it is a good idea to delay DSv2 until we have one. That is going to require a lot of internal changes and I don't see how we could make the release date if we are including an InternalRow replacement. On Tue, Feb

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Ryan Blue
Thanks for bumping this, Matt. I think we can have the discussion here to clarify exactly what we’re committing to and then have a vote thread once we’re agreed. Getting back to the DSv2 discussion, I think we have a good handle on what would be added: - Plugin system for catalogs -
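To make the "plugin system for catalogs" item above concrete, here is a minimal Python sketch of how a multi-part identifier might be resolved against a registry of named catalogs. Every name and rule here is hypothetical and for illustration only; it is not Spark's actual catalog plugin API.

```python
# Hypothetical sketch of multi-part identifier resolution for a
# multi-catalog world: split ['cat', 'db', 'tbl'] into
# (catalog, namespace, table). If the first part names a registered
# catalog, route to it; otherwise fall back to a default catalog.
# Illustration only -- not Spark's actual API.

DEFAULT_CATALOG = "spark_catalog"

def resolve_identifier(parts, catalogs):
    """Return (catalog_name, namespace_tuple, table_name)."""
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], tuple(parts[1:-1]), parts[-1]
    return DEFAULT_CATALOG, tuple(parts[:-1]), parts[-1]

catalogs = {"prod", "test"}
print(resolve_identifier(["prod", "db", "events"], catalogs))
print(resolve_identifier(["db", "events"], catalogs))
```

The interesting design question this sketch surfaces is exactly the one in the SPIP discussion: whether the leading part is a catalog name or a namespace is ambiguous without a registry lookup.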

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
Reynold made a note earlier about a proper Row API that isn’t InternalRow – is that still on the table? -Matt Cheah From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Tuesday, February 26, 2019 at 4:40 PM To: Matt Cheah Cc: Sean Owen , Wenchen Fan , Xiao Li , Matei Zaharia , Spark

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Matt Cheah
What would then be the next steps we'd take to collectively decide on plans and timelines moving forward? Might I suggest scheduling a conference call with appropriate PMCs to put our ideas together? Maybe such a discussion can take place at next week's meeting? Or do we need to have a separate

Re: Request review for long-standing PRs

2019-02-26 Thread Arun Mahadevan
Yes, I agree that it's a valid concern and that it leads individual contributors to give up on new ideas or major improvements. On Tue, 26 Feb 2019 at 15:24, Jungtaek Lim wrote: > Adding one more: it implicitly leads individual contributors to give up > on challenging major things and just focus on

Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Mr. Torres, can you give these a pass please? On Tue, Feb 26, 2019 at 4:38 PM Jungtaek Lim wrote: > Hi devs, > sorry to bring this again to the mailing list, but you know, pinging in a GitHub PR > just doesn't work. > I have long-standing (created last year) PRs in the SS area which have already got > over

Re: Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Adding one more: it implicitly leads individual contributors to give up on challenging major things and just focus on minor things, which still helps the project, but not in the long run. We don't have a roadmap put up on the wall so the whole community can share the load together, so individual

Re: Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Thanks Sean, as always, for sharing your thoughts quickly! I agree with most of the points, except "they add a lot of code and complexity relative to benefit", since no one can weigh in on something before at least taking a quick review. IMHO if someone thinks so, it's better to speak up (I know it's hard and being a

Re: Request review for long-standing PRs

2019-02-26 Thread Sean Owen
Those aren't bad changes, but they add a lot of code and complexity relative to benefit. I think it's positive that you've gotten people to spend time reviewing them, quite a lot. I don't know whether they should be merged. This isn't a 'bug' though; not all changes should be committed. Simple and

Re: SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xiangrui Meng
In case there are issues visiting the Google doc, I attached PDF files to the JIRA. On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang wrote: > Hi all, > I want to send a revised SPIP on implementing Accelerator (GPU)-aware > Scheduling. It improves Spark by making it aware of GPUs exposed by cluster >

Request review for long-standing PRs

2019-02-26 Thread Jungtaek Lim
Hi devs, sorry to bring this again to the mailing list, but you know, pinging in a GitHub PR just doesn't work. I have long-standing (created last year) PRs in the SS area which have already got over 100 comments (so the community and I have already put in a lot of effort) but no progress toward being merged

Re: [build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
jenkins is churning through a lot of github updates, and i'm finally seeing the backlog of pull request builds starting. i'll keep an eye on things over the afternoon. On Tue, Feb 26, 2019 at 12:26 PM shane knapp wrote: > restarted jenkins, staring at logs. will report back when things

Re: [SS] Allowing stream Sink metadata as part of checkpoint?

2019-02-26 Thread Jungtaek Lim
I understand the reason for storing information along with data for transactional committing, but it mostly makes sense if we store outputs along with all necessary checkpoint information in a transactional manner. Spark doesn't store the query checkpoint along with outputs. I feel this is regarding

Re: [build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
restarted jenkins, staring at logs. will report back when things look good. On Tue, Feb 26, 2019 at 12:22 PM shane knapp wrote: > investigating, and this will most likely require a jenkins restart. > > -- > Shane Knapp > UC Berkeley EECS Research / RISELab Staff Technical Lead >

Re: PR tests not running?

2019-02-26 Thread shane knapp
yeah, i'm on it. On Tue, Feb 26, 2019 at 11:39 AM Xiao Li wrote: > Thanks for reporting it! It sounds like Shane is working on it. I manually > triggered the test for the PR https://github.com/apache/spark/pull/23894 > . > Cheers, > Xiao > Bruce Robbins wrote on Tue, Feb 26, 2019 at 11:33 AM: >>

[build system] jenkins pull request builds not triggering

2019-02-26 Thread shane knapp
investigating, and this will most likely require a jenkins restart. -- Shane Knapp UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu

Re: PR tests not running?

2019-02-26 Thread Xiao Li
Thanks for reporting it! It sounds like Shane is working on it. I manually triggered the test for the PR https://github.com/apache/spark/pull/23894 . Cheers, Xiao Bruce Robbins wrote on Tue, Feb 26, 2019 at 11:33 AM: > Sorry for stating what is likely obvious, but PR tests don't appear to be > running.

PR tests not running?

2019-02-26 Thread Bruce Robbins
Sorry for stating what is likely obvious, but PR tests don't appear to be running. Last one started was around 2AM.

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-26 Thread Ryan Blue
Hi everyone, With 12 +1 votes and no +0 or -1 votes, this SPIP passes. Thanks to everyone that participated in the discussions and voted! rb On Thu, Feb 21, 2019 at 12:14 AM Xiao Li wrote: > +1 This is in the right direction. The resolution rules and catalog APIs > need more discussion when

SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xingbo Jiang
Hi all, I want to send a revised SPIP on implementing Accelerator (GPU)-aware Scheduling. It improves Spark by making it aware of GPUs exposed by cluster managers, so that Spark can match GPU resources with user task requests properly. If you have scenarios that need to run workloads (DL/ML/Signal
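The core scheduling idea in the SPIP ("match GPU resources with user task requests") can be illustrated with a toy matcher. All names here are hypothetical; this is a sketch of the concept, not Spark's scheduler code.

```python
# Toy illustration of accelerator-aware scheduling: pick an executor
# whose free GPU count satisfies a task's declared GPU requirement.
# Hypothetical names; not Spark's actual scheduler implementation.

def pick_executor(executors, gpus_needed):
    """Return the id of the first executor with enough free GPUs, else None.

    executors: dict mapping executor id -> number of free GPUs.
    """
    for exec_id, free_gpus in executors.items():
        if free_gpus >= gpus_needed:
            return exec_id
    return None

executors = {"exec-1": 0, "exec-2": 2}
assert pick_executor(executors, 1) == "exec-2"  # exec-1 has no free GPUs
assert pick_executor(executors, 4) is None      # no executor can satisfy 4
```

The real design has to handle much more (discovery of GPU addresses, per-stage requirements, cluster-manager integration), which is what the SPIP document covers.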

Re: Thoughts on dataframe cogroup?

2019-02-26 Thread Li Jin
Thank you both for the reply. Chris and I have very similar use cases for cogroup. One of the goals for groupby apply + pandas UDF was to avoid things like collect_list and reshaping data between Spark and Pandas. Cogroup feels very similar and can be an extension to the groupby apply + pandas
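The cogroup semantics being discussed (pair up same-key groups from two dataframes and apply a function to each pair) can be sketched in plain pandas. This is an illustration of the idea only, not the proposed Spark API; the helper names are made up.

```python
# Sketch of dataframe cogroup in plain pandas: group two DataFrames by
# a key, then apply `func(key, left_group, right_group)` to each pair
# of same-key groups. Keys missing on one side get an empty frame.
import pandas as pd

def cogroup_apply(left, right, key, func):
    keys = sorted(set(left[key]) | set(right[key]))
    lg = {k: g for k, g in left.groupby(key)}
    rg = {k: g for k, g in right.groupby(key)}
    empty_l, empty_r = left.iloc[0:0], right.iloc[0:0]
    parts = [func(k, lg.get(k, empty_l), rg.get(k, empty_r)) for k in keys]
    return pd.concat(parts, ignore_index=True)

left = pd.DataFrame({"id": [1, 1, 2], "x": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 3], "y": [0.5, 0.7]})

def summarize(k, l, r):
    # One output row per key: the size of each side's group.
    return pd.DataFrame({"id": [k], "n_left": [len(l)], "n_right": [len(r)]})

result = cogroup_apply(left, right, "id", summarize)
print(result)
```

This avoids the collect-and-reshape workaround mentioned above: each side stays a regular DataFrame and the user function sees the two groups directly.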