Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
Responding to your request for a vote, I meant that this isn't required per
se and the consensus here was not to vote on it. Hence the jokes about
meta-voting protocol. In that sense nothing new happened process-wise,
nothing against ASF norms, if that's your concern.

I think it's just an agreed convention now, that we will VOTE, as normal,
on particular types of changes that we call SPIPs. I mean it's no new
process in the ASF sense because VOTEs are an existing mechanic. I
personally view it as, simply, additional guidance about how to manage huge
JIRAs in a way that makes them stand a chance of moving forward. I suppose
we could VOTE about any JIRA if we wanted. They all proceed via lazy
consensus at the moment.

Practically -- I heard support for codifying this process and no objections
to the final form. This was bouncing around in process purgatory, when no
particular new process was called for.

It takes effect immediately, implicitly, like anything else I guess, like
amendments to code style guidelines. Please use SPIPs to propose big
changes from here.

As to finding it hard to pick out of the noise, sure, I sympathize. Many
big things happen without a VOTE tag though. It does take a time investment
to triage these email lists. I don't know that this by itself means a VOTE
should have happened.

On Mon, Mar 13, 2017 at 6:15 PM Tom Graves  wrote:

> Another thing I think you should send out is when exactly this takes
> effect.  Is it any major new feature without a pull request?  Is it
> anything major starting with the 2.3 release?
>
> Tom
>
>
> On Monday, March 13, 2017 1:08 PM, Tom Graves 
> wrote:
>
>
> I'm not sure how you can say it's not a new process.  If that is the case
> why do we need a page documenting it?
> As a developer if I want to put up a major improvement I have to now
> follow the SPIP whereas before I didn't; that certainly seems like a new
> process.  As a PMC member I now have the ability to vote on these SPIPs,
> that seems like something new again.
>
> There are Apache bylaws and then there are project-specific bylaws.  As
> far as I know Spark doesn't document any of its project-specific bylaws, so
> I guess this isn't officially a change to them, but it was implicit before
> that you didn't need any review for major improvements; now you need an
> explicit vote for them to be approved.  That certainly seems to fall under
> the "Procedural" section in the voting link you sent.
>
> I understand this was under discussion for a while and you have asked for
> people's feedback multiple times.  But sometimes long threads are easy to
> ignore.  That is why personally I like to see things labelled [VOTE],
> [ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like
> this.
>
> I don't really want to draw this out or argue anymore about it; if I
> really wanted a vote I guess I would -1 the change. I'm not going to do
> that.
> I would at least like to see an announcement go out about it.  The last
> thing I saw you say was you were going to call a vote.  A few people chimed
> in with their thoughts on that vote, but nothing was said after that.
>
> Tom
>
>
>
> On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
>
>
> It's not a new process, in that it doesn't entail anything not already in
> http://apache.org/foundation/voting.html . We're just deciding to call a
> VOTE for this type of code modification.
>
> To your point -- yes, it's been around a long time with no further
> comment, and I called several times for more input. That's pretty strong
> lazy consensus of the form we use every day.
>
> On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:
>
> It seems like if you are adding responsibilities you should do a vote.
> SPIPs require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden it's implemented without even an announcement being
> sent out about it.
>
> Tom
>
>
>
>
>
>


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
Another thing I think you should send out is when exactly this takes
effect.  Is it any major new feature without a pull request?  Is it anything
major starting with the 2.3 release?
Tom 

On Monday, March 13, 2017 1:08 PM, Tom Graves 
 wrote:
 

I'm not sure how you can say it's not a new process.  If that is the case why
do we need a page documenting it?  
As a developer if I want to put up a major improvement I have to now follow the 
SPIP whereas before I didn't; that certainly seems like a new process.  As a PMC
member I now have the ability to vote on these SPIPs, that seems like something 
new again. 
There are Apache bylaws and then there are project-specific bylaws.  As far as
I know Spark doesn't document any of its project-specific bylaws, so I guess
this isn't officially a change to them, but it was implicit before that you
didn't need any review for major improvements; now you need an explicit
vote for them to be approved.  That certainly seems to fall under the "Procedural"
section in the voting link you sent.
I understand this was under discussion for a while and you have asked for 
people's feedback multiple times.  But sometimes long threads are easy to
ignore.  That is why personally I like to see things labelled [VOTE], 
[ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like this. 
I don't really want to draw this out or argue anymore about it; if I really
wanted a vote I guess I would -1 the change. I'm not going to do that. I would 
at least like to see an announcement go out about it.  The last thing I saw you 
say was you were going to call a vote.  A few people chimed in with their 
thoughts on that vote, but nothing was said after that. 
Tom

 

On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
 

 It's not a new process, in that it doesn't entail anything not already in 
http://apache.org/foundation/voting.html . We're just deciding to call a VOTE 
for this type of code modification.
To your point -- yes, it's been around a long time with no further comment, and 
I called several times for more input. That's pretty strong lazy consensus of 
the form we use every day. 

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

It seems like if you are adding responsibilities you should do a vote.  SPIPs
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
it's implemented without even an announcement being sent out about it.
Tom 


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
I'm not sure how you can say it's not a new process.  If that is the case why do
we need a page documenting it?  
As a developer if I want to put up a major improvement I have to now follow the 
SPIP whereas before I didn't; that certainly seems like a new process.  As a PMC
member I now have the ability to vote on these SPIPs, that seems like something 
new again. 
There are Apache bylaws and then there are project-specific bylaws.  As far as
I know Spark doesn't document any of its project-specific bylaws, so I guess
this isn't officially a change to them, but it was implicit before that you
didn't need any review for major improvements; now you need an explicit
vote for them to be approved.  That certainly seems to fall under the "Procedural"
section in the voting link you sent.
I understand this was under discussion for a while and you have asked for 
people's feedback multiple times.  But sometimes long threads are easy to
ignore.  That is why personally I like to see things labelled [VOTE], 
[ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like this. 
I don't really want to draw this out or argue anymore about it; if I really
wanted a vote I guess I would -1 the change. I'm not going to do that. I would 
at least like to see an announcement go out about it.  The last thing I saw you 
say was you were going to call a vote.  A few people chimed in with their 
thoughts on that vote, but nothing was said after that. 
Tom

 

On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
 

 It's not a new process, in that it doesn't entail anything not already in 
http://apache.org/foundation/voting.html . We're just deciding to call a VOTE 
for this type of code modification.
To your point -- yes, it's been around a long time with no further comment, and 
I called several times for more input. That's pretty strong lazy consensus of 
the form we use every day. 

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

It seems like if you are adding responsibilities you should do a vote.  SPIPs
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
it's implemented without even an announcement being sent out about it.
Tom 




Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
It's not a new process, in that it doesn't entail anything not already in
http://apache.org/foundation/voting.html . We're just deciding to call a
VOTE for this type of code modification.

To your point -- yes, it's been around a long time with no further comment,
and I called several times for more input. That's pretty strong lazy
consensus of the form we use every day.

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

> It seems like if you are adding responsibilities you should do a vote.
> SPIPs require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden it's implemented without even an announcement being
> sent out about it.
>
> Tom
>
>


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
It seems like if you are adding responsibilities you should do a vote.  SPIPs
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
it's implemented without even an announcement being sent out about it.
Tom 

On Monday, March 13, 2017 11:37 AM, Sean Owen  wrote:
 

 This ended up proceeding as a normal doc change, instead of precipitating a 
meta-vote. However, the text that's on the web site now can certainly be further
amended if anyone wants to propose a change from here.
On Mon, Mar 13, 2017 at 1:50 PM Tom Graves  wrote:

I think a vote here would be good. I think most of the discussion was done by 4 
or 5 people and it's a long thread.  If nothing else it summarizes everything
and gets people's attention to the change.
Tom 

On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
 

 I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. Nah, 
anyone can call a vote. This really isn't that formal. We just want to declare 
and document consensus.
I think SPIP is just a remix of existing process anyway, and don't think it 
will actually do much anyway, which is why I am sanguine about the whole thing.
To bring this to a conclusion, I will just put the contents of the doc in an 
email tomorrow for a VOTE. Raise any objections now.
On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)


Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
This ended up proceeding as a normal doc change, instead of precipitating a
meta-vote.
However, the text that's on the web site now can certainly be further
amended if anyone wants to propose a change from here.

On Mon, Mar 13, 2017 at 1:50 PM Tom Graves  wrote:

> I think a vote here would be good. I think most of the discussion was done
> by 4 or 5 people and it's a long thread.  If nothing else it summarizes
> everything and gets people's attention to the change.
>
> Tom
>
>
> On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
>
>
> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>
> I started this idea as a fork with a merge-able change to docs.
> Reynold moved it to his google doc, and has suggested during this
> email thread that a vote should occur.
> If a vote needs to occur, I can't see anything on
> http://apache.org/foundation/voting.html suggesting that I can call
> for a vote, which is why I'm asking PMC members to do it since they're
> the ones who would vote anyway.
> Now Sean is saying this is a code/doc change that can just be reviewed
> and merged as usual...which is what I tried to do to begin with.
>
> The fact that you haven't agreed on a process to agree on your process
> is, I think, an indication that the process really does need
> improvement ;)
>
>
>
>


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
I think a vote here would be good. I think most of the discussion was done by 4 
or 5 people and it's a long thread.  If nothing else it summarizes everything
and gets people's attention to the change.
Tom 

On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
 

 I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. Nah, 
anyone can call a vote. This really isn't that formal. We just want to declare 
and document consensus.
I think SPIP is just a remix of existing process anyway, and don't think it 
will actually do much anyway, which is why I am sanguine about the whole thing.
To bring this to a conclusion, I will just put the contents of the doc in an 
email tomorrow for a VOTE. Raise any objections now.
On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)





Re: Spark Improvement Proposals

2017-03-10 Thread Reynold Xin
We can just start using the spip label and link to it.



On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger  wrote:

> So to be clear, if I translate that google doc to markup and submit a
> PR, you will merge it?
>
> If we're just using the "spip" label, that's probably fine, but we still
> need shared filters for open and closed SPIPs so the page can link to
> them.
>
> I do not believe I have JIRA permissions to share filters; I just
> attempted to edit one of mine and do not see an "add shares" field.
>
> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen  wrote:
> > Sure, that seems OK to me. I can merge anything like that.
> > I think anyone can make a new label in JIRA; I don't know if even the
> admins
> > can make a new issue type unfortunately. We may just have to mention a
> > convention involving title and label or something.
> >
> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger 
> wrote:
> >>
> >> I think it ought to be its own page, linked from the more / community
> >> menu dropdowns.
> >>
> >> We also need the jira tag, and for the page to clearly link to filters
> >> that show proposed / completed SPIPs
> >>
> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
> >> > let's
> >> > say this document is the SPIP 1.0 process.
> >> >
> >> > I think the next step is just to translate the text to some suitable
> >> > location. I suggest adding it to
> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
> >> >
> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
> >> >>
> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
> >> >> hurt.
> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
> >> >> want to
> >> >> declare and document consensus.
> >> >>
> >> >> I think SPIP is just a remix of existing process anyway, and don't
> >> >> think
> >> >> it will actually do much anyway, which is why I am sanguine about the
> >> >> whole
> >> >> thing.
> >> >>
> >> >> To bring this to a conclusion, I will just put the contents of the
> doc
> >> >> in
> >> >> an email tomorrow for a VOTE. Raise any objections now.
>
>
>


Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
Can someone with filter share permissions make a filter for open
SPIPs and one for closed SPIPs and share them?

e.g.

project = SPARK AND status in (Open, Reopened, "In Progress") AND
labels=SPIP ORDER BY createdDate DESC

and another with the status closed equivalent
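For example, the closed counterpart could be a sketch like this (assuming
the standard Closed/Resolved statuses; the exact status names depend on the
Spark JIRA workflow):

project = SPARK AND status in (Closed, Resolved) AND
labels=SPIP ORDER BY createdDate DESC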

I just made an open ticket with the SPIP label, so it should show up.

On Fri, Mar 10, 2017 at 11:19 AM, Reynold Xin  wrote:
> We can just start using the spip label and link to it.
>
>
>
> On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger  wrote:
>>
>> So to be clear, if I translate that google doc to markup and submit a
>> PR, you will merge it?
>>
>> If we're just using the "spip" label, that's probably fine, but we still
>> need shared filters for open and closed SPIPs so the page can link to
>> them.
>>
>> I do not believe I have JIRA permissions to share filters; I just
>> attempted to edit one of mine and do not see an "add shares" field.
>>
>> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen  wrote:
>> > Sure, that seems OK to me. I can merge anything like that.
>> > I think anyone can make a new label in JIRA; I don't know if even the
>> > admins
>> > can make a new issue type unfortunately. We may just have to mention a
>> > convention involving title and label or something.
>> >
>> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger 
>> > wrote:
>> >>
>> >> I think it ought to be its own page, linked from the more / community
>> >> menu dropdowns.
>> >>
>> >> We also need the jira tag, and for the page to clearly link to filters
>> >> that show proposed / completed SPIPs
>> >>
>> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
>> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> >> > let's
>> >> > say this document is the SPIP 1.0 process.
>> >> >
>> >> > I think the next step is just to translate the text to some suitable
>> >> > location. I suggest adding it to
>> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >> >
>> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
>> >> >>
>> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> >> hurt.
>> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> >> want to
>> >> >> declare and document consensus.
>> >> >>
>> >> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> >> think
>> >> >> it will actually do much anyway, which is why I am sanguine about
>> >> >> the
>> >> >> whole
>> >> >> thing.
>> >> >>
>> >> >> To bring this to a conclusion, I will just put the contents of the
>> >> >> doc
>> >> >> in
>> >> >> an email tomorrow for a VOTE. Raise any objections now.
>>
>>
>




Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
So to be clear, if I translate that google doc to markup and submit a
PR, you will merge it?

If we're just using the "spip" label, that's probably fine, but we still
need shared filters for open and closed SPIPs so the page can link to
them.

I do not believe I have JIRA permissions to share filters; I just
attempted to edit one of mine and do not see an "add shares" field.

On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen  wrote:
> Sure, that seems OK to me. I can merge anything like that.
> I think anyone can make a new label in JIRA; I don't know if even the admins
> can make a new issue type unfortunately. We may just have to mention a
> convention involving title and label or something.
>
> On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger  wrote:
>>
>> I think it ought to be its own page, linked from the more / community
>> menu dropdowns.
>>
>> We also need the jira tag, and for the page to clearly link to filters
>> that show proposed / completed SPIPs
>>
>> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
>> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> > let's
>> > say this document is the SPIP 1.0 process.
>> >
>> > I think the next step is just to translate the text to some suitable
>> > location. I suggest adding it to
>> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >
>> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
>> >>
>> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> hurt.
>> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> want to
>> >> declare and document consensus.
>> >>
>> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> think
>> >> it will actually do much anyway, which is why I am sanguine about the
>> >> whole
>> >> thing.
>> >>
>> >> To bring this to a conclusion, I will just put the contents of the doc
>> >> in
>> >> an email tomorrow for a VOTE. Raise any objections now.




Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
I think it ought to be its own page, linked from the more / community
menu dropdowns.

We also need the jira tag, and for the page to clearly link to filters
that show proposed / completed SPIPs

On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
> Alrighty, if nobody is objecting, and nobody calls for a VOTE, then, let's
> say this document is the SPIP 1.0 process.
>
> I think the next step is just to translate the text to some suitable
> location. I suggest adding it to
> https://github.com/apache/spark-website/blob/asf-site/contributing.md
>
> On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
>>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.




Re: Spark Improvement Proposals

2017-03-10 Thread Sean Owen
Alrighty, if nobody is objecting, and nobody calls for a VOTE, then, let's
say this document is the SPIP 1.0 process.

I think the next step is just to translate the text to some suitable
location. I suggest adding it to
https://github.com/apache/spark-website/blob/asf-site/contributing.md

On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:

> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>


Re: Spark Improvement Proposals

2017-03-09 Thread Koert Kuipers
gonna end up with a stackoverflow on recursive votes here

On Thu, Mar 9, 2017 at 1:17 PM, Mark Hamstra 
wrote:

> -0 on voting on whether we need a vote.
>
> On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin  wrote:
>
>> I'm fine without a vote. (are we voting on whether we need a vote?)
>>
>>
>> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:
>>
>>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>>> declare and document consensus.
>>>
>>> I think SPIP is just a remix of existing process anyway, and don't think
>>> it will actually do much anyway, which is why I am sanguine about the whole
>>> thing.
>>>
>>> To bring this to a conclusion, I will just put the contents of the doc
>>> in an email tomorrow for a VOTE. Raise any objections now.
>>>
>>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger 
>>> wrote:
>>>
 I started this idea as a fork with a merge-able change to docs.
 Reynold moved it to his google doc, and has suggested during this
 email thread that a vote should occur.
 If a vote needs to occur, I can't see anything on
 http://apache.org/foundation/voting.html suggesting that I can call
 for a vote, which is why I'm asking PMC members to do it since they're
 the ones who would vote anyway.
 Now Sean is saying this is a code/doc change that can just be reviewed
 and merged as usual...which is what I tried to do to begin with.

 The fact that you haven't agreed on a process to agree on your process
 is, I think, an indication that the process really does need
 improvement ;)


>>
>


Re: Spark Improvement Proposals

2017-03-09 Thread Mark Hamstra
-0 on voting on whether we need a vote.

On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin  wrote:

> I'm fine without a vote. (are we voting on whether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>


Re: Spark Improvement Proposals

2017-03-09 Thread vaquar khan
Many of us have issues with the "shepherd role"; I think we should go with a
vote.

Regards,
Vaquar khan

On Thu, Mar 9, 2017 at 11:00 AM, Reynold Xin  wrote:

> I'm fine without a vote. (are we voting on whether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Spark Improvement Proposals

2017-03-09 Thread Reynold Xin
I'm fine without a vote. (are we voting on whether we need a vote?)


On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen  wrote:

> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>
>> I started this idea as a fork with a merge-able change to docs.
>> Reynold moved it to his google doc, and has suggested during this
>> email thread that a vote should occur.
>> If a vote needs to occur, I can't see anything on
>> http://apache.org/foundation/voting.html suggesting that I can call
>> for a vote, which is why I'm asking PMC members to do it since they're
>> the ones who would vote anyway.
>> Now Sean is saying this is a code/doc change that can just be reviewed
>> and merged as usual...which is what I tried to do to begin with.
>>
>> The fact that you haven't agreed on a process to agree on your process
>> is, I think, an indication that the process really does need
>> improvement ;)
>>
>>


Re: Spark Improvement Proposals

2017-03-09 Thread Sean Owen
I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
Nah, anyone can call a vote. This really isn't that formal. We just want to
declare and document consensus.

I think SPIP is just a remix of existing process anyway, and don't think it
will actually do much anyway, which is why I am sanguine about the whole
thing.

To bring this to a conclusion, I will just put the contents of the doc in
an email tomorrow for a VOTE. Raise any objections now.

On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:

> I started this idea as a fork with a merge-able change to docs.
> Reynold moved it to his google doc, and has suggested during this
> email thread that a vote should occur.
> If a vote needs to occur, I can't see anything on
> http://apache.org/foundation/voting.html suggesting that I can call
> for a vote, which is why I'm asking PMC members to do it since they're
> the ones who would vote anyway.
> Now Sean is saying this is a code/doc change that can just be reviewed
> and merged as usual...which is what I tried to do to begin with.
>
> The fact that you haven't agreed on a process to agree on your process
> is, I think, an indication that the process really does need
> improvement ;)
>
>


Re: Spark Improvement Proposals

2017-03-09 Thread Cody Koeninger
I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)

On Tue, Mar 7, 2017 at 11:05 AM, Sean Owen  wrote:
> Do we need a VOTE? Heck, I think anyone can call one, anyway.
>
> Pre-flight vote check: anyone have objections to the text as-is?
> See
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>
> If so, let's hash out specific suggested changes.
>
> If not, then I think the next step is to probably update the
> github.com/apache/spark-website repo with the text here. That's a code/doc
> change we can just review and merge as usual.
>
> On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger  wrote:
>>
>> Another week, another ping.  Anyone on the PMC willing to call a vote on
>> this?




Re: Spark Improvement Proposals

2017-03-07 Thread Sean Owen
Do we need a VOTE? Heck, I think anyone can call one, anyway.

Pre-flight vote check: anyone have objections to the text as-is?
See
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

If so, let's hash out specific suggested changes.

If not, then I think the next step is to probably update the
github.com/apache/spark-website repo with the text here. That's a code/doc
change we can just review and merge as usual.

On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger  wrote:

> Another week, another ping.  Anyone on the PMC willing to call a vote on
> this?
>


Re: Spark Improvement Proposals

2017-03-07 Thread Cody Koeninger
writing down a process won't necessarily solve any problems one
>>> way or
>>> >>> > the other.  But one outwardly visible change I'm hoping for out of
>>> >>> > this is a way for people who have a stake in Spark, but can't follow
>>> >>> > jiras closely, to go to the Spark website, see the list of proposed
>>> >>> > major changes, contribute discussion on issues that are relevant to
>>> >>> > their needs, and see a clear direction once a vote has passed.  We
>>> >>> > don't have that now.
>>> >>> >
>>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>>> >>> > changes they don't like anyway, so might as well be up front about
>>> the
>>> >>> > reality of the situation.
>>> >>> >
>>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com>
>>> wrote:
>>> >>> >> The text seems fine to me. Really, this is not describing a
>>> >>> >> fundamentally
>>> >>> >> new process, which is good. We've always had JIRAs, we've always
>>> been
>>> >>> >> able
>>> >>> >> to call a VOTE for a big question. This just writes down a
>>> sensible
>>> >>> >> set of
>>> >>> >> guidelines for putting those two together when a major change is
>>> >>> >> proposed. I
>>> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
>>> >>> >>
>>> >>> >> My only hesitation is that this seems to be perceived by some as
>>> a new
>>> >>> >> or
>>> >>> >> different thing, that is supposed to solve some problems that
>>> aren't
>>> >>> >> otherwise solvable. I see mentioned problems like: clear process
>>> for
>>> >>> >> managing work, public communication, more committers, some sort of
>>> >>> >> binding
>>> >>> >> outcome and deadline.
>>> >>> >>
>>> >>> >> If SPIP is supposed to be a way to make people design in public
>>> and a
>>> >>> >> way to
>>> >>> >> force attention to a particular change, then, this doesn't do
>>> that by
>>> >>> >> itself. Therefore I don't want to let a detailed discussion of
>>> SPIP
>>> >>> >> detract
>>> >>> >> from the discussion about doing what SPIP implies. It's just a
>>> process
>>> >>> >> document.
>>> >>> >>
>>> >>> >> Still, a fine step IMHO.
>>> >>> >>
>>> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Updated. Any feedback from other community members?
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
>>> c...@koeninger.org>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Thanks for doing that.
>>> >>> >>>>
>>> >>> >>>> Given that there are at least 4 different Apache voting
>>> processes,
>>> >>> >>>> "typical Apache vote process" isn't meaningful to me.
>>> >>> >>>>
>>> >>> >>>> I think the intention is that in order to pass, it needs at
>>> least 3
>>> >>> >>>> +1
>>> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
>>> the
>>> >>> >>>> document
>>> >>> >>>> doesn't explicitly say that second part.
>>> >>> >>>>
>>> >>> >>>> There's also no mention of the duration a vote should remain
>>> open.
>>> >>> >>>> There's a mention of a month for finding a shepherd, but that's
>>> >>> >>>> different.
>>> >>> >>>>
>>> >>>> Other than that, LGTM.

Re: Spark Improvement Proposals

2017-02-27 Thread Sean Owen
To me, no new process is being invented here, on purpose, and so we should
just rely on whatever governs any large JIRA or vote, because SPIPs are
really just guidance for making a big JIRA.

http://apache.org/foundation/voting.html suggests that PMC members have the
binding votes in general, and for code-modification votes in particular,
which is what this is. Absent a strong reason to diverge from that, I'd go
with that.

(PS: On reading this, I hadn't realized that the guidance was that releases
are blessed just by majority vote. Oh well, not that it has mattered.)

I also don't see a need to require a shepherd, because JIRAs don't have
such a process, though I also can't see a situation where nobody with a
vote cares to endorse the SPIP ever, but three people vote for it and
nobody objects?

Perhaps downgrade this to "strongly suggested, so that you don't waste your
time."

Or, implicitly, that proposing a SPIP calls a vote that lasts for, dunno, a
month. If fewer than 3 PMC members vote for it, it doesn't pass anyway. If at least
1 does, OK, they're the shepherd(s). No new process.

On Mon, Feb 27, 2017 at 9:09 PM Ryan Blue  wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
>


Re: Spark Improvement Proposals

2017-02-27 Thread Ryan Blue
>
>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>> >>> > changes they don't like anyway, so might as well be up front about
>> the
>> >>> > reality of the situation.
>> >>> >
>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com>
>> wrote:
>> >>> >> The text seems fine to me. Really, this is not describing a
>> >>> >> fundamentally
>> >>> >> new process, which is good. We've always had JIRAs, we've always
>> been
>> >>> >> able
>> >>> >> to call a VOTE for a big question. This just writes down a sensible
>> >>> >> set of
>> >>> >> guidelines for putting those two together when a major change is
>> >>> >> proposed. I
>> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
>> >>> >>
>> >>> >> My only hesitation is that this seems to be perceived by some as a
>> new
>> >>> >> or
>> >>> >> different thing, that is supposed to solve some problems that
>> aren't
>> >>> >> otherwise solvable. I see mentioned problems like: clear process
>> for
>> >>> >> managing work, public communication, more committers, some sort of
>> >>> >> binding
>> >>> >> outcome and deadline.
>> >>> >>
>> >>> >> If SPIP is supposed to be a way to make people design in public
>> and a
>> >>> >> way to
>> >>> >> force attention to a particular change, then, this doesn't do that
>> by
>> >>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>> >>> >> detract
>> >>> >> from the discussion about doing what SPIP implies. It's just a
>> process
>> >>> >> document.
>> >>> >>
>> >>> >> Still, a fine step IMHO.
>> >>> >>
>> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Updated. Any feedback from other community members?
>> >>> >>>
>> >>> >>>
>> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
>> c...@koeninger.org>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Thanks for doing that.
>> >>> >>>>
>> >>> >>>> Given that there are at least 4 different Apache voting
>> processes,
>> >>> >>>> "typical Apache vote process" isn't meaningful to me.
>> >>> >>>>
>> >>> >>>> I think the intention is that in order to pass, it needs at
>> least 3
>> >>> >>>> +1
>> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
>> the
>> >>> >>>> document
>> >>> >>>> doesn't explicitly say that second part.
>> >>> >>>>
>> >>> >>>> There's also no mention of the duration a vote should remain
>> open.
>> >>> >>>> There's a mention of a month for finding a shepherd, but that's
>> >>> >>>> different.
>> >>> >>>>
>> >>> >>>> Other than that, LGTM.
>> >>> >>>>
>> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <
>> r...@databricks.com>
>> >>> >>>> wrote:
>> >>> >>>>>
>> >>> >>>>> Here's a new draft that incorporated most of the feedback:
>> >>> >>>>>
>> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#
>> >>> >>>>>
>> >>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>> >>> >>>>> Shepherd.
>> >>> >>>>>
>> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com>
>> >>> >>>>> wrote:
>> >>

Re: Spark Improvement Proposals

2017-02-24 Thread Joseph Bradley
> >>> >> or
> >>> >> different thing, that is supposed to solve some problems that aren't
> >>> >> otherwise solvable. I see mentioned problems like: clear process for
> >>> >> managing work, public communication, more committers, some sort of
> >>> >> binding
> >>> >> outcome and deadline.
> >>> >>
> >>> >> If SPIP is supposed to be a way to make people design in public and
> a
> >>> >> way to
> >>> >> force attention to a particular change, then, this doesn't do that
> by
> >>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
> >>> >> detract
> >>> >> from the discussion about doing what SPIP implies. It's just a
> process
> >>> >> document.
> >>> >>
> >>> >> Still, a fine step IMHO.
> >>> >>
> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> Updated. Any feedback from other community members?
> >>> >>>
> >>> >>>
> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
> c...@koeninger.org>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> Thanks for doing that.
> >>> >>>>
> >>> >>>> Given that there are at least 4 different Apache voting processes,
> >>> >>>> "typical Apache vote process" isn't meaningful to me.
> >>> >>>>
> >>> >>>> I think the intention is that in order to pass, it needs at least
> 3
> >>> >>>> +1
> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
> the
> >>> >>>> document
> >>> >>>> doesn't explicitly say that second part.
> >>> >>>>
> >>> >>>> There's also no mention of the duration a vote should remain open.
> >>> >>>> There's a mention of a month for finding a shepherd, but that's
> >>> >>>> different.
> >>> >>>>
> >>> >>>> Other than that, LGTM.
> >>> >>>>
> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com
> >
> >>> >>>> wrote:
> >>> >>>>>
> >>> >>>>> Here's a new draft that incorporated most of the feedback:
> >>> >>>>>
> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
> >>> >>>>>
> >>> >>>>> I added a specific role for SPIP Author and another one for SPIP
> >>> >>>>> Shepherd.
> >>> >>>>>
> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com>
> >>> >>>>> wrote:
> >>> >>>>>>
> >>> >>>>>> During the summit, I also had a lot of discussions over similar
> >>> >>>>>> topics
> >>> >>>>>> with multiple Committers and active users. I heard many
> fantastic
> >>> >>>>>> ideas. I
> >>> >>>>>> believe Spark improvement proposals are good channels to collect
> >>> >>>>>> the
> >>> >>>>>> requirements/designs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> IMO, we also need to consider the priority when working on these
> >>> >>>>>> items.
> >>> >>>>>> Even if the proposal is accepted, it does not mean it will be
> >>> >>>>>> implemented
> >>> >>>>>> and merged immediately. It is not a FIFO queue.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
> >>> >>>>>> them
> >>> >>>>>> back, if the design and implementation are not reviewed
> carefully.
> >>> >>>>>> We have
> >>> >>>>>> to ensure our quality. Spark is not an application software. It
> is
> >>> >>>>>> an
> >>> >>>>>> infrastructure software that is being used by many many
> companies.
> >>> >>>>>> We have
> >>> >>>>>> to be very careful in the design and implementation, especially
> >>> >>>>>> adding/changing the external APIs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> When I developed the Mainframe infrastructure/middleware
> software
> >>> >>>>>> in
> >>> >>>>>> the past 6 years, I was involved in the discussions with
> >>> >>>>>> external/internal
> >>> >>>>>> customers. The to-do feature list was always above 100.
> Sometimes,
> >>> >>>>>> the
> >>> >>>>>> customers are feeling frustrated when we are unable to deliver
> >>> >>>>>> them on time
> >>> >>>>>> due to the resource limits and others. Even if they paid us
> >>> >>>>>> billions, we
> >>> >>>>>> still need to do it phase by phase or sometimes they have to
> >>> >>>>>> accept the
> >>> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Thanks,
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Xiao Li
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >
> >
> >
> >
> > --
> > Regards,
> > Vaquar Khan
> > +1 -224-436-0783
> >
> > IT Architect / Lead Consultant
> > Greater Chicago
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com/


Re: Spark Improvement Proposals

2017-02-24 Thread Cody Koeninger
>>> >> document.
>>> >>
>>> >> Still, a fine step IMHO.
>>> >>
>>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com>
>>> >> wrote:
>>> >>>
>>> >>> Updated. Any feedback from other community members?
>>> >>>
>>> >>>
>>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <c...@koeninger.org>
>>> >>> wrote:
>>> >>>>
>>> >>>> Thanks for doing that.
>>> >>>>
>>> >>>> Given that there are at least 4 different Apache voting processes,
>>> >>>> "typical Apache vote process" isn't meaningful to me.
>>> >>>>
>>> >>>> I think the intention is that in order to pass, it needs at least 3
>>> >>>> +1
>>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
>>> >>>> document
>>> >>>> doesn't explicitly say that second part.
>>> >>>>
>>> >>>> There's also no mention of the duration a vote should remain open.
>>> >>>> There's a mention of a month for finding a shepherd, but that's
>>> >>>> different.
>>> >>>>
>>> >>>> Other than that, LGTM.
>>> >>>>
>>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Here's a new draft that incorporated most of the feedback:
>>> >>>>>
>>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>> >>>>>
>>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>>> >>>>> Shepherd.
>>> >>>>>
>>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> During the summit, I also had a lot of discussions over similar
>>> >>>>>> topics
>>> >>>>>> with multiple Committers and active users. I heard many fantastic
>>> >>>>>> ideas. I
>>> >>>>>> believe Spark improvement proposals are good channels to collect
>>> >>>>>> the
>>> >>>>>> requirements/designs.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> IMO, we also need to consider the priority when working on these
>>> >>>>>> items.
>>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>>> >>>>>> implemented
>>> >>>>>> and merged immediately. It is not a FIFO queue.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
>>> >>>>>> them
>>> >>>>>> back, if the design and implementation are not reviewed carefully.
>>> >>>>>> We have
>>> >>>>>> to ensure our quality. Spark is not an application software. It is
>>> >>>>>> an
>>> >>>>>> infrastructure software that is being used by many many companies.
>>> >>>>>> We have
>>> >>>>>> to be very careful in the design and implementation, especially
>>> >>>>>> adding/changing the external APIs.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> When I developed the Mainframe infrastructure/middleware software
>>> >>>>>> in
>>> >>>>>> the past 6 years, I was involved in the discussions with
>>> >>>>>> external/internal
>>> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
>>> >>>>>> the
>>> >>>>>> customers are feeling frustrated when we are unable to deliver
>>> >>>>>> them on time
>>> >>>>>> due to the resource limits and others. Even if they paid us
>>> >>>>>> billions, we
>>> >>>>>> still need to do it phase by phase or sometimes they have to
>>> >>>>>> accept the
>>> >>>>>> workarounds. That is the reality everyone has to face, I think.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Xiao Li
>>> >>>>>>>
>>> >>>>>>>
>>> >>
>>> >
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
>
> IT Architect / Lead Consultant
> Greater Chicago




Re: Spark Improvement Proposals

2017-02-17 Thread vaquar khan
>> >>>> Given that there are at least 4 different Apache voting processes,
>> >>>> "typical Apache vote process" isn't meaningful to me.
>> >>>>
>> >>>> I think the intention is that in order to pass, it needs at least 3
>> +1
>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
>> document
>> >>>> doesn't explicitly say that second part.
>> >>>>
>> >>>> There's also no mention of the duration a vote should remain open.
>> >>>> There's a mention of a month for finding a shepherd, but that's
>> different.
>> >>>>
>> >>>> Other than that, LGTM.
>> >>>>
>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com>
>> wrote:
>> >>>>>
>> >>>>> Here's a new draft that incorporated most of the feedback:
>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#
>> >>>>>
>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>> >>>>> Shepherd.
>> >>>>>
>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>> During the summit, I also had a lot of discussions over similar
>> topics
>> >>>>>> with multiple Committers and active users. I heard many fantastic
>> ideas. I
>> >>>>>> believe Spark improvement proposals are good channels to collect
>> the
>> >>>>>> requirements/designs.
>> >>>>>>
>> >>>>>>
>> >>>>>> IMO, we also need to consider the priority when working on these
>> items.
>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>> implemented
>> >>>>>> and merged immediately. It is not a FIFO queue.
>> >>>>>>
>> >>>>>>
>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
>> them
>> >>>>>> back, if the design and implementation are not reviewed carefully.
>> We have
>> >>>>>> to ensure our quality. Spark is not an application software. It is
>> an
>> >>>>>> infrastructure software that is being used by many many companies.
>> We have
>> >>>>>> to be very careful in the design and implementation, especially
>> >>>>>> adding/changing the external APIs.
>> >>>>>>
>> >>>>>>
>> >>>>>> When I developed the Mainframe infrastructure/middleware software
>> in
>> >>>>>> the past 6 years, I was involved in the discussions with
>> external/internal
>> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
>> the
>> >>>>>> customers are feeling frustrated when we are unable to deliver
>> them on time
>> >>>>>> due to the resource limits and others. Even if they paid us
>> billions, we
>> >>>>>> still need to do it phase by phase or sometimes they have to
>> accept the
>> >>>>>> workarounds. That is the reality everyone has to face, I think.
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>>
>> >>>>>> Xiao Li
>> >>>>>>>
>> >>>>>>>
>> >>
>> >
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
> [The shepherd] can advise on technical and procedural considerations for
people outside the community

The sentiment is good, but this doesn't justify requiring a shepherd for a
proposal. There are plenty of people that wouldn't need this, would get
feedback during discussion, or would ask a committer or PMC member if it
weren't a formal requirement.

> if no one is willing to be a shepherd, the proposed idea is probably not
going to receive much traction in the first place.

This also doesn't sound like a reason for needing a shepherd. Saying that a
shepherd probably won't hurt the process doesn't give me an idea of why a
shepherd should be required in the first place.

What was the motivation for adding a shepherd originally? It may not be bad
and it could be helpful, but neither of those makes me think that they
should be required or else the proposal fails.

rb

On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <timhun...@databricks.com>
wrote:

> The doc looks good to me.
>
> Ryan, the role of the shepherd is to make sure that someone
> knowledgeable with Spark processes is involved: this person can advise
> on technical and procedural considerations for people outside the
> community. Also, if no one is willing to be a shepherd, the proposed
> idea is probably not going to receive much traction in the first
> place.
>
> Tim



Re: Spark Improvement Proposals

2017-02-16 Thread Cody Koeninger
Reynold, thanks, LGTM.

Sean, great concerns.  I agree that behavior is largely cultural and
writing down a process won't necessarily solve any problems one way or
the other.  But one outwardly visible change I'm hoping for out of
this is a way for people who have a stake in Spark, but can't follow
jiras closely, to go to the Spark website, see the list of proposed
major changes, contribute discussion on issues that are relevant to
their needs, and see a clear direction once a vote has passed.  We
don't have that now.

Ryan, realistically speaking any PMC member can and will stop any
changes they don't like anyway, so might as well be up front about the
reality of the situation.




Re: Spark Improvement Proposals

2017-02-16 Thread Sean Owen
The text seems fine to me. Really, this is not describing a fundamentally
new process, which is good. We've always had JIRAs, we've always been able
to call a VOTE for a big question. This just writes down a sensible set of
guidelines for putting those two together when a major change is proposed.
I look forward to turning some big JIRAs into a request for a SPIP.

My only hesitation is that this seems to be perceived by some as a new or
different thing, that is supposed to solve some problems that aren't
otherwise solvable. I see mentioned problems like: clear process for
managing work, public communication, more committers, some sort of binding
outcome and deadline.

If SPIP is supposed to be a way to make people design in public and a way
to force attention to a particular change, then this doesn't do that by
itself. Therefore I don't want to let a detailed discussion of SPIP detract
from the discussion about doing what SPIP implies. It's just a process
document.

Still, a fine step IMHO.



Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
The current proposal seems process-heavy to me. That's not necessarily bad,
but there are a couple areas I haven't seen discussed.

Why is there a shepherd? If the person proposing a change has a good idea,
I don't see why a shepherd is either useful or necessary. The result of this
requirement is that each SPIP must attract the attention of a PMC member,
and that PMC member has then taken on extra responsibility. Why can't the
SPIP author simply call a vote when an idea has been sufficiently
discussed? I think *this* proposal would have moved faster if Cody had felt
empowered to bring it to a vote. More steps out of the author's control
will cause fewer ideas to move forward, regardless of quality, so we should
make sure this is balanced by a real benefit.

Why are only PMC members allowed a binding vote? I don't have a strong
inclination one way or another, but until recently this was an open
question. I'd like to hear the argument for restricting voting to PMC
members, or I think we should change it to allow all committers. If this
decision is left to default, let's be more inclusive.

I would be fine with the proposal overall if there are good reasons behind
these choices.

rb


Re: Spark Improvement Proposals

2017-02-16 Thread Reynold Xin
Updated. Any feedback from other community members?



Re: Spark Improvement Proposals

2017-02-14 Thread Cody Koeninger
Thanks for doing that.

Given that there are at least 4 different Apache voting processes, "typical
Apache vote process" isn't meaningful to me.

I think the intention is that in order to pass, it needs at least 3 +1
votes from PMC members *and no -1 votes from PMC members*.  But the
document doesn't explicitly say that second part.

There's also no mention of the duration a vote should remain open.  There's
a mention of a month for finding a shepherd, but that's different.

Other than that, LGTM.
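
For concreteness, the intended rule above (at least three +1 votes from PMC
members, and any PMC -1 blocks) could be sketched as follows; the Vote type
and spipPasses helper are hypothetical illustrations, not part of any
actual Spark tooling:

// Illustrative sketch only of the pass rule described above.
// Vote and spipPasses are hypothetical, not real Spark tooling.
case class Vote(voter: String, isPmcMember: Boolean, value: Int) // value: +1, 0, or -1

def spipPasses(votes: Seq[Vote]): Boolean = {
  val pmc = votes.filter(_.isPmcMember)  // only PMC votes are binding here
  pmc.count(_.value == 1) >= 3 &&        // needs at least three PMC +1s
  !pmc.exists(_.value == -1)             // a single PMC -1 acts as a veto
}

Under that reading, three PMC +1s plus one PMC -1 would still fail, which
is exactly the part the document leaves implicit.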


Re: Spark Improvement Proposals

2017-02-13 Thread Reynold Xin
Here's a new draft that incorporated most of the feedback:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

I added a specific role for SPIP Author and another one for SPIP Shepherd.
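
For illustration only, the roles and lifecycle states the draft describes
could be modeled roughly as below; all names here are hypothetical and not
part of the document itself:

// Rough illustrative model of the draft's SPIP roles and states,
// assuming the Author/Shepherd roles and accepted/rejected outcomes.
sealed trait SpipState
case object UnderDiscussion extends SpipState
case object UnderVote extends SpipState
case object Accepted extends SpipState
case object Rejected extends SpipState

case class Spip(
    title: String,
    authors: Seq[String],      // SPIP Author role: drives consensus
    shepherd: Option[String],  // SPIP Shepherd role: a PMC member
    state: SpipState)

Modeling the shepherd as an Option mirrors the open question in this
thread about whether that role must be filled for a proposal to proceed.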


Re: Spark Improvement Proposals

2017-02-11 Thread Xiao Li
During the summit, I also had a lot of discussions over similar topics with
multiple Committers and active users. I heard many fantastic ideas. I
believe Spark improvement proposals are good channels to collect the
requirements/designs.


IMO, we also need to consider the priority when working on these items.
Even if the proposal is accepted, it does not mean it will be implemented
and merged immediately. It is not a FIFO queue.


Even after some PRs are merged, we sometimes still have to revert them if
the design and implementation were not reviewed carefully. We have to
ensure quality. Spark is not application software; it is infrastructure
software used by many, many companies. We have to be very careful in the
design and implementation, especially when adding or changing external
APIs.


When I developed mainframe infrastructure/middleware software over the
past 6 years, I was involved in discussions with external and internal
customers. The to-do feature list always held more than 100 items.
Sometimes customers felt frustrated when we were unable to deliver on time
due to resource limits and other constraints. Even when they paid us
billions, we still needed to proceed phase by phase, or sometimes they had
to accept workarounds. That is the reality everyone has to face, I think.


Thanks,


Xiao Li


Re: Spark Improvement Proposals

2017-02-11 Thread Cody Koeninger
At the Spark Summit this week, everyone from PMC members to users I had
never met before was asking me about the Spark improvement proposals
idea.  It's clear that it's a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserved committers.
Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

> +1 on all counts (consensus, time bound, define roles)
>
> I can update the doc in the next few days and share back. Then maybe we
> can just officially vote on this. As Tim suggested, we might not get it
> 100% right the first time and would need to re-iterate. But that's fine.
>
>
> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com>
> wrote:
>
>> Hi Cody,
>> thank you for bringing up this topic, I agree it is very important to
>> keep a cohesive community around some common, fluid goals. Here are a few
>> comments about the current document:
>>
>> 1. name: it should not overlap with an existing one such as SIP. Can you
>> imagine someone trying to discuss a scala spore proposal for spark?
>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>> sounds great.
>>
>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>> technical decisions with a lasting impact. As such, the template should
>> emphasize the role of the various parties during this process:
>>
>>  - the SPIP author is responsible for building consensus. She is the
>> champion driving the process forward and is responsible for ensuring that
>> the SPIP follows the general guidelines. The author should be identified in
>> the SPIP. The authorship of a SPIP can be transferred if the current author
>> is not interested and someone else wants to move the SPIP forward. There
>> should probably be 2-3 authors at most for each SPIP.
>>
>>  - someone with voting power should probably shepherd the SPIP (and be
>> recorded as such): ensuring that the final decision over the SPIP is
>> recorded (rejected, accepted, etc.), and advising about the technical
>> quality of the SPIP: this person need not be a champion for the SPIP or
>> contribute to it, but rather makes sure it stands a chance of being
>> approved when the vote happens. Also, if the author cannot find anyone who
>> would want to take this role, this proposal is likely to be rejected anyway.
>>
>>  - users, committers, contributors have the roles already outlined in the
>> document
>>
>> 3. timeline: ideally, once a SPIP has been offered for voting, it should
>> move swiftly into either being accepted or rejected, so that we do not end
>> up with a distracting long tail of half-hearted proposals.
>>
>> These rules are meant to be flexible, but the current document should be
>> clear about who is in charge of a SPIP, and the state it is currently in.
>>
>> We have had long discussions over some very important questions such as
>> approval. I do not have an opinion on these, but why not make a pick and
>> reevaluate this decision later? This is not a binding process at this point.
>>
>> Tim
>>
>>
>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't have a concern about voting vs consensus.
>>>
>>> I have a concern that whatever the decision making process is, it is
>>> explicitly announced on the ticket for the given proposal, with an explicit
>>> deadline, and an explicit outcome.
>>>
>>>
>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com>
>>> wrote:
>>>
>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>
>>>> My take on the specific issues Joseph mentioned:
>>>>
>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>> earlier for consensus:
>>>>
>>>> > Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> 2) Design doc template 



Re: Spark Improvement Proposals

2017-01-03 Thread Cody Koeninger
>>> >> >>> > I've taken Cody's doc and edited it:
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> >> >>> > (I've made all my modifications trackable)
>>> >> >>> >
>>> >> >>> > There are couple high level changes I made:
>>> >> >>> >
>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>> consensus
>>> >> >>> > as
>>> >> >>> > opposed to voting. The reason being in voting there can easily
>>> be a
>>> >> >>> > "loser"
>>> >> >>> > that gets outvoted.
>>> >> >>> >
>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
>>> >> >>> > design
>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside
>>> from
>>> >> >>> > tagging
>>> >> >>> > things and linking them elsewhere simply having design docs and
>>> >> >>> > prototypes
>>> >> >>> > implementations in PRs is not something that has not worked so
>>> far".
>>> >> >>> >
>>> >> >>> > 3. I made some the language tweaks to focus more on visibility.
>>> For
>>> >> >>> > example,
>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than
>>> just
>>> >> >>> > "involve". SIPs should also have at least two emails that go to
>>> >> >>> > dev@.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > While I was editing this, I thought we really needed a suggested
>>> >> >>> > template
>>> >> >>> > for design doc too. I will get to that too ...
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>> r...@databricks.com>
>>> >> >>> > wrote:
>>> >> >>> >>
>>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>>> >> >>> >> closer
>>> >> >>> >> look
>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>> >> >>> >> <van...@cloudera.com>
>>> >> >>> >> wrote:
>>> >> >>> >>>
>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>> >> >>> >>> explicitly
>>> >> >>> >>> called, that voting would happen by e-mail? A template for the
>>> >> >>> >>> proposal document (instead of just a bullet nice) would also
>>> be
>>> >> >>> >>> nice,
>>> >> >>> >>> but that can be done at any time.
>>> >> >>> >>>
>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>>> >> >>> >>> candidate
>>> >> >>> >>> for a SIP, given the scope of the work. The document attached
>>> even
>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to
>>> try
>>> >> >>> >>> out
>>> >> >>> >>> the process...


Re: Spark Improvement Proposals

2017-01-03 Thread Joseph Bradley
> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >> >>> >>> > <tomasz.gaw...@outlook.com> wrote:
> >> >>> >>> >> Maybe my mail was not clear enough.
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
> >> >>> >>> >> framework.
> >> >>> >>> >> The
> >> >>> >>> >> idea with benchmarks was to show two things:
> >> >>> >>> >>
> >> >>> >>> >> - why some people are doing bad PR for Spark
> >> >>> >>> >>
> >> >>> >>> >> - how - in easy way - we can change it and show that Spark is
> >> >>> >>> >> still on
> >> >>> >>> >> the
> >> >>> >>> >> top
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
> think
> >> >>> >>> >> they're the
> >> >>> >>> >> most important thing in Spark :) On the Spark main page there
> >> >>> >>> >> is
> >> >>> >>> >> still
> >> >>> >>> >> chart
> >> >>> >>> >> "Spark vs Hadoop". It is important to show that framework is
> >> >>> >>> >> not
> >> >>> >>> >> the
> >> >>> >>> >> same
> >> >>> >>> >> Spark with other API, but much faster and optimized,
> comparable
> >> >>> >>> >> or
> >> >>> >>> >> even
> >> >>> >>> >> faster than other frameworks.
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> About real-time streaming, I think it would be just good to
> see
> >> >>> >>> >> it
> >> >>> >>> >> in
> >> >>> >>> >> Spark.
> >> >>> >>> >> I very like current Spark model, but many voices that says
> "we
> >> >>> >>> >> need
> >> >>> >>> >> more" -
> >> >>> >>> >> community should listen also them an


Re: Spark Improvement Proposals

2016-11-08 Thread Cody Koeninger
ile I was editing this, I thought we really needed a suggested
>>> > template
>>> > for design doc too. I will get to that too ...
>>> >
>>> >
>>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com>
>>> > wrote:
>>> >>
>>> >> Most things looked OK to me too, although I do plan to take a closer
>>> >> look
>>> >> after Nov 1st when we cut the release branch for 2.1.
>>> >>
>>> >>
>>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
>>> >> wrote:
>>> >>>
>>> >>> The proposal looks OK to me. I assume, even though it's not
>>> >>> explicitly
>>> >>> called, that voting would happen by e-mail? A template for the
>>> >>> proposal document (instead of just a bullet nice) would also be nice,
>>> >>> but that can be done at any time.
>>> >>>
>>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> >>> for a SIP, given the scope of the work. The document attached even
>>> >>> somewhat matches the proposed format. So if anyone wants to try out
>>> >>> the process...
>>> >>>
>>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
>>> >>> wrote:
>>> >>> > Now that spark summit europe is over, are any committers interested
>>> >>> > in
>>> >>> > moving forward with this?
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > Or are we going to let this discussion die on the vine?
>>> >>> >
>>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>> >>> > <tomasz.gaw...@outlook.com> wrote:
>>> >>> >> Maybe my mail was not clear enough.
>>> >>> >>
>>> >>> >>
>>> >>> >> I didn't want to write "lets focus on Flink" or any other
>>> >>> >> framework.
>>> >>> >> The
>>> >>> >> idea with benchmarks was to show two things:
>>> >>> >>
>>> >>> >> - why some people are doing bad PR for Spark
>>> >>> >>
>>> >>> >> - how - in easy way - we can change it and show that Spark is
>>> >>> >> still on
>>> >>> >> the
>>> >>> >> top
>>> >>> >>
>>> >>> >>
>>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
>>> >>> >> they're the
>>> >>> >> most important thing in Spark :) On the Spark main page there is
>>> >>> >> still
>>> >>> >> chart
>>> >>> >> "Spark vs Hadoop". It is important to show that framework is not
>>> >>> >> the
>>> >>> >> same
>>> >>> >> Spark with other API, but much faster and optimized, comparable or
>>> >>> >> even
>>> >>> >> faster than other frameworks.
>>> >>> >>
>>> >>> >>
>>> >>> >> About real-time streaming, I think it would be just good to see it
>>> >>> >> in
>>> >>> >> Spark.
>>> >>> >> I very like current Spark model, but many voices that says "we
>>> >>> >> need
>>> >>> >> more" -
>>> >>> >> community should listen also them and try to help them. With SIPs
>>> >>> >> it
>>> >>> >> would
>>> >>> >> be easier, I've just posted this example as "thing that may be
>>> >>> >> changed
>>> >>> >> with
>>> >>> >> SIP".
>>> >>> >>
>>> >>> >>
>>> >>> >> I very like unification via Datasets, but there is a lot of
>>> >>> >> algorithms
>>> >>> >> inside - let's mak

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
;> >>> for a SIP, given the scope of the work. The document attached even
>> >>> somewhat matches the proposed format. So if anyone wants to try out
>> >>> the process...
>> >>>
>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
>> >>> wrote:
>> >>> > Now that spark summit europe is over, are any committers interested
>> in
>> >>> > moving forward with this?
>> >>> >
>> >>> >
>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-i
>> mprovement-proposals.md
>> >>> >
>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >
>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> >>> > <tomasz.gaw...@outlook.com> wrote:
>> >>> >> Maybe my mail was not clear enough.
>> >>> >>
>> >>> >>
>> >>> >> I didn't want to write "lets focus on Flink" or any other
>> framework.
>> >>> >> The
>> >>> >> idea with benchmarks was to show two things:
>> >>> >>
>> >>> >> - why some people are doing bad PR for Spark
>> >>> >>
>> >>> >> - how - in easy way - we can change it and show that Spark is
>> still on
>> >>> >> the
>> >>> >> top
>> >>> >>
>> >>> >>
>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> >>> >> they're the
>> >>> >> most important thing in Spark :) On the Spark main page there is
>> still
>> >>> >> chart
>> >>> >> "Spark vs Hadoop". It is important to show that framework is not
>> the
>> >>> >> same
>> >>> >> Spark with other API, but much faster and optimized, comparable or
>> >>> >> even
>> >>> >> faster than other frameworks.
>> >>> >>
>> >>> >>
>> >>> >> About real-time streaming, I think it would be just good to see it
>> in
>> >>> >> Spark.
>> >>> >> I very like current Spark model, but many voices that says "we need
>> >>> >> more" -
>> >>> >> community should listen also them and try to help them. With SIPs
>> it
>> >>> >> would
>> >>> >> be easier, I've just posted this example as "thing that may be
>> changed
>> >>> >> with
>> >>> >> SIP".
>> >>> >>
>> >>> >>
>> >>> >> I very like unification via Datasets, but there is a lot of
>> algorithms
>> >>> >> inside - let's make easy API, but with strong background (articles,
>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
>> modern
>> >>> >> framework.
>> >>> >>
>> >>> >>
>> >>> >> Maybe now my intention will be clearer :) As I said organizational
>> >>> >> ideas
>> >>> >> were already mentioned and I agree with them, my mail was just to
>> show
>> >>> >> some
>> >>> >> aspects from my side, so from theside of developer and person who
>> is
>> >>> >> trying
>> >>> >> to help others with Spark (via StackOverflow or other ways)
>> >>> >>
>> >>> >>
>> >>> >> Pozdrawiam / Best regards,
>> >>> >>
>> >>> >> Tomasz
>> >>> >>
>> >>> >>
>> >>> >> 
>> >>> >> Od: Cody Koeninger <c...@koeninger.org>
>> >>> >> Wysłane: 17 października 2016 16:46
>> >>> >> Do: Debasish Das
>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>> >>> >> Temat: Re: Spark Improvement Proposals
>> >>> >>
>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
>> point.
>> >>> >>
>> >>> >> My point is evolve or die.  Spark's governance and organization is
>> >>> >&

Re: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
; > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >>> >
> >>> > Or are we going to let this discussion die on the vine?
> >>> >
> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >>> > <tomasz.gaw...@outlook.com <javascript:;>> wrote:
> >>> >> Maybe my mail was not clear enough.
> >>> >>
> >>> >>
> >>> >> I didn't want to write "lets focus on Flink" or any other framework.
> >>> >> The
> >>> >> idea with benchmarks was to show two things:
> >>> >>
> >>> >> - why some people are doing bad PR for Spark
> >>> >>
> >>> >> - how - in easy way - we can change it and show that Spark is still
> on
> >>> >> the
> >>> >> top
> >>> >>
> >>> >>
> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >> they're the
> >>> >> most important thing in Spark :) On the Spark main page there is
> still
> >>> >> chart
> >>> >> "Spark vs Hadoop". It is important to show that framework is not the
> >>> >> same
> >>> >> Spark with other API, but much faster and optimized, comparable or
> >>> >> even
> >>> >> faster than other frameworks.
> >>> >>
> >>> >>
> >>> >> About real-time streaming, I think it would be just good to see it
> in
> >>> >> Spark.
> >>> >> I very like current Spark model, but many voices that says "we need
> >>> >> more" -
> >>> >> community should listen also them and try to help them. With SIPs it
> >>> >> would
> >>> >> be easier, I've just posted this example as "thing that may be
> changed
> >>> >> with
> >>> >> SIP".
> >>> >>
> >>> >>
> >>> >> I very like unification via Datasets, but there is a lot of
> algorithms
> >>> >> inside - let's make easy API, but with strong background (articles,
> >>> >> benchmarks, descriptions, etc) that shows that Spark is still modern
> >>> >> framework.
> >>> >>
> >>> >>
> >>> >> Maybe now my intention will be clearer :) As I said organizational
> >>> >> ideas
> >>> >> were already mentioned and I agree with them, my mail was just to
> show
> >>> >> some
> >>> >> aspects from my side, so from theside of developer and person who is
> >>> >> trying
> >>> >> to help others with Spark (via StackOverflow or other ways)
> >>> >>
> >>> >>
> >>> >> Pozdrawiam / Best regards,
> >>> >>
> >>> >> Tomasz
> >>> >>
> >>> >>
> >>> >> 
> >>> >> Od: Cody Koeninger <c...@koeninger.org <javascript:;>>
> >>> >> Wysłane: 17 października 2016 16:46
> >>> >> Do: Debasish Das
> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org <javascript:;>
> >>> >> Temat: Re: Spark Improvement Proposals
> >>> >>
> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
> point.
> >>> >>
> >>> >> My point is evolve or die.  Spark's governance and organization is
> >>> >> hampering its ability to evolve technologically, and it needs to
> >>> >> change.
> >>> >>
> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >>> >> <debasish.da...@gmail.com <javascript:;>>
> >>> >> wrote:
> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in
> 2014
> >>> >>> as
> >>> >>> soon as I looked into it since compared to writing Java map-reduce
> >>> >>> and
> >>> >>> Cascading code, Spark made writing distributed code fun...But now
> as
> >>> >>> we
> >>> >>> went
> >>> >>> deeper with Spark and real-time streaming use-case gets more
> >>> >>>

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Cody Koeninger
> - how - in easy way - we can change it and show that Spark is still on
>>> >> the
>>> >> top
>>> >>
>>> >>
>>> >> No more, no less. Benchmarks will be helpful, but I don't think
>>> >> they're the
>>> >> most important thing in Spark :) On the Spark main page there is still
>>> >> chart
>>> >> "Spark vs Hadoop". It is important to show that framework is not the
>>> >> same
>>> >> Spark with other API, but much faster and optimized, comparable or
>>> >> even
>>> >> faster than other frameworks.
>>> >>
>>> >>
>>> >> About real-time streaming, I think it would be just good to see it in
>>> >> Spark.
>>> >> I very like current Spark model, but many voices that says "we need
>>> >> more" -
>>> >> community should listen also them and try to help them. With SIPs it
>>> >> would
>>> >> be easier, I've just posted this example as "thing that may be changed
>>> >> with
>>> >> SIP".
>>> >>
>>> >>
>>> >> I very like unification via Datasets, but there is a lot of algorithms
>>> >> inside - let's make easy API, but with strong background (articles,
>>> >> benchmarks, descriptions, etc) that shows that Spark is still modern
>>> >> framework.
>>> >>
>>> >>
>>> >> Maybe now my intention will be clearer :) As I said organizational
>>> >> ideas
>>> >> were already mentioned and I agree with them, my mail was just to show
>>> >> some
>>> >> aspects from my side, so from theside of developer and person who is
>>> >> trying
>>> >> to help others with Spark (via StackOverflow or other ways)
>>> >>
>>> >>
>>> >> Pozdrawiam / Best regards,
>>> >>
>>> >> Tomasz
>>> >>
>>> >>
>>> >> 
>>> >> Od: Cody Koeninger <c...@koeninger.org>
>>> >> Wysłane: 17 października 2016 16:46
>>> >> Do: Debasish Das
>>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>> >> Temat: Re: Spark Improvement Proposals
>>> >>
>>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>> >>
>>> >> My point is evolve or die.  Spark's governance and organization is
>>> >> hampering its ability to evolve technologically, and it needs to
>>> >> change.
>>> >>
>>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>> >> <debasish.da...@gmail.com>
>>> >> wrote:
>>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014
>>> >>> as
>>> >>> soon as I looked into it since compared to writing Java map-reduce
>>> >>> and
>>> >>> Cascading code, Spark made writing distributed code fun...But now as
>>> >>> we
>>> >>> went
>>> >>> deeper with Spark and real-time streaming use-case gets more
>>> >>> prominent, I
>>> >>> think it is time to bring a messaging model in conjunction with the
>>> >>> batch/micro-batch API that Spark is good atakka-streams close
>>> >>> integration with spark micro-batching APIs looks like a great
>>> >>> direction to
>>> >>> stay in the game with Apache Flink...Spark 2.0 integrated streaming
>>> >>> with
>>> >>> batch with the assumption is that micro-batching is sufficient to run
>>> >>> SQL
>>> >>> commands on stream but do we really have time to do SQL processing at
>>> >>> streaming data within 1-2 seconds ?
>>> >>>
>>> >>> After reading the email chain, I started to look into Flink
>>> >>> documentation
>>> >>> and if you compare it with Spark documentation, I think we have major
>>> >>> work
>>> >>> to do detailing out Spark internals so that more people from
>>> >>> community
>>> >>> start
>>> >>> to take active role in improving the issues so that Spark stays
>>> >>> strong
>>> >>&

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Reynold Xin
ut many voices that says "we need
>> more" -
>> >> community should listen also them and try to help them. With SIPs it
>> would
>> >> be easier, I've just posted this example as "thing that may be changed
>> with
>> >> SIP".
>> >>
>> >>
>> >> I very like unification via Datasets, but there is a lot of algorithms
>> >> inside - let's make easy API, but with strong background (articles,
>> >> benchmarks, descriptions, etc) that shows that Spark is still modern
>> >> framework.
>> >>
>> >>
>> >> Maybe now my intention will be clearer :) As I said organizational
>> ideas
>> >> were already mentioned and I agree with them, my mail was just to show
>> some
>> >> aspects from my side, so from theside of developer and person who is
>> trying
>> >> to help others with Spark (via StackOverflow or other ways)
>> >>
>> >>
>> >> Pozdrawiam / Best regards,
>> >>
>> >> Tomasz
>> >>
>> >>
>> >> 
>> >> Od: Cody Koeninger <c...@koeninger.org>
>> >> Wysłane: 17 października 2016 16:46
>> >> Do: Debasish Das
>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>> >> Temat: Re: Spark Improvement Proposals
>> >>
>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>> >>
>> >> My point is evolve or die.  Spark's governance and organization is
>> >> hampering its ability to evolve technologically, and it needs to
>> >> change.
>> >>
>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <
>> debasish.da...@gmail.com>
>> >> wrote:
>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014
>> as
>> >>> soon as I looked into it since compared to writing Java map-reduce and
>> >>> Cascading code, Spark made writing distributed code fun...But now as
>> we
>> >>> went
>> >>> deeper with Spark and real-time streaming use-case gets more
>> prominent, I
>> >>> think it is time to bring a messaging model in conjunction with the
>> >>> batch/micro-batch API that Spark is good atakka-streams close
>> >>> integration with spark micro-batching APIs looks like a great
>> direction to
>> >>> stay in the game with Apache Flink...Spark 2.0 integrated streaming
>> with
>> >>> batch with the assumption is that micro-batching is sufficient to run
>> SQL
>> >>> commands on stream but do we really have time to do SQL processing at
>> >>> streaming data within 1-2 seconds ?
>> >>>
>> >>> After reading the email chain, I started to look into Flink
>> documentation
>> >>> and if you compare it with Spark documentation, I think we have major
>> work
>> >>> to do detailing out Spark internals so that more people from community
>> >>> start
>> >>> to take active role in improving the issues so that Spark stays strong
>> >>> compared to Flink.
>> >>>
>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>> >>>
>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>> >>>
>> >>> Spark is no longer an engine that works for micro-batch and batch...We
>> >>> (and
>> >>> I am sure many others) are pushing spark as an engine for stream and
>> query
>> >>> processing.we need to make it a state-of-the-art engine for high
>> speed
>> >>> streaming data and user queries as well !
>> >>>
>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <
>> tomasz.gaw...@outlook.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi everyone,
>> >>>>
>> >>>> I'm quite late with my answer, but I think my suggestions may help a
>> >>>> little bit. :) Many technical and organizational topics were
>> mentioned,
>> >>>> but I want to focus on these negative posts about Spark and about
>> >>>> "haters"
>> >>>>
>> >>>> I really like Spark. Easy of use, speed, very good community - it's
>> >>>> everything here. But Every project has to "flight" on "

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-11-01 Thread Holden Karau
On that note there is some discussion on the Jira -
https://issues.apache.org/jira/browse/SPARK-13534 :)
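
For context, the user-facing payoff that JIRA is driving at is column-oriented
transfer instead of row-by-row pickling when collecting to pandas. Below is a
minimal sketch of that end state; the config key shown is the one that
eventually shipped in Spark 2.3, well after this thread, so treat it as a
forward-looking illustration rather than an API that exists today:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()

    # Opt in to Arrow-based columnar transfer for toPandas()
    # (config key as it later shipped in Spark 2.3).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.range(1000000).selectExpr("id", "id * 2 AS doubled")

    # With Arrow enabled, whole column batches are serialized at once
    # instead of pickling one row at a time through Py4J.
    pdf = df.toPandas()
    print(pdf.head())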

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Odp.: Spark Improvement Proposals

2016-11-01 Thread Reynold Xin
Most things looked OK to me too, although I do plan to take a closer look
after Nov 1st when we cut the release branch for 2.1.



Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread Holden Karau
I believe Bryan is also working on this a little - and I'm a little busy
with the other stuff but would love to stay in the loop on Arrow progress :)


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread mariusvniekerk
So I've been working on some very, very early-stage Apache Arrow integration.
My current plan is to emulate some of how the R function execution works.
If there are any other people working on similar efforts, it would be a good
idea to combine efforts.

I can see how much effort is involved in converting that PR to a Spark
package so that people can try to use it. I think this is something that we
want some more community iteration on, maybe?
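
To make that concrete, here is a minimal sketch, assuming only the pandas and
pyarrow packages (it is not code from the PR): a pandas DataFrame is written
once as an Arrow record-batch stream, the columnar format that either side of
a JVM/Python boundary could consume without row-by-row pickling.

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

    # pandas -> Arrow record batch (columnar; zero-copy where possible)
    batch = pa.RecordBatch.from_pandas(df)

    # Serialize to the Arrow IPC stream format, e.g. over a socket
    # between the JVM and a Python worker.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)

    # The receiving side reads the batches back without re-parsing rows.
    reader = pa.ipc.open_stream(sink.getvalue())
    print(reader.read_all().to_pandas())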


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Marcelo Vanzin
The proposal looks OK to me. I assume, even though it's not explicitly
called out, that voting would happen by e-mail? A template for the
proposal document (instead of just a bullet list) would also be nice,
but that can be done at any time.

BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
for a SIP, given the scope of the work. The document attached even
somewhat matches the proposed format. So if anyone wants to try out
the process...


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Ryan Blue
I agree, we should push forward on this. I think there is enough consensus
to call a vote, unless someone else thinks that there is more to discuss?

rb


Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Cody Koeninger
Now that Spark Summit Europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?


Odp.: Spark Improvement Proposals

2016-10-17 Thread Tomasz Gawęda
Maybe my mail was not clear enough.


I didn't want to write "let's focus on Flink" or any other framework. The idea
with benchmarks was to show two things:

- why some people are doing bad PR for Spark

- how, in an easy way, we can change it and show that Spark is still on top


No more, no less. Benchmarks will be helpful, but I don't think they're the
most important thing in Spark :) On the Spark main page there is still a chart,
"Spark vs Hadoop". It is important to show that the framework is not the same Spark
with another API, but much faster and more optimized, comparable to or even faster
than other frameworks.


About real-time streaming, I think it would simply be good to see it in Spark. I
really like the current Spark model, but there are many voices saying "we need more" -
the community should also listen to them and try to help them. With SIPs it would be
easier; I've just posted this example as a "thing that may be changed with a SIP".


I really like the unification via Datasets, but there are a lot of algorithms inside -
let's make an easy API, but with a strong background (articles, benchmarks,
descriptions, etc.) that shows that Spark is still a modern framework.


Maybe now my intention will be clearer :) As I said, organizational ideas were
already mentioned and I agree with them; my mail was just to show some aspects
from my side, so from the side of a developer and a person who is trying to help
others with Spark (via StackOverflow or other ways).


Pozdrawiam / Best regards,

Tomasz



From: Cody Koeninger <c...@koeninger.org>
Sent: 17 October 2016 16:46
To: Debasish Das
Cc: Tomasz Gawęda; dev@spark.apache.org
Subject: Re: Spark Improvement Proposals


Re: Spark Improvement Proposals

2016-10-17 Thread Cody Koeninger
I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization is
hampering its ability to evolve technologically, and it needs to
change.


Re: Spark Improvement Proposals(Internet mail)

2016-10-17 Thread 黄明
There's no need to compare with Flink's streaming model. Spark should focus more
on how to go beyond itself.


From the beginning, Spark's success has come from its unified model, which can
satisfy SQL, streaming, machine learning, and graph jobs ... all in one. But from
1.6 to 2.0, the move from the RDD abstraction to DataFrames has brought no
substantial progress to these two important areas (ML & Graph). Most of the work
has gone into SQL and streaming, which forces Spark to face competition with
Flink. But guys, this is not supposed to be the battle Spark should be fighting.


SIP is a good start. Voices from the technical community should be heard and
accepted, not buried in PR bodies. Nowadays, Spark doesn't lack committers
or contributors. The right direction and focus areas will decide where it goes,
what competitors it encounters, and finally what it can be.

---
Sincerely
Andy

Original Message
From: Debasish Das <debasish.da...@gmail.com>
To: Tomasz Gawęda <tomasz.gaw...@outlook.com>
Cc: dev@spark.apache.org <dev@spark.apache.org>; Cody
Koeninger <c...@koeninger.org>
Sent: Monday, October 17, 2016, 10:21
Subject: Re: Spark Improvement Proposals (Internet mail)


Re: Spark Improvement Proposals

2016-10-16 Thread Debasish Das
Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
soon as I looked into it, since compared to writing Java map-reduce and
Cascading code, Spark made writing distributed code fun... But now, as we
have gone deeper with Spark and the real-time streaming use-case gets more
prominent, I think it is time to bring a messaging model in conjunction
with the batch/micro-batch API that Spark is good at... akka-streams' close
integration with Spark's micro-batching APIs looks like a great direction to
stay in the game with Apache Flink... Spark 2.0 integrated streaming with
batch on the assumption that micro-batching is sufficient to run SQL
commands on a stream, but do we really have time to do SQL processing on
streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation,
and if you compare it with the Spark documentation, I think we have major work
to do detailing out Spark internals, so that more people from the community
start to take an active role in improving the issues and Spark stays
strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer just an engine that works for micro-batch and batch... we
(and I am sure many others) are pushing Spark as an engine for stream and query
processing... we need to make it a state-of-the-art engine for high-speed
streaming data and user queries as well!
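
To make the "1-2 seconds" question concrete, here is a minimal PySpark sketch of
SQL-style aggregation under the micro-batch model (assuming the Spark 2.x
Structured Streaming API and a local socket source on port 9999, both purely
illustrative). End-to-end latency is bounded below by the trigger interval plus
batch scheduling overhead, which is exactly the constraint being debated:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("microbatch-latency").getOrCreate()

    # Treat lines arriving on a socket as an unbounded table.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Standard SQL-style aggregation, recomputed on each micro-batch.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # A new batch starts at most once per second; results surface only
    # after the batch completes, so per-event latency is >= ~1 second.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .trigger(processingTime="1 second")
             .start())
    query.awaitTermination()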

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda 
wrote:

> Hi everyone,
>
> I'm quite late with my answer, but I think my suggestions may help a
> little bit. :) Many technical and organizational topics were mentioned,
> but I want to focus on these negative posts about Spark and about "haters"
>
> I really like Spark. Easy of use, speed, very good community - it's
> everything here. But Every project has to "flight" on "framework market"
> to be still no 1. I'm following many Spark and Big Data communities,
> maybe my mail will inspire someone :)
>
> You (every Spark developer; so far I didn't have enough time to join
> contributing to Spark) has done excellent job. So why are some people
> saying that Flink (or other framework) is better, like it was posted in
> this mailing list? No, not because that framework is better in all
> cases.. In my opinion, many of these discussions where started after
> Flink marketing-like posts. Please look at StackOverflow "Flink vs "
> posts, almost every post in "winned" by Flink. Answers are sometimes
> saying nothing about other frameworks, Flink's users (often PMC's) are
> just posting same information about real-time streaming, about delta
> iterations, etc. It look smart and very often it is marked as an aswer,
> even if - in my opinion - there wasn't told all the truth.
>
>
> My suggestion: I don't have enough money and knowledge to perform a huge
> performance test. Maybe some company that supports Spark (Databricks,
> Cloudera? - just saying, you're the most visible in the community :) )
> could run performance tests of:
>
> - the streaming engine - probably Spark will lose because of the
> mini-batch model, however the difference should now be much smaller than
> in previous versions
>
> - Machine Learning models
>
> - batch jobs
>
> - Graph jobs
>
> - SQL queries
>
> People will see that Spark is evolving and is also a modern framework,
> because after reading the posts mentioned above people may think "it is
> outdated, the future is in framework X".
>
> Matei Zaharia posted an excellent blog post about how Spark Structured
> Streaming beats every other framework in terms of ease-of-use and
> reliability. Performance tests, done in various environments (for
> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
> cluster), could also be very good marketing material to say "hey, you
> claim you're better, but Spark is still faster and keeps getting
> faster!". This would be based on facts (just numbers), not opinions. It
> would be good for companies, for marketing purposes and for every Spark
> developer.
>
>
> Second: real-time streaming. I've written some time ago about real-time
> streaming support in Spark Structured Streaming. Some work should be
> done to make SSS lower-latency, but I think it's possible. Maybe
> Spark could look at Gearpump, which is also built on top of Akka? I don't
> know yet; it is a good topic for a SIP. However, I think that Spark should
> have real-time streaming support. Currently I see many posts/comments
> saying that "Spark's latency is too high". Spark Streaming is doing a
> very good job with micro-batches; however, I think it is possible to
> also add more real-time processing.
>
> Other people have said much more, and I agree with the SIP proposal. I'm
> also happy that PMC members are not saying that they will not listen to
> users - they really want to make Spark better for every user.
>
>
> What do you think about these two topics? Especially I'm looking at Cody
> (who has started this topic) and PMCs :)

Re: Spark Improvement Proposals

2016-10-16 Thread Tomasz Gawęda
Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a 
little bit. :) Many technical and organizational topics were mentioned, 
but I want to focus on these negative posts about Spark and about "haters".

I really like Spark. Ease of use, speed, a very good community - it's 
all here. But every project has to "fight" in the "framework market" 
to stay number 1. I'm following many Spark and Big Data communities; 
maybe my mail will inspire someone :)

You (every Spark developer; so far I didn't have enough time to join in 
contributing to Spark) have done an excellent job. So why are some people 
saying that Flink (or another framework) is better, as was posted on 
this mailing list? No, not because that framework is better in all 
cases. In my opinion, many of these discussions were started after 
Flink's marketing-like posts. Please look at the StackOverflow "Flink vs" 
posts: almost every one is "won" by Flink. The answers sometimes say 
nothing about other frameworks; Flink's users (often PMC members) are 
just posting the same information about real-time streaming, about delta 
iterations, etc. It looks smart and very often it is marked as the answer, 
even if - in my opinion - the whole truth wasn't told.


My suggestion: I don't have enough money and knowledge to perform a huge 
performance test. Maybe some company that supports Spark (Databricks, 
Cloudera? - just saying, you're the most visible in the community :) ) 
could run performance tests of:

- the streaming engine - probably Spark will lose because of the mini-batch 
model, however the difference should now be much smaller than in 
previous versions

- Machine Learning models

- batch jobs

- Graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework, 
because after reading the posts mentioned above people may think "it is 
outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured 
Streaming beats every other framework in terms of ease-of-use and 
reliability. Performance tests, done in various environments (for 
example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node 
cluster), could also be very good marketing material to say "hey, you 
claim you're better, but Spark is still faster and keeps getting 
faster!". This would be based on facts (just numbers), not opinions. It 
would be good for companies, for marketing purposes and for every Spark 
developer.


Second: real-time streaming. I've written some time ago about real-time 
streaming support in Spark Structured Streaming. Some work should be 
done to make SSS lower-latency, but I think it's possible. Maybe 
Spark could look at Gearpump, which is also built on top of Akka? I don't 
know yet; it is a good topic for a SIP. However, I think that Spark should 
have real-time streaming support. Currently I see many posts/comments 
saying that "Spark's latency is too high". Spark Streaming is doing a very 
good job with micro-batches; however, I think it is possible to also add 
more real-time processing.

Other people have said much more, and I agree with the SIP proposal. I'm 
also happy that PMC members are not saying that they will not listen to 
users - they really want to make Spark better for every user.


What do you think about these two topics? Especially I'm looking at Cody 
(who has started this topic) and PMCs :)

Pozdrawiam / Best regards,

Tomasz


On 2016-10-07 at 04:51, Cody Koeninger wrote:
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
>
> But I just got back from the Reactive Summit, and this is what I observed:
>
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
>
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
>
> Right now Spark is suffering from its own success, and I think
> something needs to change.
>
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
>
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who 

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-14 Thread mariusvniekerk
So, for the Jupyter integration pieces:

I've made a simple library ( https://github.com/MaxPoint/spylon ) which
allows a simpler way of creating a SparkContext (with all the parameters
available to spark-submit), as well as some usability enhancements:
progress bars, tab completion for Spark configuration properties, and
easier loading of Scala objects via py4j.
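
For contrast, a minimal sketch of the stock boilerplate this shortens -
plain PySpark only, with illustrative config values (not spylon's API):

from pyspark import SparkConf, SparkContext

# Plain PySpark in a notebook: every spark-submit-style option has to be
# wired into a SparkConf by hand before the context exists.
conf = (SparkConf()
        .setAppName("notebook-session")
        .setMaster("local[2]")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)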






Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful,
and I'd also completely forgotten about the lack of built-in UDAF support,
which is rather important. There is a PR to make it easier to call/register
JVM UDFs from Python, which will hopefully help a bit there too. I'm getting
on a flight to London for OSCON, but I want to continue to encourage users to
chime in with their experiences (to that end I'm trying to re-include user@
since it doesn't seem to have been posted there despite my initial attempt
to do so.)

On Thursday, October 13, 2016, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> Hi,
>
> We are actually using pyspark heavily.
>
> I agree with all of your points; for me, I see the following as the main
> hurdles:
>
> 1.   Pyspark does not have support for UDAFs. We have had multiple
> needs for UDAFs and needed to go to Java/Scala to support them. Having
> Python UDAFs would have made life much easier (especially at earlier stages,
> when we prototype).
>
> 2.   Performance. I cannot stress this enough. Currently we have
> engineers who take Python UDFs and convert them to Scala UDFs for
> performance. We are even looking at writing UDFs and UDAFs in a more
> native way (e.g. using expressions) to improve performance, but working
> with pyspark can be really problematic.
>
>
>
> BTW, other than using Jython or Arrow, I believe there are a couple of
> other ways to improve performance:
>
> 1.   Python provides tools to generate an AST for Python code (
> https://docs.python.org/2/library/ast.html). This means we can use the
> AST to construct Scala code, very similar to how expressions are built for
> native Spark functions in Scala. Of course, doing a full conversion is very
> hard, but at least handling simple cases should be simple.
>
> 2.   The above would of course be limited if we use Python packages,
> but over time it is possible to add some “translation” tools (i.e. take
> Python packages and find the appropriate Scala equivalent). We can even
> let users supply their own conversions, so the code looks like regular
> Python but is converted to Scala code behind the scenes.
>
> 3.   In Scala, it is possible to use codegen to actually generate
> code from a string. There is no reason why we can’t write the expression in
> Python and provide a Scala string. This would mean learning some Scala, but
> it would mean we do not have to create a separate code tree.
>
>
>
> BTW, the fact that all of the tools to access Java are marked as private
> has me a little worried. Nearly all of our UDFs (and all of our UDAFs) are
> written in Scala for performance. The wrapping to provide them in Python
> uses way too many private elements for my taste.
>
>
>
>
>
> *From:* msukmanowsky [via Apache Spark Developers List]
> *Sent:* Thursday, October 13, 2016 3:51 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Python Spark Improvements (forked from Spark Improvement
> Proposals)
>
>
>
> As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all
> of the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
> Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.
>
>
> Being a Python shop, we were extremely pleased to learn about PySpark a
> few years ago as our main ETL pipeline used Apache Pig at the time. I was
> one of the only folks who understood Pig and Java so collaborating on this
> as a team was difficult.
>
> Spark provided a means for the entire team to collaborate, but we've hit
> our fair share of issues all of which are enumerated in this thread.
>
> Besides giving a +1 here, I think if I were to force rank these items for
> us, it'd be:
>
> 1. Configuration difficulties: we've lost literally weeks to
> troubleshooting memory issues for larger jobs. It took a long time to even
> understand *why* certain jobs were failing since Spark would just report
> executors being lost. Finally we tracked things down to understanding that
> spark.yarn.executor.memoryOverhead controls the portion of memory
> reserved for Python processes, but none of this is documented anywhere as
> far as I can tell. We discovered this via trial and error. Both
> documentation and better defaults for this setting when running a PySpark
> application are probably sufficient. We've also had a number of troubles
> with saving Parquet output as part of an ETL flow, but perhaps we'll save
> that for a blog post of its own.
>
> 2. Dependency management: I've tried to help move the conversation on
> https://issues.apache.org/jira/browse/SPARK-13587

RE: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread assaf.mendelson
Hi,
We are actually using pyspark heavily.
I agree with all of your points; for me, I see the following as the main 
hurdles:

1.   Pyspark does not have support for UDAFs. We have had multiple needs for 
UDAFs and needed to go to Java/Scala to support them. Having Python UDAFs would 
have made life much easier (especially at earlier stages, when we prototype).

2.   Performance. I cannot stress this enough. Currently we have engineers 
who take Python UDFs and convert them to Scala UDFs for performance. We are 
even looking at writing UDFs and UDAFs in a more native way (e.g. using 
expressions) to improve performance, but working with pyspark can be really 
problematic.

BTW, other than using Jython or Arrow, I believe there are a couple of other 
ways to improve performance:

1.   Python provides tools to generate an AST for Python code 
(https://docs.python.org/2/library/ast.html). This means we can use the AST to 
construct Scala code, very similar to how expressions are built for native Spark 
functions in Scala. Of course, doing a full conversion is very hard, but at 
least handling simple cases should be simple (a toy sketch follows this list).

2.   The above would of course be limited if we use Python packages, but 
over time it is possible to add some "translation" tools (i.e. take Python 
packages and find the appropriate Scala equivalent). We can even let users 
supply their own conversions, so the code looks like regular Python but is 
converted to Scala code behind the scenes.

3.   In Scala, it is possible to use codegen to actually generate code from 
a string. There is no reason why we can't write the expression in Python and 
provide a Scala string. This would mean learning some Scala, but it would mean 
we do not have to create a separate code tree.
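
To make point 1 above concrete, here is a toy sketch (function names are
mine, purely illustrative) that walks Python's AST and emits an
equivalent Scala expression string for the simple arithmetic case:

import ast

def to_scala(expr):
    # Toy translator: simple Python arithmetic expression -> Scala source.
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def walk(node):
        if isinstance(node, ast.BinOp):
            return "(%s %s %s)" % (walk(node.left), ops[type(node.op)],
                                   walk(node.right))
        if isinstance(node, ast.Name):   # column / variable reference
            return node.id
        if isinstance(node, ast.Num):    # numeric literal
            return repr(node.n)
        raise ValueError("unsupported construct: %r" % node)

    return walk(ast.parse(expr, mode="eval").body)

print(to_scala("x * 2 + y"))   # prints ((x * 2) + y)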

BTW, the fact that all of the tools to access Java are marked as private has me 
a little worried. Nearly all of our UDFs (and all of our UDAFs) are written in 
Scala for performance. The wrapping to provide them in Python uses way too many 
private elements for my taste (see the sketch below).
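
For reference, this is roughly the shape of that wrapping - a minimal
sketch where com.example.udfs.Registrar is a hypothetical Scala object on
the driver classpath exposing register(spark: SparkSession): Unit, and
_jvm and _jsparkSession are exactly the private internals I mean:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Have the (hypothetical) Scala helper register its UDFs with this
# session; both attributes touched here are private PySpark internals.
jvm = spark.sparkContext._jvm
jvm.com.example.udfs.Registrar.register(spark._jsparkSession)

# Once registered on the JVM side, the Scala UDF is callable from SQL:
spark.sql("SELECT my_scala_udf(value) FROM my_table").show()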


From: msukmanowsky [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19426...@n3.nabble.com]
Sent: Thursday, October 13, 2016 3:51 AM
To: Mendelson, Assaf
Subject: Re: Python Spark Improvements (forked from Spark Improvement Proposals)

As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of the 
issues raised by Holden and Ricardo. I'm also giving a talk at PyCon Canada on 
PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.

Being a Python shop, we were extremely pleased to learn about PySpark a few 
years ago as our main ETL pipeline used Apache Pig at the time. I was one of 
the only folks who understood Pig and Java so collaborating on this as a team 
was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our 
fair share of issues all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for us, 
it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting 
memory issues for larger jobs. It took a long time to even understand *why* 
certain jobs were failing since Spark would just report executors being lost. 
Finally we tracked things down to understanding that 
spark.yarn.executor.memoryOverhead controls the portion of memory reserved for 
Python processes, but none of this is documented anywhere as far as I can tell. 
We discovered this via trial and error. Both documentation and better defaults 
for this setting when running a PySpark application are probably sufficient. 
We've also had a number of troubles with saving Parquet output as part of an 
ETL flow, but perhaps we'll save that for a blog post of its own.

2. Dependency management: I've tried to help move the conversation on 
https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a bit 
stalled. Installing the required dependencies for a PySpark application is a 
really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages" and 
"difficulty using PySpark from outside of spark-submit / pyspark shell" here. 
When teaching PySpark to new users, I'm reminded of how much inside knowledge 
is needed to overcome esoteric errors. One example is hitting 
"PicklingError: Could not pickle object as excessively deep recursion 
required." errors. New users often do something innocent like try to pickle a 
global logging object and hit this and begin the Google -> stackoverflow search 
to try to comprehend what's going on. You can lose days to errors like these 
and they completely kill the productivity flow and send you hunting for 
alternatives.

4. Speed/performance: we are trying to use DataFrame/DataSets where we can and 
do as much in Java as possible but when we do move to Python, we're well aware 
that we're about to take a hit on perfo

Re: Spark Improvement Proposals

2016-10-12 Thread kant kodali
>> >>>> > I also like the names that are short and (mostly) unique, like SEP.
>> >>>> >
>> >>>> > Where I disagree is with the requirement that a committer must
>> >>>> > formally
>> >>>> > propose an enhancement. I don't see the value of restricting this:
>> if
>> >>>> > someone has the will to write up a proposal then they should be
>> >>>> > encouraged
>> >>>> > to do so and start a discussion about it. Even if there is a
>> political
>> >>>> > reality as Cody says, what is the value of codifying that in our
>> >>>> > process? I
>> >>>> > think restricting who can submit proposals would only undermine
>> them
>> >>>> > by
>> >>>> > pushing contributors out. Maybe I'm missing something here?
>> >>>> >
>> >>>> > rb
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <
>> c...@koeninger.org>
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Yes, users suggesting SIPs is a good thing and is explicitly
>> called
>> >>>> >> out in the linked document under the Who? section.  Formally
>> >>>> >> proposing
>> >>>> >> them, not so much, because of the political realities.
>> >>>> >>
>> >>>> >> Yes, implementation strategy definitely affects goals.  There are
>> all
>> >>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>> >>>> >> avoid sounding like I'm blaming:
>> >>>> >>
>> >>>> >> When I implemented the Kafka DStream, one of my (not explicitly
>> >>>> >> agreed
>> >>>> >> upon by the community) goals was to make sure people could use the
>> >>>> >> Dstream with however they were already using Kafka at work.  The
>> lack
>> >>>> >> of explicit agreement on that goal led to all kinds of fighting
>> with
>> >>>> >> committers, that could have been avoided.  The lack of explicit
>> >>>> >> up-front strategy discussion led to the DStream not really working
>> >>>> >> with compacted topics.  I knew about compacted topics, but don't
>> have
>> >>>> >> a use for them, so had a blind spot there.  If there was explicit
>> >>>> >> up-front discussion that my strategy was "assume that batches can
>> be
>> >>>> >> defined on the driver solely by beginning and ending offsets",
>> >>>> >> there's
>> >>>> >> a greater chance that a user would have seen that and said, "hey,
>> >>>> >> what
>> >>>> >> about non-contiguous offsets in a compacted topic".
>> >>>> >>
>> >>>> >> This kind of thing is only going to happen smoothly if we have a
>> >>>> >> lightweight user-visible process with clear outcomes.
>> >>>> >>
>> >>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> >>>> >> <assaf.mendel...@rsa.com> wrote:
>> >>>> >> > I agree with most of what Cody said.
>> >>>> >> >
>> >>>> >> > Two things:
>> >>>> >> >
>> >>>> >> > First we can always have other people suggest SIPs but mark
>> them as
>> >>>> >> > “unreviewed” and have committers basically move them forward.
>> The
>> >>>> >> > problem is
>> >>>> >> > that writing a good document takes time. This way we can
>> leverage
>> >>>> >> > non
>> >>>> >> > committers to do some of this work (it is just another way to
>> >>>> >> > contribute).
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > As for strategy, in many cases implementation strategy can
>> affect
>> >>>> >> > the
>> >>>> >> > goals.
>> >>>> >

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-12 Thread msukmanowsky
As very heavy Spark users at Parse.ly, I just wanted to give a +1 to all of
the issues raised by Holden and Ricardo. I'm also giving a talk at PyCon
Canada on PySpark https://2016.pycon.ca/en/schedule/096-mike-sukmanowsky/.

Being a Python shop, we were extremely pleased to learn about PySpark a few
years ago as our main ETL pipeline used Apache Pig at the time. I was one of
the only folks who understood Pig and Java so collaborating on this as a
team was difficult.

Spark provided a means for the entire team to collaborate, but we've hit our
fair share of issues all of which are enumerated in this thread.

Besides giving a +1 here, I think if I were to force rank these items for
us, it'd be:

1. Configuration difficulties: we've lost literally weeks to troubleshooting
memory issues for larger jobs. It took a long time to even understand *why*
certain jobs were failing since Spark would just report executors being
lost. Finally we tracked things down to understanding that
spark.yarn.executor.memoryOverhead controls the portion of memory reserved
for Python processes, but none of this is documented anywhere as far as I
can tell. We discovered this via trial and error. Both documentation and
better defaults for this setting when running a PySpark application are
probably sufficient (see the configuration sketch after this list). We've
also had a number of troubles with saving Parquet
output as part of an ETL flow, but perhaps we'll save that for a blog post
of its own.

2. Dependency management: I've tried to help move the conversation on
https://issues.apache.org/jira/browse/SPARK-13587 but it seems we're a bit
stalled. Installing the required dependencies for a PySpark application is a
really messy ordeal right now.

3. Development workflow: I'd combine both "incomprehensible error messages"
and "difficulty using PySpark from outside of spark-submit / pyspark shell"
here.
When teaching PySpark to new users, I'm reminded of how much inside
knowledge is needed to overcome esoteric errors. One example is hitting
"PicklingError: Could not pickle object as excessively deep recursion
required." errors. New users often do something innocent like try to pickle
a global logging object, hit this, and begin the Google -> StackOverflow
search to try to comprehend what's going on. You can lose days to errors
like these; they completely kill the productivity flow and send you
hunting for alternatives (a minimal repro sketch follows this list).

4. Speed/performance: we are trying to use DataFrame/DataSets where we can
and do as much in Java as possible but when we do move to Python, we're well
aware that we're about to take a hit on performance. We're very keen to see
what Apache Arrow does for things here.

5. API difficulties: I agree that when coming from Python, you'd expect that
you can do the same kinds of operations on DataFrames in Spark that you can
with Pandas, but I personally haven't been too bothered by this. Maybe I'm
more used to this situation from using other frameworks that have similar
concepts but incompatible implementations.
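
To make point 1 concrete, a minimal configuration sketch (values are
illustrative, not recommendations):

from pyspark import SparkConf, SparkContext

# On YARN, spark.yarn.executor.memoryOverhead (MiB) is the off-heap
# slice of the container where the Python worker processes live;
# leaving it at the default is what got our executors killed for
# exceeding container memory limits.
conf = (SparkConf()
        .setAppName("etl-job")
        .set("spark.executor.memory", "8g")
        .set("spark.yarn.executor.memoryOverhead", "2048"))
sc = SparkContext(conf=conf)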
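
And for the pickling trap in point 3, a minimal repro sketch of the
innocent-looking pattern; whether and how it fails depends on the Python
and cloudpickle versions, but the module-level logger is the culprit:

import logging
from pyspark import SparkContext

logger = logging.getLogger("etl")        # harmless-looking global

def parse(line):
    logger.info("parsing %s", line)      # the closure now drags the logger along
    return line.split(",")

sc = SparkContext(appName="pickle-trap")
# Shipping parse to executors forces Spark to serialize everything it
# references - logger, handlers, locks - which is where the cryptic
# PicklingError can come from.
print(sc.parallelize(["a,b", "c,d"]).map(parse).collect())

# The usual fix: resolve the logger inside the function, on the executor.
def parse_fixed(line):
    logging.getLogger("etl").info("parsing %s", line)
    return line.split(",")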

We're big fans of PySpark and are happy to provide feedback and contribute
wherever we can.






Re: Spark Improvement Proposals

2016-10-11 Thread Ryan Blue
> >>>> > reality as Cody says, what is the value of codifying that in our
> >>>> > process? I
> >>>> > think restricting who can submit proposals would only undermine them
> >>>> > by
> >>>> > pushing contributors out. Maybe I'm missing something here?
> >>>> >
> >>>> > rb
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org
> >
> >>>> > wrote:
> >>>> >>
> >>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
> >>>> >> out in the linked document under the Who? section.  Formally
> >>>> >> proposing
> >>>> >> them, not so much, because of the political realities.
> >>>> >>
> >>>> >> Yes, implementation strategy definitely affects goals.  There are
> all
> >>>> >> kinds of examples of this, I'll pick one that's my fault so as to
> >>>> >> avoid sounding like I'm blaming:
> >>>> >>
> >>>> >> When I implemented the Kafka DStream, one of my (not explicitly
> >>>> >> agreed
> >>>> >> upon by the community) goals was to make sure people could use the
> >>>> >> Dstream with however they were already using Kafka at work.  The
> lack
> >>>> >> of explicit agreement on that goal led to all kinds of fighting
> with
> >>>> >> committers, that could have been avoided.  The lack of explicit
> >>>> >> up-front strategy discussion led to the DStream not really working
> >>>> >> with compacted topics.  I knew about compacted topics, but don't
> have
> >>>> >> a use for them, so had a blind spot there.  If there was explicit
> >>>> >> up-front discussion that my strategy was "assume that batches can
> be
> >>>> >> defined on the driver solely by beginning and ending offsets",
> >>>> >> there's
> >>>> >> a greater chance that a user would have seen that and said, "hey,
> >>>> >> what
> >>>> >> about non-contiguous offsets in a compacted topic".
> >>>> >>
> >>>> >> This kind of thing is only going to happen smoothly if we have a
> >>>> >> lightweight user-visible process with clear outcomes.
> >>>> >>
> >>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
> >>>> >> <assaf.mendel...@rsa.com> wrote:
> >>>> >> > I agree with most of what Cody said.
> >>>> >> >
> >>>> >> > Two things:
> >>>> >> >
> >>>> >> > First we can always have other people suggest SIPs but mark them
> as
> >>>> >> > “unreviewed” and have committers basically move them forward. The
> >>>> >> > problem is
> >>>> >> > that writing a good document takes time. This way we can leverage
> >>>> >> > non
> >>>> >> > committers to do some of this work (it is just another way to
> >>>> >> > contribute).
> >>>> >> >
> >>>> >> >
> >>>> >> >
> >>>> >> > As for strategy, in many cases implementation strategy can affect
> >>>> >> > the
> >>>> >> > goals.
> >>>> >> > I will give  a small example: In the current structured streaming
> >>>> >> > strategy,
> >>>> >> > we group by the time to achieve a sliding window. This is
> >>>> >> > definitely an
> >>>> >> > implementation decision and not a goal. However, I can think of
> >>>> >> > several
> >>>> >> > aggregation functions which have the time inside their
> calculation
> >>>> >> > buffer.
> >>>> >> > For example, let’s say we want to return a set of all distinct
> >>>> >> > values.
> >>>> >> > One
> >>>> >> > way to implement this would be to make the set into a map and
> have
> >>>> >> > the
> >>>> >> > value
> >>>> >

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-10 Thread Holden Karau
I think it is really important to ensure that someone with a good
understanding of Kafka is empowered with a formal voice around this
component - but I don't have much dev experience with our Kafka
connectors, so I can't speak to the specifics personally.

More generally, I also feel pretty strongly about commit bits, and while
I've been going back through the old Python JIRAs and PRs it seems we are
leaving some good stuff out just because of reviewer bandwidth (not to
mention the people who get turned away from contributing more after their
first interaction, or lack thereof). Certainly the Python reviewer(s) know
their stuff - but it feels like for Python there just isn't enough
committer time available to handle the contributor interest. Although - to
be fair - this may be one of those cases where, as we add more committers,
we gain more contributors and still never have enough time, but I see that
as a positive cycle we should embrace.

I'm curious - are developers working mostly in other components feeling
similarly? I've sort of assumed so personally - but it would be nice to
hear others' experiences as well.

Of course my disclaimer from the original conversation applies
<http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-tp19268p19284.html>
- I do very much "have a horse in the race", so I will avoid proposing new
criteria. Working on Spark is a core part of what I do most days, and once
my day job with Spark is done I go and do even more Spark work - like the
new Spark book focused on performance I'm writing right now - and I very
much want to see a healthy community flourish around Spark :)

More thoughts in-line:

On Sat, Oct 8, 2016 at 5:03 PM, Cody Koeninger <c...@koeninger.org> wrote:

> It's not about technical design disagreement as to matters of taste,
> it's about familiarity with the domain.  To make an analogy, it's as
> if a committer in MLlib was firmly intent on, I dunno, treating a
> collection of categorical variables as if it were an ordered range of
> continuous variables.  It's just wrong.  That kind of thing, to a
> greater or lesser degree, has been going on related to the Kafka
> modules, for years.
>
> On Sat, Oct 8, 2016 at 4:11 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> > This makes a lot of sense; just to comment on a few things:
> >
> >> - More committers
> >> Just looking at the ratio of committers to open tickets, or committers
> >> to contributors, I don't think you have enough human power.
> >> I realize this is a touchy issue.  I don't have a dog in this fight,
> >> because I'm not on either coast nor in a big company that views
> >> committership as a political thing.  I just think you need more people
> >> to do the work, and more diversity of viewpoint.
> >> It's unfortunate that the Apache governance process involves giving
> >> someone all the keys or none of the keys, but until someone really
> >> starts screwing up, I think it's better to err on the side of
> >> accepting hard-working people.
> >
> > This is something the PMC is actively discussing. Historically, we've
> added committers when people contributed a new module or feature, basically
> to the point where other developers are asking them to review changes in
> that area (https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-BecomingaCommitter).
> For example, we added the original
> authors of GraphX when we merged in GraphX, the authors of new ML
> algorithms, etc. However, there's a good argument that some areas are
> simply not covered well now and we should add people there. Also, as the
> project has grown, there are also more people who focus on smaller fixes
> and are nonetheless contributing a lot.
>

I'm happy to hear this is something being actively discussed by the PMC.
I'm also glad the PMC took the time to create some documentation around
what it takes to be a committer - but, to me, it seems like there are maybe
some additional requirements or nuances to the requirements/process which
haven't quite been fully captured in the current wiki and I look forward to
seeing the result of the conversation and the clarity or changes it can
bring to the process.

I realize the default for the PMC may be to have the conversation around
this on private@ - but I think the dev (and maybe even user) community as a
whole is rather interested, and we all could benefit by working together on
this (or at least by being aware of the PMC's thoughts around it). With the
decisions and discussions around the committer process happening on the
private mailing list (or in person), it's really difficult as an outsider (or
a contributor interested in becoming a committer) to feel that one has a good
understanding of what is going on. Sean Owen and 

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
>>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>>> >> avoid sounding like I'm blaming:
>>>> >>
>>>> >> When I implemented the Kafka DStream, one of my (not explicitly
>>>> >> agreed
>>>> >> upon by the community) goals was to make sure people could use the
>>>> >> Dstream with however they were already using Kafka at work.  The lack
>>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>>> >> committers, that could have been avoided.  The lack of explicit
>>>> >> up-front strategy discussion led to the DStream not really working
>>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>>> >> a use for them, so had a blind spot there.  If there was explicit
>>>> >> up-front discussion that my strategy was "assume that batches can be
>>>> >> defined on the driver solely by beginning and ending offsets",
>>>> >> there's
>>>> >> a greater chance that a user would have seen that and said, "hey,
>>>> >> what
>>>> >> about non-contiguous offsets in a compacted topic".
>>>> >>
>>>> >> This kind of thing is only going to happen smoothly if we have a
>>>> >> lightweight user-visible process with clear outcomes.
>>>> >>
>>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>> >> <assaf.mendel...@rsa.com> wrote:
>>>> >> > I agree with most of what Cody said.
>>>> >> >
>>>> >> > Two things:
>>>> >> >
>>>> >> > First we can always have other people suggest SIPs but mark them as
>>>> >> > “unreviewed” and have committers basically move them forward. The
>>>> >> > problem is
>>>> >> > that writing a good document takes time. This way we can leverage
>>>> >> > non
>>>> >> > committers to do some of this work (it is just another way to
>>>> >> > contribute).
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > As for strategy, in many cases implementation strategy can affect
>>>> >> > the
>>>> >> > goals.
>>>> >> > I will give  a small example: In the current structured streaming
>>>> >> > strategy,
>>>> >> > we group by the time to achieve a sliding window. This is
>>>> >> > definitely an
>>>> >> > implementation decision and not a goal. However, I can think of
>>>> >> > several
>>>> >> > aggregation functions which have the time inside their calculation
>>>> >> > buffer.
>>>> >> > For example, let’s say we want to return a set of all distinct
>>>> >> > values.
>>>> >> > One
>>>> >> > way to implement this would be to make the set into a map and have
>>>> >> > the
>>>> >> > value
>>>> >> > contain the last time seen. Multiplying it across the groupby would
>>>> >> > cost
>>>> >> > a
>>>> >> > lot in performance. So adding such a strategy would have a great
>>>> >> > effect
>>>> >> > on
>>>> >> > the type of aggregations and their performance which does affect
>>>> >> > the
>>>> >> > goal.
>>>> >> > Without adding the strategy, it is easy for whoever goes to the
>>>> >> > design
>>>> >> > document to not think about these cases. Furthermore, it might be
>>>> >> > decided
>>>> >> > that these cases are rare enough so that the strategy is still good
>>>> >> > enough
>>>> >> > but how would we know it without user feedback?
>>>> >> >
>>>> >> > I believe this example is exactly what Cody was talking about.
>>>> >> > Since
>>>> >> > many
>>>> >> > times implementation strategies have a large effect on the goal, we
>>>> >> > should
>>>> >> > have it discussed when discussing the goals. In addition, while it
>>>>

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
>>> >> defined on the driver solely by beginning and ending offsets", there's
>>> >> a greater chance that a user would have seen that and said, "hey, what
>>> >> about non-contiguous offsets in a compacted topic".
>>> >>
>>> >> This kind of thing is only going to happen smoothly if we have a
>>> >> lightweight user-visible process with clear outcomes.
>>> >>
>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>> >> <assaf.mendel...@rsa.com> wrote:
>>> >> > I agree with most of what Cody said.
>>> >> >
>>> >> > Two things:
>>> >> >
>>> >> > First we can always have other people suggest SIPs but mark them as
>>> >> > “unreviewed” and have committers basically move them forward. The
>>> >> > problem is
>>> >> > that writing a good document takes time. This way we can leverage
>>> non
>>> >> > committers to do some of this work (it is just another way to
>>> >> > contribute).
>>> >> >
>>> >> >
>>> >> >
>>> >> > As for strategy, in many cases implementation strategy can affect
>>> the
>>> >> > goals.
>>> >> > I will give  a small example: In the current structured streaming
>>> >> > strategy,
>>> >> > we group by the time to achieve a sliding window. This is
>>> definitely an
>>> >> > implementation decision and not a goal. However, I can think of
>>> several
>>> >> > aggregation functions which have the time inside their calculation
>>> >> > buffer.
>>> >> > For example, let’s say we want to return a set of all distinct
>>> values.
>>> >> > One
>>> >> > way to implement this would be to make the set into a map and have
>>> the
>>> >> > value
>>> >> > contain the last time seen. Multiplying it across the groupby would
>>> cost
>>> >> > a
>>> >> > lot in performance. So adding such a strategy would have a great
>>> effect
>>> >> > on
>>> >> > the type of aggregations and their performance which does affect the
>>> >> > goal.
>>> >> > Without adding the strategy, it is easy for whoever goes to the
>>> design
>>> >> > document to not think about these cases. Furthermore, it might be
>>> >> > decided
>>> >> > that these cases are rare enough so that the strategy is still good
>>> >> > enough
>>> >> > but how would we know it without user feedback?
>>> >> >
>>> >> > I believe this example is exactly what Cody was talking about. Since
>>> >> > many
>>> >> > times implementation strategies have a large effect on the goal, we
>>> >> > should
>>> >> > have it discussed when discussing the goals. In addition, while it
>>> is
>>> >> > often
>>> >> > easy to throw out completely infeasible goals, it is often much
>>> harder
>>> >> > to
>>> >> > figure out that the goals are unfeasible without fine tuning.
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > Assaf.
>>> >> >
>>> >> >
>>> >> >
>>> >> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>> >> > [mailto:ml-node+[hidden email]]
>>> >> > Sent: Monday, October 10, 2016 2:25 AM
>>> >> > To: Mendelson, Assaf
>>> >> > Subject: Re: Spark Improvement Proposals
>>> >> >
>>> >> >
>>> >> >
>>> >> > Only committers should formally submit SIPs because in an apache
>> >>>> > project only committers have explicit political power.  If a user
>> can't
>> >>>> > find a committer willing to sponsor an SIP idea, they have no way to
>>> >> > get the idea passed in any case.  If I can't find a committer to
>>> >> > sponsor this meta-SIP idea, I'm out of luck.
>>> >> >
>>> >> > I do not believe unrealistic goals can be found solely by
>>> inspection.

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
I'm not a fan of the SEP acronym.  Besides its prior established meaning of
"Somebody Else's Problem", there are other inappropriate or offensive
connotations, such as this Australian slang that often gets shortened to
just "sep":  http://www.urbandictionary.com/define.php?term=Seppo

On Sun, Oct 9, 2016 at 4:00 PM, Nicholas Chammas  wrote:

> On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger  wrote:
>
>> Regarding name, if the SIP overlap is a concern, we can pick a different
>> name.
>>
>> My tongue in cheek suggestion would be
>>
>> Spark Lightweight Improvement process (SPARKLI)
>>
>
> If others share my minor concern about the SIP name, I propose Spark
> Enhancement Proposal (SEP), taking inspiration from the Python Enhancement
> Proposal name.
>
> So if we're going to number proposals like other projects do, they'd be
> numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.
>
> Another way to avoid a conflict is to stick with "Spark Improvement
> Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.
>
> Anyway, it's not a big deal. I just wanted to raise this point.
>
> Nick
>


Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
;>>>>>> +1 to votes to approve proposals. I agree that proposals should have an
>>>>>>> official mechanism to be accepted, and a vote is an established means
>>>>>>> of
>>>>>>> doing that well. I like that it includes a period to review the
>>>>>>> proposal and
>>>>>>> I think proposals should have been discussed enough ahead of a vote to
>>>>>>> survive the possibility of a veto.
>>>>>>>
>>>>>>> I also like the names that are short and (mostly) unique, like SEP.
>>>>>>>
>>>>>>> Where I disagree is with the requirement that a committer must formally
>>>>>>> propose an enhancement. I don't see the value of restricting this: if
>>>>>>> someone has the will to write up a proposal then they should be
>>>>>>> encouraged
>>>>>>> to do so and start a discussion about it. Even if there is a political
>>>>>>> reality as Cody says, what is the value of codifying that in our
>>>>>>> process? I
>>>>>>> think restricting who can submit proposals would only undermine them by
>>>>>>> pushing contributors out. Maybe I'm missing something here?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>>>>>> out in the linked document under the Who? section.  Formally proposing
>>>>>>>> them, not so much, because of the political realities.
>>>>>>>>
>>>>>>>> Yes, implementation strategy definitely affects goals.  There are all
>>>>>>>> kinds of examples of this, I'll pick one that's my fault so as to
>>>>>>>> avoid sounding like I'm blaming:
>>>>>>>>
>>>>>>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>>>>>>> upon by the community) goals was to make sure people could use the
>>>>>>>> Dstream with however they were already using Kafka at work.  The lack
>>>>>>>> of explicit agreement on that goal led to all kinds of fighting with
>>>>>>>> committers, that could have been avoided.  The lack of explicit
>>>>>>>> up-front strategy discussion led to the DStream not really working
>>>>>>>> with compacted topics.  I knew about compacted topics, but don't have
>>>>>>>> a use for them, so had a blind spot there.  If there was explicit
>>>>>>>> up-front discussion that my strategy was "assume that batches can be
>>>>>>>> defined on the driver solely by beginning and ending offsets", there's
>>>>>>>> a greater chance that a user would have seen that and said, "hey, what
>>>>>>>> about non-contiguous offsets in a compacted topic".
>>>>>>>>
>>>>>>>> This kind of thing is only going to happen smoothly if we have a
>>>>>>>> lightweight user-visible process with clear outcomes.
>>>>>>>>
>>>>>>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>>>>>> <assaf.mendel...@rsa.com> wrote:
>>>>>>>>> I agree with most of what Cody said.
>>>>>>>>>
>>>>>>>>> Two things:
>>>>>>>>>
>>>>>>>>> First we can always have other people suggest SIPs but mark them as
>>>>>>>>> “unreviewed” and have committers basically move them forward. The
>>>>>>>>> problem is
>>>>>>>>> that writing a good document takes time. This way we can leverage
>>>>>>>>> non
>>>>>>>>> committers to do some of this work (it is just another way to
>>>>>>>>> contribute).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As for strategy, in many cases implementation strategy can affect
>>>>>>>

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
>> >>> about non-contiguous offsets in a compacted topic".
>> >>>
>> >>> This kind of thing is only going to happen smoothly if we have a
>> >>> lightweight user-visible process with clear outcomes.
>> >>>
>> >>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> >>> <assaf.mendel...@rsa.com> wrote:
>> >>> > I agree with most of what Cody said.
>> >>> >
>> >>> > Two things:
>> >>> >
>> >>> > First we can always have other people suggest SIPs but mark them as
>> >>> > “unreviewed” and have committers basically move them forward. The
>> >>> > problem is
>> >>> > that writing a good document takes time. This way we can leverage
>> >>> > non
>> >>> > committers to do some of this work (it is just another way to
>> >>> > contribute).
>> >>> >
>> >>> >
>> >>> >
>> >>> > As for strategy, in many cases implementation strategy can affect
>> >>> > the
>> >>> > goals.
>> >>> > I will give  a small example: In the current structured streaming
>> >>> > strategy,
>> >>> > we group by the time to achieve a sliding window. This is definitely
>> >>> > an
>> >>> > implementation decision and not a goal. However, I can think of
>> >>> > several
>> >>> > aggregation functions which have the time inside their calculation
>> >>> > buffer.
>> >>> > For example, let’s say we want to return a set of all distinct
>> >>> > values.
>> >>> > One
>> >>> > way to implement this would be to make the set into a map and have
>> >>> > the
>> >>> > value
>> >>> > contain the last time seen. Multiplying it across the groupby would
>> >>> > cost a
>> >>> > lot in performance. So adding such a strategy would have a great
>> >>> > effect
>> >>> > on
>> >>> > the type of aggregations and their performance which does affect the
>> >>> > goal.
>> >>> > Without adding the strategy, it is easy for whoever goes to the
>> >>> > design
>> >>> > document to not think about these cases. Furthermore, it might be
>> >>> > decided
>> >>> > that these cases are rare enough so that the strategy is still good
>> >>> > enough
>> >>> > but how would we know it without user feedback?
>> >>> >
>> >>> > I believe this example is exactly what Cody was talking about. Since
>> >>> > many
>> >>> > times implementation strategies have a large effect on the goal, we
>> >>> > should
>> >>> > have it discussed when discussing the goals. In addition, while it
>> >>> > is
>> >>> > often
>> >>> > easy to throw out completely infeasible goals, it is often much
>> >>> > harder
>> >>> > to
>> >>> > figure out that the goals are unfeasible without fine tuning.
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > Assaf.
>> >>> >
>> >>> >
>> >>> >
>> >>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>> >>> > [mailto:ml-node+[hidden email]]
>> >>> > Sent: Monday, October 10, 2016 2:25 AM
>> >>> > To: Mendelson, Assaf
>> >>> > Subject: Re: Spark Improvement Proposals
>> >>> >
>> >>> >
>> >>> >
>> >>> > Only committers should formally submit SIPs because in an apache
>> >>> > project only committers have explicit political power.  If a user
>> >>> > can't
>> >>> > find a committer willing to sponsor an SIP idea, they have no way to
>> >>> > get the idea passed in any case.  If I can't find a committer to
>> >>> > sponsor this meta-SIP idea, I'm out of luck.
>> >>> >
>> >>> > I do not believe unrealistic goals can be found solely by
>> >>> > inspection.
>> >>> > We've managed to ignore unrealistic goals even after implementation!

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
>>> > contain the last time seen. Multiplying it across the groupby would cost a
>>> > lot in performance. So adding such a strategy would have a great effect
>>> > on
>>> > the type of aggregations and their performance which does affect the
>>> > goal.
>>> > Without adding the strategy, it is easy for whoever goes to the design
>>> > document to not think about these cases. Furthermore, it might be
>>> > decided
>>> > that these cases are rare enough so that the strategy is still good
>>> > enough
>>> > but how would we know it without user feedback?
>>> >
>>> > I believe this example is exactly what Cody was talking about. Since
>>> > many
>>> > times implementation strategies have a large effect on the goal, we
>>> > should
>>> > have it discussed when discussing the goals. In addition, while it is
>>> > often
>>> > easy to throw out completely infeasible goals, it is often much harder
>>> > to
>>> > figure out that the goals are unfeasible without fine tuning.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Assaf.
>>> >
>>> >
>>> >
>>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>>> > [mailto:ml-node+[hidden email]]
>>> > Sent: Monday, October 10, 2016 2:25 AM
>>> > To: Mendelson, Assaf
>>> > Subject: Re: Spark Improvement Proposals
>>> >
>>> >
>>> >
>>> > Only committers should formally submit SIPs because in an apache
>>> > project only committers have explicit political power.  If a user can't
>>> > find a committer willing to sponsor an SIP idea, they have no way to
>>> > get the idea passed in any case.  If I can't find a committer to
>>> > sponsor this meta-SIP idea, I'm out of luck.
>>> >
>>> > I do not believe unrealistic goals can be found solely by inspection.
>>> > We've managed to ignore unrealistic goals even after implementation!
>>> > Focusing on APIs can allow people to think they've solved something,
>>> > when there's really no way of implementing that API while meeting the
>>> > goals.  Rapid iteration is clearly the best way to address this, but
>>> > we've already talked about why that hasn't really worked.  If adding a
>>> > non-binding API section to the template is important to you, I'm not
>>> > against it, but I don't think it's sufficient.
>>> >
>>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>>> > PRD.  Clear agreement on goals is the most important thing and that's
>>> > why it's the thing I want binding agreement on.  But I cannot agree to
>>> > goals unless I have enough minimal technical info to judge whether the
>>> > goals are likely to actually be accomplished.
>>> >
>>> >
>>> >
>>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>>> >
>>> >
>>> >> Well, I think there are a few things here that don't make sense.
>>> >> First,
>>> >> why
>>> >> should only committers submit SIPs? Development in the project should
>>> >> be
>>> >> open to all contributors, whether they're committers or not. Second, I
>>> >> think
>>> >> unrealistic goals can be found just by inspecting the goals, and I'm
>>> >> not
>>> >> super worried that we'll accept a lot of SIPs that are then infeasible
>>> >> --
>>> >> we
>>> >> can then submit new ones. But this depends on whether you want this
>>> >> process
>>> >> to be a "design doc lite", where people also agree on implementation
>>> >> strategy, or just a way to agree on goals. This is what I asked
>>> >> earlier
>>> >> about PRDs vs design docs (and I'm open to either one but I'd just
>>> >> like
>>> >> clarity). Finally, both as a user and designer of software, I always
>>> >> want
>>> >> to
>>> >> give feedback on APIs, so I'd really like a culture of having those
>>> >> early.
>>> >> People don't argue about prettiness when they discuss APIs, they argue
>>> >> about
>>> >> the core concepts to expose in order to meet various goals, and then
>>> >> they're
>>> >> stuck maintaining those for a long time.

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
>> > document to not think about these cases. Furthermore, it might be
>> > decided
>> > that these cases are rare enough so that the strategy is still good
>> > enough
>> > but how would we know it without user feedback?
>> >
>> > I believe this example is exactly what Cody was talking about. Since
>> > many
>> > times implementation strategies have a large effect on the goal, we
>> > should
>> > have it discussed when discussing the goals. In addition, while it is
>> > often
>> > easy to throw out completely infeasible goals, it is often much harder
>> > to
>> > figure out that the goals are unfeasible without fine tuning.
>> >
>> >
>> >
>> >
>> >
>> > Assaf.
>> >
>> >
>> >
>> > From: Cody Koeninger-2 [via Apache Spark Developers List]
>> > [mailto:ml-node+[hidden email]]
>> > Sent: Monday, October 10, 2016 2:25 AM
>> > To: Mendelson, Assaf
>> > Subject: Re: Spark Improvement Proposals
>> >
>> >
>> >
>> > Only committers should formally submit SIPs because in an apache
>> > project only committers have explicit political power.  If a user can't
>> > find a committer willing to sponsor an SIP idea, they have no way to
>> > get the idea passed in any case.  If I can't find a committer to
>> > sponsor this meta-SIP idea, I'm out of luck.
>> >
>> > I do not believe unrealistic goals can be found solely by inspection.
>> > We've managed to ignore unrealistic goals even after implementation!
>> > Focusing on APIs can allow people to think they've solved something,
>> > when there's really no way of implementing that API while meeting the
>> > goals.  Rapid iteration is clearly the best way to address this, but
>> > we've already talked about why that hasn't really worked.  If adding a
>> > non-binding API section to the template is important to you, I'm not
>> > against it, but I don't think it's sufficient.
>> >
>> > On your PRD vs design doc spectrum, I'm saying this is closer to a
>> > PRD.  Clear agreement on goals is the most important thing and that's
>> > why it's the thing I want binding agreement on.  But I cannot agree to
>> > goals unless I have enough minimal technical info to judge whether the
>> > goals are likely to actually be accomplished.
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>> >
>> >
>> >> Well, I think there are a few things here that don't make sense. First,
>> >> why
>> >> should only committers submit SIPs? Development in the project should
>> >> be
>> >> open to all contributors, whether they're committers or not. Second, I
>> >> think
>> >> unrealistic goals can be found just by inspecting the goals, and I'm
>> >> not
>> >> super worried that we'll accept a lot of SIPs that are then infeasible
>> >> --
>> >> we
>> >> can then submit new ones. But this depends on whether you want this
>> >> process
>> >> to be a "design doc lite", where people also agree on implementation
>> >> strategy, or just a way to agree on goals. This is what I asked earlier
>> >> about PRDs vs design docs (and I'm open to either one but I'd just like
>> >> clarity). Finally, both as a user and designer of software, I always
>> >> want
>> >> to
>> >> give feedback on APIs, so I'd really like a culture of having those
>> >> early.
>> >> People don't argue about prettiness when they discuss APIs, they argue
>> >> about
>> >> the core concepts to expose in order to meet various goals, and then
>> >> they're
>> >> stuck maintaining those for a long time.
>> >>
>> >> Matei
>> >>
>> >> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <[hidden email]> wrote:
>> >>
>> >> Users instead of people, sure.  Committers and contributors are (or at
>> >> least
>> >> should be) a subset of users.
>> >>
>> >> Non goals, sure. I don't care what the name is, but we need to clearly
>> >> say
>> >> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>> >>
>> >> API, what I care most about is whether it allows me to accomplish the
>> >> goals.
>> >> Arguing about how ugly or

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
+1 to votes to approve proposals. I agree that proposals should have an
official mechanism to be accepted, and a vote is an established means of
doing that well. I like that it includes a period to review the proposal
and I think proposals should have been discussed enough ahead of a vote to
survive the possibility of a veto.

I also like the names that are short and (mostly) unique, like SEP.

Where I disagree is with the requirement that a committer must formally
propose an enhancement. I don't see the value of restricting this: if
someone has the will to write up a proposal then they should be encouraged
to do so and start a discussion about it. Even if there is a political
reality as Cody says, what is the value of codifying that in our process? I
think restricting who can submit proposals would only undermine them by
pushing contributors out. Maybe I'm missing something here?

rb



On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Yes, users suggesting SIPs is a good thing and is explicitly called
> out in the linked document under the Who? section.  Formally proposing
> them, not so much, because of the political realities.
>
> Yes, implementation strategy definitely affects goals.  There are all
> kinds of examples of this, I'll pick one that's my fault so as to
> avoid sounding like I'm blaming:
>
> When I implemented the Kafka DStream, one of my (not explicitly agreed
> upon by the community) goals was to make sure people could use the
>> >> DStream however they were already using Kafka at work.  The lack
> of explicit agreement on that goal led to all kinds of fighting with
> committers, that could have been avoided.  The lack of explicit
> up-front strategy discussion led to the DStream not really working
> with compacted topics.  I knew about compacted topics, but don't have
> a use for them, so had a blind spot there.  If there was explicit
> up-front discussion that my strategy was "assume that batches can be
> defined on the driver solely by beginning and ending offsets", there's
> a greater chance that a user would have seen that and said, "hey, what
> about non-contiguous offsets in a compacted topic".
>
> This kind of thing is only going to happen smoothly if we have a
> lightweight user-visible process with clear outcomes.
>
> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
> <assaf.mendel...@rsa.com> wrote:
> > I agree with most of what Cody said.
> >
> > Two things:
> >
> > First we can always have other people suggest SIPs but mark them as
> > “unreviewed” and have committers basically move them forward. The
> problem is
> > that writing a good document takes time. This way we can leverage
> > non-committers to do some of this work (it is just another way to
> contribute).
> >
> >
> >
> > As for strategy, in many cases implementation strategy can affect the
> goals.
> > I will give a small example: In the current structured streaming
> strategy,
> > we group by the time to achieve a sliding window. This is definitely an
> > implementation decision and not a goal. However, I can think of several
> > aggregation functions which have the time inside their calculation
> buffer.
> > For example, let’s say we want to return a set of all distinct values.
> One
> > way to implement this would be to make the set into a map and have the
> value
> > contain the last time seen. Multiplying it across the groupby would cost
> a
> > lot in performance. So adding such a strategy would have a great effect
> on
> > the type of aggregations and their performance which does affect the
> goal.
> > Without adding the strategy, it is easy for whoever goes to the design
> > document to not think about these cases. Furthermore, it might be decided
> > that these cases are rare enough so that the strategy is still good
> enough
> > but how would we know it without user feedback?
> >
> > I believe this example is exactly what Cody was talking about. Since many
> > times implementation strategies have a large effect on the goal, we
> should
> > have it discussed when discussing the goals. In addition, while it is
> often
> > easy to throw out completely infeasible goals, it is often much harder to
> > figure out that the goals are infeasible without fine tuning.
> >
> >
> >
> >
> >
> > Assaf.
> >
> >
> >
> > From: Cody Koeninger-2 [via Apache Spark Developers List]
> > [mailto:ml-node+[hidden email]]
> > Sent: Monday, October 10, 2016 2:25 AM
> > To: Mendelson, Assaf
> > Subject: Re: Spark Improvement Proposals
> >
> >
> >
> > Only committers should formally submit SIPs because in an Apache project only committers have explicit political power.

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
Yes, users suggesting SIPs is a good thing and is explicitly called
out in the linked document under the Who? section.  Formally proposing
them, not so much, because of the political realities.

Yes, implementation strategy definitely affects goals.  There are all
kinds of examples of this, I'll pick one that's my fault so as to
avoid sounding like I'm blaming:

When I implemented the Kafka DStream, one of my (not explicitly agreed
upon by the community) goals was to make sure people could use the
DStream however they were already using Kafka at work.  The lack
of explicit agreement on that goal led to all kinds of fighting with
committers, that could have been avoided.  The lack of explicit
up-front strategy discussion led to the DStream not really working
with compacted topics.  I knew about compacted topics, but don't have
a use for them, so had a blind spot there.  If there was explicit
up-front discussion that my strategy was "assume that batches can be
defined on the driver solely by beginning and ending offsets", there's
a greater chance that a user would have seen that and said, "hey, what
about non-contiguous offsets in a compacted topic".

This kind of thing is only going to happen smoothly if we have a
lightweight user-visible process with clear outcomes.
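
To make the offset assumption concrete, here is a minimal, hedged Scala
sketch. The names mirror, but are not, the actual spark-streaming-kafka
API; everything below is hypothetical and only illustrates the strategy
"batches are defined on the driver solely by beginning and ending offsets":

    // Hypothetical sketch, not the real spark-streaming-kafka classes.
    final case class OffsetRange(topic: String, partition: Int,
                                 fromOffset: Long, untilOffset: Long) {
      // Correct for a regular topic, where every offset in
      // [fromOffset, untilOffset) holds a record.
      def count: Long = untilOffset - fromOffset
    }

    object CompactedTopicGap {
      def main(args: Array[String]): Unit = {
        val batch = OffsetRange("events", 0, fromOffset = 100L, untilOffset = 200L)
        // On a compacted topic, old records for a key are deleted, so the
        // range has holes: the true record count can be well under 100, and
        // code that expects every offset to exist will break.
        println(s"assumed record count: ${batch.count}")
      }
    }

The strategy is fine for contiguous offsets; stating it up front is what
would have surfaced the compacted-topic case before implementation.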

On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
<assaf.mendel...@rsa.com> wrote:
> I agree with most of what Cody said.
>
> Two things:
>
> First we can always have other people suggest SIPs but mark them as
> “unreviewed” and have committers basically move them forward. The problem is
> that writing a good document takes time. This way we can leverage
> non-committers to do some of this work (it is just another way to contribute).
>
>
>
> As for strategy, in many cases implementation strategy can affect the goals.
> I will give a small example: In the current structured streaming strategy,
> we group by the time to achieve a sliding window. This is definitely an
> implementation decision and not a goal. However, I can think of several
> aggregation functions which have the time inside their calculation buffer.
> For example, let’s say we want to return a set of all distinct values. One
> way to implement this would be to make the set into a map and have the value
> contain the last time seen. Multiplying it across the groupby would cost a
> lot in performance. So adding such a strategy would have a great effect on
> the type of aggregations and their performance which does affect the goal.
> Without adding the strategy, it is easy for whoever goes to the design
> document to not think about these cases. Furthermore, it might be decided
> that these cases are rare enough so that the strategy is still good enough
> but how would we know it without user feedback?
>
> I believe this example is exactly what Cody was talking about. Since many
> times implementation strategies have a large effect on the goal, we should
> have it discussed when discussing the goals. In addition, while it is often
> easy to throw out completely infeasible goals, it is often much harder to
> figure out that the goals are infeasible without fine tuning.
>
>
>
>
>
> Assaf.
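
To make Assaf's distinct-values example concrete, here is a hedged,
standalone Scala sketch. All names are hypothetical and it uses plain
collections rather than the actual Structured Streaming aggregation API;
it only illustrates a buffer that keeps the last time each value was seen:

    // Hypothetical sketch of the buffer Assaf describes.
    object DistinctWithLastSeen {
      type Buffer = Map[String, Long] // value -> last event time seen

      def update(buf: Buffer, value: String, eventTime: Long): Buffer =
        buf.updated(value, math.max(eventTime, buf.getOrElse(value, Long.MinValue)))

      // Merging two partial buffers touches every key; replicated across
      // each window produced by grouping on time, this is the costly step.
      def merge(a: Buffer, b: Buffer): Buffer =
        (a.keySet ++ b.keySet).iterator.map { k =>
          k -> math.max(a.getOrElse(k, Long.MinValue), b.getOrElse(k, Long.MinValue))
        }.toMap

      // The user-visible answer is just the distinct set; the timestamps are
      // kept only so stale entries could be expired later.
      def finish(buf: Buffer): Set[String] = buf.keySet
    }

Whether that per-key merge cost is acceptable is exactly the strategy
question Assaf argues should be visible at proposal time.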
>
>
>
> From: Cody Koeninger-2 [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]]
> Sent: Monday, October 10, 2016 2:25 AM
> To: Mendelson, Assaf
> Subject: Re: Spark Improvement Proposals
>
>
>
> Only committers should formally submit SIPs because in an Apache
> project only committers have explicit political power.  If a user can't
> find a committer willing to sponsor an SIP idea, they have no way to
> get the idea passed in any case.  If I can't find a committer to
> sponsor this meta-SIP idea, I'm out of luck.
>
> I do not believe unrealistic goals can be found solely by inspection.
> We've managed to ignore unrealistic goals even after implementation!
> Focusing on APIs can allow people to think they've solved something,
> when there's really no way of implementing that API while meeting the
> goals.  Rapid iteration is clearly the best way to address this, but
> we've already talked about why that hasn't really worked.  If adding a
> non-binding API section to the template is important to you, I'm not
> against it, but I don't think it's sufficient.
>
> On your PRD vs design doc spectrum, I'm saying this is closer to a
> PRD.  Clear agreement on goals is the most important thing and that's
> why it's the thing I want binding agreement on.  But I cannot agree to
> goals unless I have enough minimal technical info to judge whether the
> goals are likely to actually be accomplished.
>
>
>
> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>
>
>> >> Well, I think there are a few things here that don't make sense.

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
>
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > "Goals: What must this allow people to do, that they can't currently?"
>> >
>> > Is it unclear that this is focusing specifically on people-visible
>> > behavior?
>> >
>> > Rejected goals -  are important because otherwise people keep trying
>> > to argue about scope.  Of course you can change things later with a
>> > different SIP and different vote, the point is to focus.
>> >
>> > Use cases - are something that people are going to bring up in
>> > discussion.  If they aren't clearly documented as a goal ("This must
>> > allow me to connect using SSL"), they should be added.
>> >
>> > Internal architecture - if the people who need specific behavior are
>> > implementers of other parts of the system, that's fine.
>> >
>> > Rejected strategies - If you have none of these, you have no evidence
>> > that the proponent didn't just go with the first thing they had in
>> > mind (or have already implemented), which is a big problem currently.
>> > Approval isn't binding as to specifics of implementation, so these
>> > aren't handcuffs.  The goals are the contract, the strategy is
>> > evidence that contract can actually be met.
>> >
>> > Design docs - I'm not touching design docs.  The markdown file I
>> > linked specifically says of the strategy section "This is not a full
>> > design document."  Is this unclear?  Design docs can be worked on
>> > obviously, but that's not what I'm concerned with here.
>> >
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > wrote:
>> >> Hi Cody,
>> >>
>> >> I think this would be a lot more concrete if we had a more detailed
>> >> template
>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
>> >> they
>> >> a way to solicit feedback on the user-facing behavior or on the
>> >> internals?
>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>> >> Product
>> >> Requirements Docs (PRDs), which focus on *what* a code change should do
>> >> as
>> >> opposed to how.
>> >>
>> >> In particular, here are some things that you may or may not consider in
>> >> scope for SIPs:
>> >>
>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >> focus on
>> >> user-visible behavior (e.g. "system supports SQL window functions" or
>> >> "system continues working if one node fails"). BTW I wouldn't say
>> >> "rejected
>> >> goals" because some of them might become goals later, so we're not
>> >> definitively rejecting them.
>> >>
>> >> - Public API: Probably should be included in most SIPs unless it's too
>> >> large
>> >> to fully specify then (e.g. "let's add an ML library").
>> >>
>> >> - Use cases: I usually find this very useful in PRDs to better
>> >> communicate
>> >> the goals.
>> >>
>> >> - Internal architecture: This is usually *not* a thing users can easily
>> >> comment on and it sounds more like a design doc item. Of course it's
>> >> important to show that the SIP is feasible to implement. One exception,
>> >> however, is that I think we'll have some SIPs primarily on internals
>> >> (e.g.
>> >> if somebody wants to refactor Spark's query optimizer or something).
>> >>
>> >> - Rejected strategies: I personally wouldn't put this, because what's
>> >> the
>> >> point of voting to reject a strategy before you've really begun
>> >> designing
>> >> and implementing something? What if you discover that the strategy is
>> >> actually better when you start doing stuff?
>> >>
>> >> At a super high level, it depends on whether you want the SIPs to be
>> >> PRDs
>> >> for getting some quick feedback on the goals of a feature before it is
>> >> designed, or something more like full-fledged design docs (just a more
>> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and
>> >> they
>> >> actually seem to be more like design docs.

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger  wrote:

> Regarding name, if the SIP overlap is a concern, we can pick a different
> name.
>
> My tongue-in-cheek suggestion would be
>
> Spark Lightweight Improvement process (SPARKLI)
>

If others share my minor concern about the SIP name, I propose Spark
Enhancement Proposal (SEP), taking inspiration from the Python Enhancement
Proposal name.

So if we're going to number proposals like other projects do, they'd be
numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.

Another way to avoid a conflict is to stick with "Spark Improvement
Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.

Anyway, it's not a big deal. I just wanted to raise this point.

Nick


Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with here.
> >
> >
> >
> >
> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >> Hi Cody,
> >>
> >> I think this would be a lot more concrete if we had a more detailed 
> >> template
> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
> >> a way to solicit feedback on the user-facing behavior or on the internals?
> >> "Goals" can cover both things. I've been thinking of SIPs more as Product
> >> Requirements Docs (PRDs), which focus on *what* a code change should do as
> >> opposed to how.
> >>
> >> In particular, here are some things that you may or may not consider in
> >> scope for SIPs:
> >>
> >> - Goals and non-goals: This is definitely in scope, and IMO should focus on
> >> user-visible behavior (e.g. "system supports SQL window functions" or
> >> "system continues working if one node fails"). BTW I wouldn't say "rejected
> >> goals" because some of them might become goals later, so we're not
> >> definitively rejecting them.
> >>
> >> - Public API: Probably should be included in most SIPs unless it's too 
> >> large
> >> to fully specify then (e.g. "let's add an ML library").
> >>
> >> - Use cases: I usually find this very useful in PRDs to better communicate
> >> the goals.
> >>
> >> - Internal architecture: This is usually *not* a thing users can easily
> >> comment on and it sounds more like a design doc item. Of course it's
> >> important to show that the SIP is feasible to implement. One exception,
> >> however, is that I think we'll have some SIPs primarily on internals (e.g.
> >> if somebody wants to refactor Spark's query optimizer or something).
> >>
> >> - Rejected strategies: I personally wouldn't put this, because what's the
> >> point of voting to reject a strategy before you've really begun designing
> >> and implementing something? What if you discover that the strategy is
> >> actually better when you start doing stuff?
> >>
> >> At a super high level, it depends on whether you want the SIPs to be PRDs
> >> for getting some quick feedback on the goals of a feature before it is
> >> designed, or something more like full-fledged design docs (just a more
> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> >> actually seem to be more like design docs. This can work too but it does
> >> require more work from the proposer and it can lead to the same problems 
> >> you
> >> mentioned with people already having a design and implementation in mind.
> >>
> >> Basically, the question is, are you trying to iterate faster on design by
> >> adding a step for user feedback earlier? Or are you just trying to make
> >> design docs for key features more visible (and their approval more formal)?
> >>
> >> BTW note that in either case, I'd like to have a template for design docs
> >> too, which should also include goals. I think that would've avoided some of
> >> the issues you brought up.
> >>
> >> Matei
> >>
> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>
> >> Here's my specific proposal (meta-proposal?)
> >>
> >> Spark Improvement Proposals (SIP)
> >>
> >>
> >> Background:
> >>
> >> The current problem is that design and implementation of large features are
> >> often done in private, before soliciting user feedback.
> >>
> >> When feedback is solicited, it is often as to detailed design specifics, 
> >> not
> >> focused on goals.
> >>
> >> When implementation does take place after design, there is often
> >> disagreement as to what goals are or are not in scope.
> >>
> >> This results in commits that don't fully meet user needs.
> >>
> >>
> >

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
>> >> a way to solicit feedback on the user-facing behavior or on the
>> >> internals?
>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>> >> Product
>> >> Requirements Docs (PRDs), which focus on *what* a code change should do
>> >> as
>> >> opposed to how.
>> >>
>> >> In particular, here are some things that you may or may not consider in
>> >> scope for SIPs:
>> >>
>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >> focus on
>> >> user-visible behavior (e.g. "system supports SQL window functions" or
>> >> "system continues working if one node fails"). BTW I wouldn't say
>> >> "rejected
>> >> goals" because some of them might become goals later, so we're not
>> >> definitively rejecting them.
>> >>
>> >> - Public API: Probably should be included in most SIPs unless it's too
>> >> large
>> >> to fully specify then (e.g. "let's add an ML library").
>> >>
>> >> - Use cases: I usually find this very useful in PRDs to better
>> >> communicate
>> >> the goals.
>> >>
>> >> - Internal architecture: This is usually *not* a thing users can easily
>> >> comment on and it sounds more like a design doc item. Of course it's
>> >> important to show that the SIP is feasible to implement. One exception,
>> >> however, is that I think we'll have some SIPs primarily on internals
>> >> (e.g.
>> >> if somebody wants to refactor Spark's query optimizer or something).
>> >>
>> >> - Rejected strategies: I personally wouldn't put this, because what's
>> >> the
>> >> point of voting to reject a strategy before you've really begun
>> >> designing
>> >> and implementing something? What if you discover that the strategy is
>> >> actually better when you start doing stuff?
>> >>
>> >> At a super high level, it depends on whether you want the SIPs to be
>> >> PRDs
>> >> for getting some quick feedback on the goals of a feature before it is
>> >> designed, or something more like full-fledged design docs (just a more
>> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and
>> >> they
>> >> actually seem to be more like design docs. This can work too but it
>> >> does
>> >> require more work from the proposer and it can lead to the same
>> >> problems you
>> >> mentioned with people already having a design and implementation in
>> >> mind.
>> >>
>> >> Basically, the question is, are you trying to iterate faster on design
>> >> by
>> >> adding a step for user feedback earlier? Or are you just trying to make
>> >> design docs for key features more visible (and their approval more
>> >> formal)?
>> >>
>> >> BTW note that in either case, I'd like to have a template for design
>> >> docs
>> >> too, which should also include goals. I think that would've avoided
>> >> some of
>> >> the issues you brought up.
>> >>
>> >> Matei
>> >>
>> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> >>
>> >> Here's my specific proposal (meta-proposal?)
>> >>
>> >> Spark Improvement Proposals (SIP)
>> >>
>> >>
>> >> Background:
>> >>
>> >> The current problem is that design and implementation of large features
>> >> are
>> >> often done in private, before soliciting user feedback.
>> >>
>> >> When feedback is solicited, it is often as to detailed design
>> >> specifics, not
>> >> focused on goals.
>> >>
>> >> When implementation does take place after design, there is often
>> >> disagreement as to what goals are or are not in scope.
>> >>
>> >> This results in commits that don't fully meet user needs.
>> >>
>> >>
>> >> Goals:
>> >>
>> >> - Ensure user, contributor, and committer goals are clearly identified
>> >> and
>> >> agreed upon, before implementation takes place.
>> >>
>> >> - Ensure that a technically feasible strategy is chosen that is likely to meet the goals.

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
supports SQL window functions" or
> >> "system continues working if one node fails"). BTW I wouldn't say
> "rejected
> >> goals" because some of them might become goals later, so we're not
> >> definitively rejecting them.
> >>
> >> - Public API: Probably should be included in most SIPs unless it's too
> large
> >> to fully specify then (e.g. "let's add an ML library").
> >>
> >> - Use cases: I usually find this very useful in PRDs to better
> communicate
> >> the goals.
> >>
> >> - Internal architecture: This is usually *not* a thing users can easily
> >> comment on and it sounds more like a design doc item. Of course it's
> >> important to show that the SIP is feasible to implement. One exception,
> >> however, is that I think we'll have some SIPs primarily on internals
> (e.g.
> >> if somebody wants to refactor Spark's query optimizer or something).
> >>
> >> - Rejected strategies: I personally wouldn't put this, because what's
> the
> >> point of voting to reject a strategy before you've really begun
> designing
> >> and implementing something? What if you discover that the strategy is
> >> actually better when you start doing stuff?
> >>
> >> At a super high level, it depends on whether you want the SIPs to be
> PRDs
> >> for getting some quick feedback on the goals of a feature before it is
> >> designed, or something more like full-fledged design docs (just a more
> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and
> they
> >> actually seem to be more like design docs. This can work too but it does
> >> require more work from the proposer and it can lead to the same
> problems you
> >> mentioned with people already having a design and implementation in
> mind.
> >>
> >> Basically, the question is, are you trying to iterate faster on design
> by
> >> adding a step for user feedback earlier? Or are you just trying to make
> >> design docs for key features more visible (and their approval more
> formal)?
> >>
> >> BTW note that in either case, I'd like to have a template for design
> docs
> >> too, which should also include goals. I think that would've avoided
> some of
> >> the issues you brought up.
> >>
> >> Matei
> >>
> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>
> >> Here's my specific proposal (meta-proposal?)
> >>
> >> Spark Improvement Proposals (SIP)
> >>
> >>
> >> Background:
> >>
> >> The current problem is that design and implementation of large features
> are
> >> often done in private, before soliciting user feedback.
> >>
> >> When feedback is solicited, it is often as to detailed design
> specifics, not
> >> focused on goals.
> >>
> >> When implementation does take place after design, there is often
> >> disagreement as to what goals are or are not in scope.
> >>
> >> This results in commits that don't fully meet user needs.
> >>
> >>
> >> Goals:
> >>
> >> - Ensure user, contributor, and committer goals are clearly identified
> and
> >> agreed upon, before implementation takes place.
> >>
> >> - Ensure that a technically feasible strategy is chosen that is likely
> to
> >> meet the goals.
> >>
> >>
> >> Rejected Goals:
> >>
> >> - SIPs are not for detailed design.  Design by committee doesn't work.
> >>
> >> - SIPs are not for every change.  We don't need that much process.
> >>
> >>
> >> Strategy:
> >>
> >> My suggestion is outlined as a Spark Improvement Proposal process
> documented
> >> at
> >>
> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >>
> >> Specifics of Jira manipulation are an implementation detail we can
> figure
> >> out.
> >>
> >> I'm suggesting voting; the need here is for a _clear_ outcome.
> >>
> >>
> >> Rejected Strategies:
> >>
> >> Having someone who understands the problem implement it first works, but
> >> only if significant iteration after user feedback is allowed.
> >>
> >> Historically this has been problematic due to pressure to limit public
> api
> >> changes.
> >>
> >>

Re: Spark Improvement Proposals

2016-10-09 Thread Ofir Manor
> >> "system continues working if one node fails"). BTW I wouldn't say
> "rejected
> >> goals" because some of them might become goals later, so we're not
> >> definitively rejecting them.
> >>
> >> - Public API: Probably should be included in most SIPs unless it's too
> large
> >> to fully specify then (e.g. "let's add an ML library").
> >>
> >> - Use cases: I usually find this very useful in PRDs to better
> communicate
> >> the goals.
> >>
> >> - Internal architecture: This is usually *not* a thing users can easily
> >> comment on and it sounds more like a design doc item. Of course it's
> >> important to show that the SIP is feasible to implement. One exception,
> >> however, is that I think we'll have some SIPs primarily on internals
> (e.g.
> >> if somebody wants to refactor Spark's query optimizer or something).
> >>
> >> - Rejected strategies: I personally wouldn't put this, because what's
> the
> >> point of voting to reject a strategy before you've really begun
> designing
> >> and implementing something? What if you discover that the strategy is
> >> actually better when you start doing stuff?
> >>
> >> At a super high level, it depends on whether you want the SIPs to be
> PRDs
> >> for getting some quick feedback on the goals of a feature before it is
> >> designed, or something more like full-fledged design docs (just a more
> >> visible design doc for bigger changes). I looked at Kafka's KIPs, and
> they
> >> actually seem to be more like design docs. This can work too but it does
> >> require more work from the proposer and it can lead to the same
> problems you
> >> mentioned with people already having a design and implementation in
> mind.
> >>
> >> Basically, the question is, are you trying to iterate faster on design
> by
> >> adding a step for user feedback earlier? Or are you just trying to make
> >> design docs for key features more visible (and their approval more
> formal)?
> >>
> >> BTW note that in either case, I'd like to have a template for design
> docs
> >> too, which should also include goals. I think that would've avoided
> some of
> >> the issues you brought up.
> >>
> >> Matei
> >>
> >> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>
> >> Here's my specific proposal (meta-proposal?)
> >>
> >> Spark Improvement Proposals (SIP)
> >>
> >>
> >> Background:
> >>
> >> The current problem is that design and implementation of large features
> are
> >> often done in private, before soliciting user feedback.
> >>
> >> When feedback is solicited, it is often as to detailed design
> specifics, not
> >> focused on goals.
> >>
> >> When implementation does take place after design, there is often
> >> disagreement as to what goals are or are not in scope.
> >>
> >> This results in commits that don't fully meet user needs.
> >>
> >>
> >> Goals:
> >>
> >> - Ensure user, contributor, and committer goals are clearly identified
> and
> >> agreed upon, before implementation takes place.
> >>
> >> - Ensure that a technically feasible strategy is chosen that is likely
> to
> >> meet the goals.
> >>
> >>
> >> Rejected Goals:
> >>
> >> - SIPs are not for detailed design.  Design by committee doesn't work.
> >>
> >> - SIPs are not for every change.  We don't need that much process.
> >>
> >>
> >> Strategy:
> >>
> >> My suggestion is outlined as a Spark Improvement Proposal process
> documented
> >> at
> >>
> >> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >>
> >> Specifics of Jira manipulation are an implementation detail we can
> figure
> >> out.
> >>
> >> I'm suggesting voting; the need here is for a _clear_ outcome.
> >>
> >>
> >> Rejected Strategies:
> >>
> >> Having someone who understands the problem implement it first works, but
> >> only if significant iteration after user feedback is allowed.
> >>
> >> Historically this has been problematic due to pressure to limit public
> api
> >> changes.
> >>
> >>
> >> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >>>
> >>> Alright looks like there are quite a bit of support.

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
>> What if you discover that the strategy is actually better when you start doing stuff?
>> 
>> At a super high level, it depends on whether you want the SIPs to be PRDs
>> for getting some quick feedback on the goals of a feature before it is
>> designed, or something more like full-fledged design docs (just a more
>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>> actually seem to be more like design docs. This can work too but it does
>> require more work from the proposer and it can lead to the same problems you
>> mentioned with people already having a design and implementation in mind.
>> 
>> Basically, the question is, are you trying to iterate faster on design by
>> adding a step for user feedback earlier? Or are you just trying to make
>> design docs for key features more visible (and their approval more formal)?
>> 
>> BTW note that in either case, I'd like to have a template for design docs
>> too, which should also include goals. I think that would've avoided some of
>> the issues you brought up.
>> 
>> Matei
>> 
>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> 
>> Here's my specific proposal (meta-proposal?)
>> 
>> Spark Improvement Proposals (SIP)
>> 
>> 
>> Background:
>> 
>> The current problem is that design and implementation of large features are
>> often done in private, before soliciting user feedback.
>> 
>> When feedback is solicited, it is often as to detailed design specifics, not
>> focused on goals.
>> 
>> When implementation does take place after design, there is often
>> disagreement as to what goals are or are not in scope.
>> 
>> This results in commits that don't fully meet user needs.
>> 
>> 
>> Goals:
>> 
>> - Ensure user, contributor, and committer goals are clearly identified and
>> agreed upon, before implementation takes place.
>> 
>> - Ensure that a technically feasible strategy is chosen that is likely to
>> meet the goals.
>> 
>> 
>> Rejected Goals:
>> 
>> - SIPs are not for detailed design.  Design by committee doesn't work.
>> 
>> - SIPs are not for every change.  We don't need that much process.
>> 
>> 
>> Strategy:
>> 
>> My suggestion is outlined as a Spark Improvement Proposal process documented
>> at
>> 
>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> 
>> Specifics of Jira manipulation are an implementation detail we can figure
>> out.
>> 
>> I'm suggesting voting; the need here is for a _clear_ outcome.
>> 
>> 
>> Rejected Strategies:
>> 
>> Having someone who understands the problem implement it first works, but
>> only if significant iteration after user feedback is allowed.
>> 
>> Historically this has been problematic due to pressure to limit public api
>> changes.
>> 
>> 
>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com> wrote:
>>> 
>>> Alright looks like there are quite a bit of support. We should wait to
>>> hear from more people too.
>>> 
>>> To push this forward, Cody and I will be working together in the next
>>> couple of weeks to come up with a concrete, detailed proposal on what this
>>> entails, and then we can discuss this the specific proposal as well.
>>> 
>>> 
>>> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>> 
>>>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>>>> user-facing or cross-cutting changes, not minor feature adds.
>>>> 
>>>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos
>>>> <stavros.kontopou...@lightbend.com> wrote:
>>>>> 
>>>>> +1 to the SIP label as long as it does not slow down things and it
>>>>> targets optimizing efforts, coordination, etc. For example, really small
>>>>> features should not need to go through this process (assuming they don't
>>>>> touch public interfaces), nor should refactorings, and I hope it will be
>>>>> kept this way. So, as a guideline, a doc should be provided, like in the KIP case.
>>>>> 
>>>>> IMHO, aside from tagging things and linking them elsewhere, simply
>>>>> having design docs and prototype implementations in PRs is not something
>>>>> that has worked so far. What is really a pain in many projects out
>>>>> there
>>>>> is discontinuity in progress of PRs, missing features, slow re

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Regarding name, if the SIP overlap is a concern, we can pick a different name.
My tongue-in-cheek suggestion would be
Spark Lightweight Improvement process (SPARKLI)

On Sun, Oct 9, 2016 at 4:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
> So to focus the discussion on the specific strategy I'm suggesting,
> documented at
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> "Goals: What must this allow people to do, that they can't currently?"
>
> Is it unclear that this is focusing specifically on people-visible behavior?
>
> Rejected goals -  are important because otherwise people keep trying
> to argue about scope.  Of course you can change things later with a
> different SIP and different vote, the point is to focus.
>
> Use cases - are something that people are going to bring up in
> discussion.  If they aren't clearly documented as a goal ("This must
> allow me to connect using SSL"), they should be added.
>
> Internal architecture - if the people who need specific behavior are
> implementers of other parts of the system, that's fine.
>
> Rejected strategies - If you have none of these, you have no evidence
> that the proponent didn't just go with the first thing they had in
> mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these
> aren't handcuffs.  The goals are the contract, the strategy is
> evidence that contract can actually be met.
>
> Design docs - I'm not touching design docs.  The markdown file I
> linked specifically says of the strategy section "This is not a full
> design document."  Is this unclear?  Design docs can be worked on
> obviously, but that's not what I'm concerned with here.
>
>
>
>
> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Hi Cody,
>>
>> I think this would be a lot more concrete if we had a more detailed template
>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
>> a way to solicit feedback on the user-facing behavior or on the internals?
>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>> opposed to how.
>>
>> In particular, here are some things that you may or may not consider in
>> scope for SIPs:
>>
>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>> user-visible behavior (e.g. "system supports SQL window functions" or
>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>> goals" because some of them might become goals later, so we're not
>> definitively rejecting them.
>>
>> - Public API: Probably should be included in most SIPs unless it's too large
>> to fully specify then (e.g. "let's add an ML library").
>>
>> - Use cases: I usually find this very useful in PRDs to better communicate
>> the goals.
>>
>> - Internal architecture: This is usually *not* a thing users can easily
>> comment on and it sounds more like a design doc item. Of course it's
>> important to show that the SIP is feasible to implement. One exception,
>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>> if somebody wants to refactor Spark's query optimizer or something).
>>
>> - Rejected strategies: I personally wouldn't put this, because what's the
>> point of voting to reject a strategy before you've really begun designing
>> and implementing something? What if you discover that the strategy is
>> actually better when you start doing stuff?
>>
>> At a super high level, it depends on whether you want the SIPs to be PRDs
>> for getting some quick feedback on the goals of a feature before it is
>> designed, or something more like full-fledged design docs (just a more
>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>> actually seem to be more like design docs. This can work too but it does
>> require more work from the proposer and it can lead to the same problems you
>> mentioned with people already having a design and implementation in mind.
>>
>> Basically, the question is, are you trying to iterate faster on design by
>> adding a step for user feedback earlier? Or are you just trying to make
>> design docs for key features more visible (and their approval more formal)?
>>
>> BTW note that in either case, I'd like to have a template for design docs
>> too, which should also include goals. I think that would've avoided some of the issues you brought up.

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
So to focus the discussion on the specific strategy I'm suggesting,
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

"Goals: What must this allow people to do, that they can't currently?"

Is it unclear that this is focusing specifically on people-visible behavior?

Rejected goals -  are important because otherwise people keep trying
to argue about scope.  Of course you can change things later with a
different SIP and different vote, the point is to focus.

Use cases - are something that people are going to bring up in
discussion.  If they aren't clearly documented as a goal ("This must
allow me to connect using SSL"), they should be added.

Internal architecture - if the people who need specific behavior are
implementers of other parts of the system, that's fine.

Rejected strategies - If you have none of these, you have no evidence
that the proponent didn't just go with the first thing they had in
mind (or have already implemented), which is a big problem currently.
Approval isn't binding as to specifics of implementation, so these
aren't handcuffs.  The goals are the contract, the strategy is
evidence that contract can actually be met.

Design docs - I'm not touching design docs.  The markdown file I
linked specifically says of the strategy section "This is not a full
design document."  Is this unclear?  Design docs can be worked on
obviously, but that's not what I'm concerned with here.




On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed template
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
> a way to solicit feedback on the user-facing behavior or on the internals?
> "Goals" can cover both things. I've been thinking of SIPs more as Product
> Requirements Docs (PRDs), which focus on *what* a code change should do as
> opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus on
> user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too large
> to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you discover that the strategy is
> actually better when you start doing stuff?
>
> At a super high level, it depends on whether you want the SIPs to be PRDs
> for getting some quick feedback on the goals of a feature before it is
> designed, or something more like full-fledged design docs (just a more
> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> actually seem to be more like design docs. This can work too but it does
> require more work from the proposer and it can lead to the same problems you
> mentioned with people already having a design and implementation in mind.
>
> Basically, the question is, are you trying to iterate faster on design by
> adding a step for user feedback earlier? Or are you just trying to make
> design docs for key features more visible (and their approval more formal)?
>
> BTW note that in either case, I'd like to have a template for design docs
> too, which should also include goals. I think that would've avoided some of
> the issues you brought up.
>
> Matei
>
> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
> Here's my specific proposal (meta-proposal?)
>
> Spark Improvement Proposals (SIP)
>
>
> Background:
>
> The current problem is that design and implementation of large features are
> often done in private, before soliciting user feedback.
>
> When feedback is solicited, it is often as to detailed design specifics, not
> focused on goals.
>
> When implementation does take place after design, there is often disagreement as to what goals are or are not in scope.

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
If there's confusion there, the document is specifically what I'm
proposing.  The email is just by way of introduction.

On Sun, Oct 9, 2016 at 3:47 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> Oh, hmm… I guess I’m a little confused on the relation between Cody’s
> email and the document he linked to, which says:
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md#when
>
> SIPs should be used for significant user-facing or cross-cutting changes,
> not day-to-day improvements. When in doubt, if a committer thinks a change
> needs an SIP, it does.
>
> Nick
>
>
> On Sun, Oct 9, 2016 at 4:40 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> Yup, but the example you gave is for alternatives about *user-facing
>> behavior*, not implementation. The current SIP doc describes "strategy"
>> more as implementation strategy. I'm just saying there are different
>> possible goals for these types of docs.
>>
>> BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but
>> also require a reference implementation. This is a bit different from what
>> Cody had in mind, I think.
>>
>>
>> Matei
>>
>> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
>> wrote:
>>
>>
>>- Rejected strategies: I personally wouldn’t put this, because what’s
>>the point of voting to reject a strategy before you’ve really begun
>>designing and implementing something? What if you discover that the
>>strategy is actually better when you start doing stuff?
>>
>> I would guess the point is to document alternatives that were discussed
>> and rejected, so that later on people can be pointed to that discussion and
>> the devs don’t have to repeat themselves unnecessarily every time someone
>> comes along and asks “Why didn’t you do this other thing?” That doesn’t
>> mean a rejected proposal can’t later be revisited and the SIP can’t be
>> updated.
>>
>> For reference from the Python community, PEP 492
>> <https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement
>> Proposal for adding async and await syntax and “first-class” coroutines
>> to Python, has a section on rejected ideas
>> <https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new
>> syntax. It captures a summary of what the devs discussed, but it doesn’t
>> mean the PEP can’t be updated and a previously rejected proposal can’t be
>> revived.
>>
>> At least in the Python community, a PEP serves not just as a formal
>> starting point for a proposal (the “real” starting point is usually a
>> discussion on python-ideas or python-dev), but also as documentation of
>> what was agreed on and a living “spec” of sorts. So PEPs sometimes get
>> updated years after they are approved when revisions are agreed upon. PEPs
>> are also intended for wide consumption, vs. bug tracker issues which the
>> broader Python dev community are not expected to follow closely.
>>
>> Dunno if we want to follow a similar pattern for Spark, since the
>> project’s needs are different. But the Python community has used PEPs to
>> help organize and steer development since 2000; there are plenty of
>> examples there we can probably take inspiration from.
>>
>> By the way, can we call these things something other than Spark
>> Improvement Proposals? The acronym, SIP, conflicts with Scala SIPs
>> <http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark
>> communities have a lot of overlap, we don’t want, for example, names like
>> “SIP-10” to have an ambiguous meaning.
>>
>> Nick
>>
>>
>> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> Hi Cody,
>>>
>>> I think this would be a lot more concrete if we had a more detailed
>>> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
>>> are they a way to solicit feedback on the user-facing behavior or on the
>>> internals? "Goals" can cover both things. I've been thinking of SIPs more
>>> as Product Requirements Docs (PRDs), which focus on *what* a code change
>>> should do as opposed to how.
>>>
>>> In particular, here are some things that you may or may not consider in
>>> scope for SIPs:
>>>
>>> - Goals and non-goals: This is definitely in scope, and IMO should focus
>>> on user-visible behavior (e.g. "system supports SQL window functions" or
>>> "system 

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
Oh, hmm… I guess I’m a little confused on the relation between Cody’s email
and the document he linked to, which says:

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md#when

SIPs should be used for significant user-facing or cross-cutting changes,
not day-to-day improvements. When in doubt, if a committer thinks a change
needs an SIP, it does.

Nick

On Sun, Oct 9, 2016 at 4:40 PM Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Yup, but the example you gave is for alternatives about *user-facing
> behavior*, not implementation. The current SIP doc describes "strategy"
> more as implementation strategy. I'm just saying there are different
> possible goals for these types of docs.
>
> BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also
> require a reference implementation. This is a bit different from what Cody
> had in mind, I think.
>
>
> Matei
>
> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>
>- Rejected strategies: I personally wouldn’t put this, because what’s
>the point of voting to reject a strategy before you’ve really begun
>designing and implementing something? What if you discover that the
>strategy is actually better when you start doing stuff?
>
> I would guess the point is to document alternatives that were discussed
> and rejected, so that later on people can be pointed to that discussion and
> the devs don’t have to repeat themselves unnecessarily every time someone
> comes along and asks “Why didn’t you do this other thing?” That doesn’t
> mean a rejected proposal can’t later be revisited and the SIP can’t be
> updated.
>
> For reference from the Python community, PEP 492
> <https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement
> Proposal for adding async and await syntax and “first-class” coroutines
> to Python, has a section on rejected ideas
> <https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new
> syntax. It captures a summary of what the devs discussed, but it doesn’t
> mean the PEP can’t be updated and a previously rejected proposal can’t be
> revived.
>
> At least in the Python community, a PEP serves not just as a formal starting
> point for a proposal (the “real” starting point is usually a discussion on
> python-ideas or python-dev), but also as documentation of what was agreed
> on and a living “spec” of sorts. So PEPs sometimes get updated years after
> they are approved when revisions are agreed upon. PEPs are also intended
> for wide consumption, vs. bug tracker issues which the broader Python dev
> community are not expected to follow closely.
>
> Dunno if we want to follow a similar pattern for Spark, since the
> project’s needs are different. But the Python community has used PEPs to
> help organize and steer development since 2000; there are plenty of
> examples there we can probably take inspiration from.
>
> By the way, can we call these things something other than Spark
> Improvement Proposals? The acronym, SIP, conflicts with Scala SIPs
> <http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark
> communities have a lot of overlap, we don’t want, for example, names like
> “SIP-10” to have an ambiguous meaning.
>
> Nick
>
>
> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed
> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
> are they a way to solicit feedback on the user-facing behavior or on the
> internals? "Goals" can cover both things. I've been thinking of SIPs more
> as Product Requirements Docs (PRDs), which focus on *what* a code change
> should do as opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus
> on user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too
> large to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement.

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Yup, but the example you gave is for alternatives about *user-facing behavior*, 
not implementation. The current SIP doc describes "strategy" more as 
implementation strategy. I'm just saying there are different possible goals for 
these types of docs.

BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also 
require a reference implementation. This is a bit different from what Cody had 
in mind, I think.

Matei

> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> Rejected strategies: I personally wouldn’t put this, because what’s the point 
> of voting to reject a strategy before you’ve really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
> I would guess the point is to document alternatives that were discussed and 
> rejected, so that later on people can be pointed to that discussion and the 
> devs don’t have to repeat themselves unnecessarily every time someone comes 
> along and asks “Why didn’t you do this other thing?” That doesn’t mean a 
> rejected proposal can’t later be revisited and the SIP can’t be updated.
> 
> For reference from the Python community, PEP 492 
> <https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement Proposal 
> for adding async and await syntax and “first-class” coroutines to Python, has 
> a section on rejected ideas 
> <https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new syntax. 
> It captures a summary of what the devs discussed, but it doesn’t mean the PEP 
> can’t be updated and a previously rejected proposal can’t be revived.
> 
> At least in the Python community, a PEP serves not just as a formal starting
> point for a proposal (the “real” starting point is usually a discussion on 
> python-ideas or python-dev), but also as documentation of what was agreed on 
> and a living “spec” of sorts. So PEPs sometimes get updated years after they 
> are approved when revisions are agreed upon. PEPs are also intended for wide 
> consumption, vs. bug tracker issues which the broader Python dev community 
> are not expected to follow closely.
> 
> Dunno if we want to follow a similar pattern for Spark, since the project’s 
> needs are different. But the Python community has used PEPs to help organize 
> and steer development since 2000; there are plenty of examples there we can 
> probably take inspiration from.
> 
> By the way, can we call these things something other than Spark Improvement 
> Proposals? The acronym, SIP, conflicts with Scala SIPs 
> <http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark 
> communities have a lot of overlap, we don’t want, for example, names like 
> “SIP-10” to have an ambiguous meaning.
> 
> Nick
> 
> 
> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi Cody,
> 
> I think this would be a lot more concrete if we had a more detailed template 
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they a
> way to solicit feedback on the user-facing behavior or on the internals? 
> "Goals" can cover both things. I've been thinking of SIPs more as Product 
> Requirements Docs (PRDs), which focus on *what* a code change should do as 
> opposed to how.
> 
> In particular, here are some things that you may or may not consider in scope 
> for SIPs:
> 
> - Goals and non-goals: This is definitely in scope, and IMO should focus on 
> user-visible behavior (e.g. "system supports SQL window functions" or "system 
> continues working if one node fails"). BTW I wouldn't say "rejected goals" 
> because some of them might become goals later, so we're not definitively 
> rejecting them.
> 
> - Public API: Probably should be included in most SIPs unless it's too large 
> to fully specify then (e.g. "let's add an ML library").
> 
> - Use cases: I usually find this very useful in PRDs to better communicate 
> the goals.
> 
> - Internal architecture: This is usually *not* a thing users can easily 
> comment on and it sounds more like a design doc item. Of course it's 
> important to show that the SIP is feasible to implement. One exception, 
> however, is that I think we'll have some SIPs primarily on internals (e.g. if 
> somebody wants to refactor Spark's query optimizer or something).
> 
> - Rejected strategies: I personally wouldn't put this, because what's the 
> point of voting to reject a strategy before you've really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
>

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
   - Rejected strategies: I personally wouldn’t put this, because what’s
   the point of voting to reject a strategy before you’ve really begun
   designing and implementing something? What if you discover that the
   strategy is actually better when you start doing stuff?

I would guess the point is to document alternatives that were discussed and
rejected, so that later on people can be pointed to that discussion and the
devs don’t have to repeat themselves unnecessarily every time someone comes
along and asks “Why didn’t you do this other thing?” That doesn’t mean a
rejected proposal can’t later be revisited and the SIP can’t be updated.

For reference from the Python community, PEP 492
<https://www.python.org/dev/peps/pep-0492/>, a Python Enhancement Proposal
for adding async and await syntax and “first-class” coroutines to Python,
has a section on rejected ideas
<https://www.python.org/dev/peps/pep-0492/#why-async-def> for the new
syntax. It captures a summary of what the devs discussed, but it doesn’t
mean the PEP can’t be updated and a previously rejected proposal can’t be
revived.

At least in the Python community, a PEP serves not just as formal starting
point for a proposal (the “real” starting point is usually a discussion on
python-ideas or python-dev), but also as documentation of what was agreed
on and a living “spec” of sorts. So PEPs sometimes get updated years after
they are approved when revisions are agreed upon. PEPs are also intended
for wide consumption, vs. bug tracker issues which the broader Python dev
community are not expected to follow closely.

Dunno if we want to follow a similar pattern for Spark, since the project’s
needs are different. But the Python community has used PEPs to help
organize and steer development since 2000; there are plenty of examples
there we can probably take inspiration from.

By the way, can we call these things something other than Spark Improvement
Proposals? The acronym, SIP, conflicts with Scala SIPs
<http://docs.scala-lang.org/sips/index.html>. Since the Scala and Spark
communities have a lot of overlap, we don’t want, for example, names like
“SIP-10” to have an ambiguous meaning.

Nick

On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed
> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
> are they a way to solicit feedback on the user-facing behavior or on the
> internals? "Goals" can cover both things. I've been thinking of SIPs more
> as Product Requirements Docs (PRDs), which focus on *what* a code change
> should do as opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus
> on user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too
> large to fully specify at that point (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you discover that the strategy is
> actually better when you start doing stuff?
>
> At a super high level, it depends on whether you want the SIPs to be PRDs
> for getting some quick feedback on the goals of a feature before it is
> designed, or something more like full-fledged design docs (just a more
> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> actually seem to be more like design docs. This can work too but it does
> require more work from the proposer and it can lead to the same problems
> you mentioned with people already having a design and implementation in
> mind.
>
> Basically, the question is, are you trying to iterate faster on design by
> adding a step for user feedback earlier? Or are you just trying to make
> design docs for key features more visible (and their approval more formal)?
>
> BTW note that in eit

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Here's my specific proposal (meta-proposal?)

Spark Improvement Proposals (SIP)


Background:

The current problem is that design and implementation of large features are
often done in private, before soliciting user feedback.

When feedback is solicited, it is often about detailed design specifics,
not focused on goals.

When implementation does take place after design, there is often
disagreement as to what goals are or are not in scope.

This results in commits that don't fully meet user needs.


Goals:

- Ensure user, contributor, and committer goals are clearly identified and
agreed upon, before implementation takes place.

- Ensure that a technically feasible strategy is chosen that is likely to
meet the goals.


Rejected Goals:

- SIPs are not for detailed design.  Design by committee doesn't work.

- SIPs are not for every change.  We don't need that much process.


Strategy:

My suggestion is outlined as a Spark Improvement Proposal process
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Specifics of Jira manipulation are an implementation detail we can figure
out.
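
One such detail could be as small as a label plus a saved filter. A
hypothetical sketch (the issue key, credentials, and label name below are
placeholders for illustration, not part of the proposal):

    import requests

    # Tag an existing JIRA as a SIP by adding a label via the JIRA REST API.
    issue = "SPARK-12345"  # placeholder issue key
    url = f"https://issues.apache.org/jira/rest/api/2/issue/{issue}"
    payload = {"update": {"labels": [{"add": "SIP"}]}}
    requests.put(url, json=payload, auth=("user", "password"))  # placeholder auth

    # A saved filter such as `labels = SIP ORDER BY created DESC` could then
    # be linked from spark.apache.org so SIPs are visible to non-contributors.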

I'm suggesting voting; the need here is for a _clear_ outcome.


Rejected Strategies:

Having someone who understands the problem implement it first works, but
only if significant iteration after user feedback is allowed.

Historically this has been problematic due to pressure to limit public api
changes.

On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com> wrote:

> Alright, looks like there is quite a bit of support. We should wait to
> hear from more people too.
>
> To push this forward, Cody and I will be working together in the next
> couple of weeks to come up with a concrete, detailed proposal on what this
> entails, and then we can discuss the specific proposal as well.
>
>
> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> user-facing or cross-cutting changes, not minor feature adds.
>>
>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> +1 to the SIP label as long as it does not slow things down and it
>>> targets optimizing efforts, coordination, etc. For example, really small
>>> features (assuming they don't touch public interfaces) or re-factorings
>>> should not need to go through this process, and I hope it will be kept
>>> this way. So a guideline doc should be provided, like in the KIP case.
>>>
>>> IMHO, aside from tagging things and linking them elsewhere, simply
>>> having design docs and prototype implementations in PRs is not something
>>> that has worked so far. What is really a pain in many projects out there
>>> is discontinuity in the progress of PRs, missing features, and slow
>>> reviews, which is understandable to some extent... it is not only about
>>> Spark, but things can be improved for sure for this project in
>>> particular, as already stated.
>>>
>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> +1 to adding an SIP label and linking it from the website.  I think it
>>>> needs
>>>>
>>>> - template that focuses it towards soliciting user goals / non goals
>>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>>>> recommend a vote.
>>>>
>>>> Matei asked me to clarify what I meant by changing interfaces, I think
>>>> it's directly relevant to the SIP idea so I'll clarify here, and split
>>>> a thread for the other discussion per Nicholas' request.
>>>>
>>>> I meant changing public user interfaces.  I think the first design is
>>>> unlikely to be right, because it's done at a time when you have the
>>>> least information.  As a user, I find it considerably more frustrating
>>>> to be unable to use a tool to get my job done, than I do having to
>>>> make minor changes to my code in order to take advantage of features.
>>>> I've seen committers be seriously reluctant to allow changes to
>>>> @experimental code that are needed in order for it to really work
>>>> right.  You need to be able to iterate, and if people on both sides of
>>>> the fence aren't going to respect that some newer apis are subject to
>>>> change, then why even mark them as such?
>>>>
>>>> Ideally a finished SIP should give me a checklist of things that an
>>>> implementation must do, and things that it doesn't need to do.
>>

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
It's not about technical design disagreement as to matters of taste,
it's about familiarity with the domain.  To make an analogy, it's as
if a committer in MLlib was firmly intent on, I dunno, treating a
collection of categorical variables as if it were an ordered range of
continuous variables.  It's just wrong.  That kind of thing, to a
greater or lesser degree, has been going on related to the Kafka
modules, for years.
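
To make the analogy concrete, here is a minimal Python sketch (a toy
illustration with made-up data, not code from any Spark module) of why
treating categorical codes as ordered, continuous values is wrong:

    # Label-encoding an unordered category imposes a spurious order.
    colors = ["red", "green", "blue"]
    codes = {c: i for i, c in enumerate(colors)}  # red=0, green=1, blue=2

    # Read as continuous values, the encoding claims red is closer to green
    # than to blue, and that the average of red and blue is green -- none of
    # which means anything for an unordered category.
    print(abs(codes["red"] - codes["green"]))   # 1
    print(abs(codes["red"] - codes["blue"]))    # 2
    print((codes["red"] + codes["blue"]) / 2)   # 1.0 == codes["green"]

    # A one-hot encoding keeps all categories equally distinct.
    one_hot = {c: [1 if i == j else 0 for j in range(len(colors))]
               for i, c in enumerate(colors)}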

On Sat, Oct 8, 2016 at 4:11 PM, Matei Zaharia  wrote:
> This makes a lot of sense; just to comment on a few things:
>
>> - More committers
>> Just looking at the ratio of committers to open tickets, or committers
>> to contributors, I don't think you have enough human power.
>> I realize this is a touchy issue.  I don't have a dog in this fight,
>> because I'm not on either coast nor in a big company that views
>> committership as a political thing.  I just think you need more people
>> to do the work, and more diversity of viewpoint.
>> It's unfortunate that the Apache governance process involves giving
>> someone all the keys or none of the keys, but until someone really
>> starts screwing up, I think it's better to err on the side of
>> accepting hard-working people.
>
> This is something the PMC is actively discussing. Historically, we've added 
> committers when people contributed a new module or feature, basically to the 
> point where other developers are asking them to review changes in that area 
> (https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-BecomingaCommitter).
>  For example, we added the original authors of GraphX when we merged in 
> GraphX, the authors of new ML algorithms, etc. However, there's a good 
> argument that some areas are simply not covered well now and we should add 
> people there. Also, as the project has grown, there are also more people who 
> focus on smaller fixes and are nonetheless contributing a lot.
>
>> - Each major area of the code needs at least one person who cares
>> about it that is empowered with a vote, otherwise decisions get made
>> that don't make technical sense.
>> I don't know if anyone with a vote is shepherding GraphX (or maybe
>> it's just dead), the Mesos relationship has always been weird, no one
>> with a vote really groks Kafka.
>> marmbrus and zsxwing are getting there quickly on the Kafka side, and
>> I appreciate it, but it's been bad for a while.
>> Because I don't have any political power, my response to seeing things
>> that I know are technically dangerous has been to yell really loud
>> until someone listens, which sucks for everyone involved.
>> I already apologized to Michael privately; Ryan, I'm sorry, it's not about 
>> you.
>> This seems pretty straightforward to fix, if politically awkward:
>> those people exist, just give them a vote.
>> Failing that, listen the first or second time they say something not
>> the third or fourth, and if it doesn't make sense, ask.
>
> Just as a note here -- it's true that some areas are not super well covered, 
> but I also hope to avoid a situation where people have to yell to be listened 
> to. I can't say anything about *all* technical discussions we've ever had, 
> but historically, people have been able to comment on the design of many 
> things without yelling. This is actually important because a culture of 
> having to yell can drive away contributors. So it's awesome that you yelled 
> about the Kafka source stuff, but at the same time, hopefully we make these 
> types of things work without yelling. This would be a problem even if there 
> were committers with more expertise in each area -- what if someone disagrees 
> with the committers?
>
> Matei
>




Re: Spark Improvement Proposals

2016-10-08 Thread vaquar khan
+1 for SIP labels; waiting for Reynold's detailed proposal.

Regards,
Vaquar khan

On 8 Oct 2016 16:22, "Matei Zaharia"  wrote:

> Sounds good. Just to comment on the compatibility part:
>
> > I meant changing public user interfaces.  I think the first design is
> > unlikely to be right, because it's done at a time when you have the
> > least information.  As a user, I find it considerably more frustrating
> > to be unable to use a tool to get my job done, than I do having to
> > make minor changes to my code in order to take advantage of features.
> > I've seen committers be seriously reluctant to allow changes to
> > @experimental code that are needed in order for it to really work
> > right.  You need to be able to iterate, and if people on both sides of
> > the fence aren't going to respect that some newer apis are subject to
> > change, then why even mark them as such?
> >
> > Ideally a finished SIP should give me a checklist of things that an
> > implementation must do, and things that it doesn't need to do.
> > Contributors/committers should be seriously discouraged from putting
> > out a version 0.1 that doesn't have at least a prototype
> > implementation of all those things, especially if they're then going
> > to argue against interface changes necessary to get the rest of
> > the things done in the 0.2 version.
>
> Experimental APIs and alpha components are indeed supposed to be
> changeable (https://cwiki.apache.org/confluence/display/SPARK/
> Spark+Versioning+Policy). Maybe people are being too conservative in some
> cases, but I do want to note that regardless of what precise policy we try
> to write down, this type of issue will ultimately be a judgment call. Is it
> worth making a small cosmetic change in an API that's marked experimental,
> but has been used widely for a year? Perhaps not. Is it worth making it in
> something one month old, or even in an older API as we move to 2.0? Maybe
> yes. I think we should just discuss each one (start an email thread if
> resolving it on JIRA is too complex) and perhaps be more religious about
> making things non-experimental when we think they're done.
>
> Matei
>
>
> >
> >
> > On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
> >> I like the lightweight proposal to add a SIP label.
> >>
> >> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
> >> track the list of major changes, but that never really materialized due
> to
> >> the overhead. Adding a SIP label on major JIRAs and then link to them
> >> prominently on the Spark website makes a lot of sense.
> >>
> >>
> >> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaha...@gmail.com>
> >> wrote:
> >>>
> >>> For the improvement proposals, I think one major point was to make them
> >>> really visible to users who are not contributors, so we should do more
> than
> >>> sending stuff to dev@. One very lightweight idea is to have a new
> type of
> >>> JIRA called a SIP and have a link to a filter that shows all such
> JIRAs from
> >>> http://spark.apache.org. I also like the idea of SIP and design doc
> >>> templates (in fact many projects have them).
> >>>
> >>> Matei
> >>>
> >>> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
> >>>
> >>> I called Cody last night and talked about some of the topics in his
> email.
> >>> It became clear to me Cody genuinely cares about the project.
> >>>
> >>> Some of the frustrations come from the success of the project itself
> >>> becoming very "hot", and it is difficult to get clarity from people who
> >>> don't dedicate all their time to Spark. In fact, it is in some ways
> similar
> >>> to scaling an engineering team in a successful startup: old processes
> that
> >>> worked well might not work so well when it gets to a certain size,
> cultures
> >>> can get diluted, building culture vs building process, etc.
> >>>
> >>> I'd also really like to have a more visible process for larger changes,
> >>> especially major user-facing API changes. Historically we upload design
> >>> docs for major changes, but this is not always consistent, and it is hard
> >>> to ensure the quality of the docs, due to the volunteer nature of the
> >>> organization.
> >>>
> >>> Some of the more concrete ideas we discussed focus on building a
> culture
> >>> to improve clarity:
> >>>
> >>> - Process: Large changes should have design docs posted on JIRA. One
> thing
> >>> Cody and I didn't discuss but an idea that just came to me is we should
> >>> create a design doc template for the project and ask everybody to
> follow.
> >>> The design doc template should also explicitly list goals and
> non-goals, to
> >>> make design doc more consistent.
> >>>
> >>> - Process: Email dev@ to solicit feedback. We have done this with some
> >>> changes, but again very inconsistently. Just posting something on JIRA
> >>> isn't sufficient, because there are simply too many JIRAs and the signal
> >>> gets lost in the 

Re: Spark Improvement Proposals

2016-10-08 Thread Matei Zaharia
Sounds good. Just to comment on the compatibility part:

> I meant changing public user interfaces.  I think the first design is
> unlikely to be right, because it's done at a time when you have the
> least information.  As a user, I find it considerably more frustrating
> to be unable to use a tool to get my job done, than I do having to
> make minor changes to my code in order to take advantage of features.
> I've seen committers be seriously reluctant to allow changes to
> @experimental code that are needed in order for it to really work
> right.  You need to be able to iterate, and if people on both sides of
> the fence aren't going to respect that some newer apis are subject to
> change, then why even mark them as such?
> 
> Ideally a finished SIP should give me a checklist of things that an
> implementation must do, and things that it doesn't need to do.
> Contributors/committers should be seriously discouraged from putting
> out a version 0.1 that doesn't have at least a prototype
> implementation of all those things, especially if they're then going
> > to argue against interface changes necessary to get the rest of
> the things done in the 0.2 version.

Experimental APIs and alpha components are indeed supposed to be changeable 
(https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy). 
Maybe people are being too conservative in some cases, but I do want to note 
that regardless of what precise policy we try to write down, this type of issue 
will ultimately be a judgment call. Is it worth making a small cosmetic change 
in an API that's marked experimental, but has been used widely for a year? 
Perhaps not. Is it worth making it in something one month old, or even in an 
older API as we move to 2.0? Maybe yes. I think we should just discuss each one 
(start an email thread if resolving it on JIRA is too complex) and perhaps be 
more religious about making things non-experimental when we think they're done.

Matei


> 
> 
> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
>> I like the lightweight proposal to add a SIP label.
>> 
>> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> track the list of major changes, but that never really materialized due to
>> the overhead. Adding a SIP label on major JIRAs and then link to them
>> prominently on the Spark website makes a lot of sense.
>> 
>> 
>> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia 
>> wrote:
>>> 
>>> For the improvement proposals, I think one major point was to make them
>>> really visible to users who are not contributors, so we should do more than
>>> sending stuff to dev@. One very lightweight idea is to have a new type of
>>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>>> http://spark.apache.org. I also like the idea of SIP and design doc
>>> templates (in fact many projects have them).
>>> 
>>> Matei
>>> 
>>> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>>> 
>>> I called Cody last night and talked about some of the topics in his email.
>>> It became clear to me Cody genuinely cares about the project.
>>> 
>>> Some of the frustrations come from the success of the project itself
>>> becoming very "hot", and it is difficult to get clarity from people who
>>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>>> to scaling an engineering team in a successful startup: old processes that
>>> worked well might not work so well when it gets to a certain size, cultures
>>> can get diluted, building culture vs building process, etc.
>>> 
>>> I'd also really like to have a more visible process for larger changes,
>>> especially major user-facing API changes. Historically we upload design docs
>>> for major changes, but this is not always consistent, and it is hard to
>>> ensure the quality of the docs, due to the volunteer nature of the organization.
>>> 
>>> Some of the more concrete ideas we discussed focus on building a culture
>>> to improve clarity:
>>> 
>>> - Process: Large changes should have design docs posted on JIRA. One thing
>>> Cody and I didn't discuss but an idea that just came to me is we should
>>> create a design doc template for the project and ask everybody to follow.
>>> The design doc template should also explicitly list goals and non-goals, to
>>> make design doc more consistent.
>>> 
>>> - Process: Email dev@ to solicit feedback. We have done this with some
>>> changes, but again very inconsistently. Just posting something on JIRA isn't
>>> sufficient, because there are simply too many JIRAs and the signal gets lost
>>> in the noise. While this is generally impossible to enforce because we can't
>>> force all volunteers to conform to a process (or they might not even be
>>> aware of this), those who are more familiar with the project can help by
>>> emailing the dev@ when they see something that hasn't been.
>>> 
>>> - Culture: The design doc author(s) 

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-08 Thread Matei Zaharia
This makes a lot of sense; just to comment on a few things:

> - More committers
> Just looking at the ratio of committers to open tickets, or committers
> to contributors, I don't think you have enough human power.
> I realize this is a touchy issue.  I don't have a dog in this fight,
> because I'm not on either coast nor in a big company that views
> committership as a political thing.  I just think you need more people
> to do the work, and more diversity of viewpoint.
> It's unfortunate that the Apache governance process involves giving
> someone all the keys or none of the keys, but until someone really
> starts screwing up, I think it's better to err on the side of
> accepting hard-working people.

This is something the PMC is actively discussing. Historically, we've added 
committers when people contributed a new module or feature, basically to the 
point where other developers are asking them to review changes in that area 
(https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-BecomingaCommitter).
 For example, we added the original authors of GraphX when we merged in GraphX, 
the authors of new ML algorithms, etc. However, there's a good argument that 
some areas are simply not covered well now and we should add people there. 
Also, as the project has grown, there are also more people who focus on smaller 
fixes and are nonetheless contributing a lot.

> - Each major area of the code needs at least one person who cares
> about it that is empowered with a vote, otherwise decisions get made
> that don't make technical sense.
> I don't know if anyone with a vote is shepherding GraphX (or maybe
> it's just dead), the Mesos relationship has always been weird, no one
> with a vote really groks Kafka.
> marmbrus and zsxwing are getting there quickly on the Kafka side, and
> I appreciate it, but it's been bad for a while.
> Because I don't have any political power, my response to seeing things
> that I know are technically dangerous has been to yell really loud
> until someone listens, which sucks for everyone involved.
> I already apologized to Michael privately; Ryan, I'm sorry, it's not about 
> you.
> This seems pretty straightforward to fix, if politically awkward:
> those people exist, just give them a vote.
> Failing that, listen the first or second time they say something not
> the third or fourth, and if it doesn't make sense, ask.

Just as a note here -- it's true that some areas are not super well covered, 
but I also hope to avoid a situation where people have to yell to be listened 
to. I can't say anything about *all* technical discussions we've ever had, but 
historically, people have been able to comment on the design of many things 
without yelling. This is actually important because a culture of having to yell 
can drive away contributors. So it's awesome that you yelled about the Kafka 
source stuff, but at the same time, hopefully we make these types of things 
work without yelling. This would be a problem even if there were committers 
with more expertise in each area -- what if someone disagrees with the 
committers?

Matei





Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread Matei Zaharia
I like this idea of asking them. BTW, one other thing we can do *provided the 
JIRAs are eventually under control* is to create a filter for old JIRAs that 
have not received a response in X amount of time and have the system 
automatically email the dev list with this report every month. Then everyone 
can see the list of items and maybe be reminded to take care to clean it up. 
This only works if the list is manageable and you actually want to read all of 
it.
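
One lightweight way to build such a filter is a short script against the
public JIRA REST API; the sketch below is purely illustrative (the 90-day
threshold, the field list, and the idea of mailing the output from a cron
job are assumptions, not an agreed design):

    import requests

    # Find open SPARK issues with no update in 90+ days, oldest first.
    SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
    jql = ("project = SPARK AND status = Open "
           "AND updated <= -90d ORDER BY updated ASC")

    resp = requests.get(SEARCH_URL, params={
        "jql": jql,
        "fields": "summary,updated",
        "maxResults": 50,
    })
    resp.raise_for_status()

    for issue in resp.json()["issues"]:
        f = issue["fields"]
        print(issue["key"], f["updated"], f["summary"])
    # A cron job could format this report and mail it to dev@ each month.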

Matei

> On Oct 8, 2016, at 9:01 AM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> Yeah, I've interacted with other projects that used that system and it was 
> pleasant.
> 
> 1. "this is getting closed cause its stale, let us know if thats a problem"
> 2. "actually that matters to us"
> 3. "ok well leave it open"
> 
> I'd be fine with totally automating step 1 as long as a human was involved at 
> steps 2 and 3
> 
> 
> On Saturday, October 8, 2016, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
> I don’t really have much experience with large open source projects but I 
> have some experience with having lots of issues with no one handling them. 
> Automation proved a good solution in my experience, but one thing that I 
> found which was really important is giving people a chance to say “don’t 
> close this please”.
> 
> Basically, before closing you can send an email to the reporter (and 
> probably people who are watching the issue) and tell them this is going to be 
> closed. Allow them an option to ping back saying “don’t close this please” 
> which would ping committers for input (as if there were 5+ votes as described 
> by Nick).
> 
> The main reason for this is that many times people find solutions and the 
> issue does become stale but at other times, the issue is still important, it 
> is just that no one noticed it because of the noise of other issues.
> 
> Thanks,
> 
> Assaf.
> 
>  
> 
>  
> 
>  
> 
> From: Nicholas Chammas [via Apache Spark Developers List]
> Sent: Saturday, October 08, 2016 12:42 AM
> To: Mendelson, Assaf
> Subject: Re: Improving volunteer management / JIRAs (split from Spark 
> Improvement Proposals thread)
> 
>  
> 
> I agree with Cody and others that we need some automation — or at least an 
> adjusted process — to help us manage organic contributions better.
> 
> The objections about automated closing being potentially abrasive are 
> understood, but I wouldn’t accept that as a defeat for automation. Instead, 
> it seems like a constraint we should impose on any proposed solution: Make 
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut it, 
> and I don’t think adding committers will ever be a sufficient solution to 
> this particular problem.
> 
> To me, it seems like we need a way to filter out viable contributions with 
> community support from other contributions when it comes to deciding that 
> automated action is appropriate. Our current tooling isn’t perfect, but 
> perhaps we can leverage it to create such a filter.
> 
> For example, consider the following strawman proposal for how to cut down on 
> the number of pending but unviable proposals, and simultaneously help 
> contributors organize to promote viable proposals and get the attention of 
> committers:
> 
> 1.  Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been 
> updated in 20+ days (or D+ days, if you prefer).
> 
> 2.  Depending on the level of community support, either close the item or 
> ping specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+ votes 
> (or V+ votes), ping committers for input. (For PRs, you could count comments 
> from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less than 
> V votes, close it with a gentle message asking the contributor to solicit 
> support from either the community or a committer, and try again later.
> c. If the JIRA/PR has input from a committer or committers, ping them for an 
> update.
> 
> This is just a rough idea. The point is that when contributors have stale 
> proposals that they don’t close, committers need to take action. A little 
> automation to selectively bring contributions to the attention of committers 
> can perhaps help them manage the backlog of stale contributions. The 
> “selective” part is implemented in this strawman proposal by using JIRA votes 
> as a crude proxy for when 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
Yeah, I've interacted with other projects that used that system and it was
pleasant.

1. "this is getting closed cause its stale, let us know if thats a problem"
2. "actually that matters to us"
3. "ok well leave it open"

I'd be fine with totally automating step 1 as long as a human was involved
at steps 2 and 3.


On Saturday, October 8, 2016, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> I don’t really have much experience with large open source projects but I
> have some experience with having lots of issues with no one handling them.
> Automation proved a good solution in my experience, but one thing that I
> found which was really important is giving people a chance to say “don’t
> close this please”.
>
> Basically, before closing you can send an email to the reporter (and
> probably people who are watching the issue) and tell them this is going to
> be closed. Allow them an option to ping back saying “don’t close this
> please” which would ping committers for input (as if there were 5+ votes as
> described by Nick).
>
> The main reason for this is that many times people find solutions and the
> issue does become stale but at other times, the issue is still important,
> it is just that no one noticed it because of the noise of other issues.
>
> Thanks,
>
> Assaf.
>
>
>
>
>
>
>
> *From:* Nicholas Chammas [via Apache Spark Developers List]
> *Sent:* Saturday, October 08, 2016 12:42 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Improving volunteer management / JIRAs (split from Spark
> Improvement Proposals thread)
>
>
>
> I agree with Cody and others that we need some automation — or at least an
> adjusted process — to help us manage organic contributions better.
>
> The objections about automated closing being potentially abrasive are
> understood, but I wouldn’t accept that as a defeat for automation. Instead,
> it seems like a constraint we should impose on any proposed solution: Make
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut
> it, and I don’t think adding committers will ever be a sufficient solution
> to this particular problem.
>
> To me, it seems like we need a way to filter out viable contributions with
> community support from other contributions when it comes to deciding that
> automated action is appropriate. Our current tooling isn’t perfect, but
> perhaps we can leverage it to create such a filter.
>
> For example, consider the following strawman proposal for how to cut down
> on the number of pending but unviable proposals, and simultaneously help
> contributors organize to promote viable proposals and get the attention of
> committers:
>
> 1.  Have a bot scan for *stale* JIRA issues and PRs—i.e. they haven’t
> been updated in 20+ days (or D+ days, if you prefer).
>
> 2.  Depending on the level of community support, either close the
> item or ping specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
> votes (or V+ votes), ping committers for input. (For PRs, you could count
> comments from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> than V votes, close it with a gentle message asking the contributor to
> solicit support from either the community or a committer, and try again
> later.
> c. If the JIRA/PR has input from a committer or committers, ping them for
> an update.
>
> This is just a rough idea. The point is that when contributors have stale
> proposals that they don’t close, committers need to take action. A little
> automation to selectively bring contributions to the attention of
> committers can perhaps help them manage the backlog of stale contributions.
> The “selective” part is implemented in this strawman proposal by using JIRA
> votes as a crude proxy for when the community is interested in something,
> but it could be anything.
>
> Also, this doesn’t have to be used just to clear out stale proposals. Once
> the initial backlog is trimmed down, you could set D to 5 days and use
> this as a regular way to bring contributions to the attention of committers.
>
> I dunno if people think this is perhaps too complex, but at our scale I
> feel we need some kind of loose but automated system for funneling
> contributions through some kind of lifecycle. The status quo is just not
> that good (e.g. 474 open PRs <https://github.com/apache/spark/pulls>
> against Spark as of this moment).
>
> Nick
>
>
>
>
>
> O

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
Alright, looks like there is quite a bit of support. We should wait to hear
from more people too.

To push this forward, Cody and I will be working together in the next
couple of weeks to come up with a concrete, detailed proposal on what this
entails, and then we can discuss the specific proposal as well.


On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger  wrote:

> Yeah, in case it wasn't clear, I was talking about SIPs for major
> user-facing or cross-cutting changes, not minor feature adds.
>
> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:
>
>> +1 to the SIP label as long as it does not slow things down and it
>> targets optimizing efforts, coordination, etc. For example, really small
>> features (assuming they don't touch public interfaces) or re-factorings
>> should not need to go through this process, and I hope it will be kept
>> this way. So a guideline doc should be provided, like in the KIP case.
>>
>> IMHO, aside from tagging things and linking them elsewhere, simply
>> having design docs and prototype implementations in PRs is not something
>> that has worked so far. What is really a pain in many projects out there
>> is discontinuity in the progress of PRs, missing features, and slow
>> reviews, which is understandable to some extent... it is not only about
>> Spark, but things can be improved for sure for this project in
>> particular, as already stated.
>>
>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger 
>> wrote:
>>
>>> +1 to adding an SIP label and linking it from the website.  I think it
>>> needs
>>>
>>> - template that focuses it towards soliciting user goals / non goals
>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>>> recommend a vote.
>>>
>>> Matei asked me to clarify what I meant by changing interfaces, I think
>>> it's directly relevant to the SIP idea so I'll clarify here, and split
>>> a thread for the other discussion per Nicholas' request.
>>>
>>> I meant changing public user interfaces.  I think the first design is
>>> unlikely to be right, because it's done at a time when you have the
>>> least information.  As a user, I find it considerably more frustrating
>>> to be unable to use a tool to get my job done, than I do having to
>>> make minor changes to my code in order to take advantage of features.
>>> I've seen committers be seriously reluctant to allow changes to
>>> @experimental code that are needed in order for it to really work
>>> right.  You need to be able to iterate, and if people on both sides of
>>> the fence aren't going to respect that some newer apis are subject to
>>> change, then why even mark them as such?
>>>
>>> Ideally a finished SIP should give me a checklist of things that an
>>> implementation must do, and things that it doesn't need to do.
>>> Contributors/committers should be seriously discouraged from putting
>>> out a version 0.1 that doesn't have at least a prototype
>>> implementation of all those things, especially if they're then going
>>> to argue against interface changes necessary to get the rest of
>>> the things done in the 0.2 version.
>>>
>>>
>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
>>> > I like the lightweight proposal to add a SIP label.
>>> >
>>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki
>>> to
>>> > track the list of major changes, but that never really materialized
>>> due to
>>> > the overhead. Adding a SIP label on major JIRAs and then link to them
>>> > prominently on the Spark website makes a lot of sense.
>>> >
>>> >
>>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <
>>> matei.zaha...@gmail.com>
>>> > wrote:
>>> >>
>>> >> For the improvement proposals, I think one major point was to make
>>> them
>>> >> really visible to users who are not contributors, so we should do
>>> more than
>>> >> sending stuff to dev@. One very lightweight idea is to have a new
>>> type of
>>> >> JIRA called a SIP and have a link to a filter that shows all such
>>> JIRAs from
>>> >> http://spark.apache.org. I also like the idea of SIP and design doc
>>> >> templates (in fact many projects have them).
>>> >>
>>> >> Matei
>>> >>
>>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>>> >>
>>> >> I called Cody last night and talked about some of the topics in his
>>> email.
>>> >> It became clear to me Cody genuinely cares about the project.
>>> >>
>>> >> Some of the frustrations come from the success of the project itself
>>> >> becoming very "hot", and it is difficult to get clarity from people
>>> who
>>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>>> similar
>>> >> to scaling an engineering team in a successful startup: old processes
>>> that
>>> >> worked well might not work so well when it gets to a certain size,
>>> cultures
>>> >> can get diluted, building culture vs building process, etc.
>>> >>
>>> >> 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
Ah yes, on a given JIRA issue the number of watchers is often a better
indicator of community interest than votes.

But yeah, it could be any metric or formula we want, as long as it yielded
a "reasonable" bar to cross for unsolicited contributions to get committer
review--or at the very least a comment from them saying yes/no/later.
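
As a purely illustrative example of such a formula (the weights and the
threshold below are arbitrary placeholders, not anything agreed upon):

    def meets_bar(votes: int, watchers: int, min_score: float = 5.0) -> bool:
        # Watchers signal interest, but more weakly than explicit votes.
        return votes + 0.5 * watchers >= min_score

    print(meets_bar(votes=3, watchers=6))  # True: 3 + 3.0 >= 5.0
    print(meets_bar(votes=1, watchers=2))  # False: 1 + 1.0 < 5.0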

On Fri, Oct 7, 2016 at 5:59 PM Cody Koeninger  wrote:

> I really like the idea of using jira votes (and/or watchers?) as a filter!
>
> On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas
>  wrote:
> > I agree with Cody and others that we need some automation — or at least
> an
> > adjusted process — to help us manage organic contributions better.
> >
> > The objections about automated closing being potentially abrasive are
> > understood, but I wouldn’t accept that as a defeat for automation.
> Instead,
> > it seems like a constraint we should impose on any proposed solution:
> Make
> > sure it doesn’t turn contributors off. Rolling as we have been won’t cut
> it,
> > and I don’t think adding committers will ever be a sufficient solution to
> > this particular problem.
> >
> > To me, it seems like we need a way to filter out viable contributions
> with
> > community support from other contributions when it comes to deciding that
> > automated action is appropriate. Our current tooling isn’t perfect, but
> > perhaps we can leverage it to create such a filter.
> >
> > For example, consider the following strawman proposal for how to cut
> down on
> > the number of pending but unviable proposals, and simultaneously help
> > contributors organize to promote viable proposals and get the attention
> of
> > committers:
> >
> > Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been
> updated
> > in 20+ days (or D+ days, if you prefer).
> > Depending on the level of community support, either close the item or
> ping
> > specific people for action. Specifically:
> > a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
> votes
> > (or V+ votes), ping committers for input. (For PRs, you could count
> comments
> > from different people, or thumbs up on the initial PR post.)
> > b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> > than V votes, close it with a gentle message asking the contributor to
> > solicit support from either the community or a committer, and try again
> > later.
> > c. If the JIRA/PR has input from a committer or committers, ping them
> for an
> > update.
> >
> > This is just a rough idea. The point is that when contributors have stale
> > proposals that they don’t close, committers need to take action. A little
> > automation to selectively bring contributions to the attention of
> committers
> > can perhaps help them manage the backlog of stale contributions. The
> > “selective” part is implemented in this strawman proposal by using JIRA
> > votes as a crude proxy for when the community is interested in something,
> > but it could be anything.
> >
> > Also, this doesn’t have to be used just to clear out stale proposals.
> Once
> > the initial backlog is trimmed down, you could set D to 5 days and use
> this
> > as a regular way to bring contributions to the attention of committers.
> >
> > I dunno if people think this is perhaps too complex, but at our scale I
> feel
> > we need some kind of loose but automated system for funneling
> contributions
> > through some kind of lifecycle. The status quo is just not that good
> (e.g.
> > 474 open PRs against Spark as of this moment).
> >
> > Nick
> >
> >
> > On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger 
> wrote:
> >>
> >> Matei asked:
> >>
> >>
> >> > I agree about empowering people interested here to contribute, but I'm
> >> > wondering, do you think there are technical things that people don't
> want to
> >> > work on, or is it a matter of what there's been time to do?
> >>
> >>
> >> It's a matter of mismanagement and miscommunication.
> >>
> >> The structured streaming kafka jira sat with multiple unanswered
> >> requests for someone who was a committer to communicate whether they
> >> were working on it and what the plan was.  I could have done that
> >> implementation and had it in users' hands months ago.  I didn't
> >> pre-emptively do it because I didn't want to then have to argue with
> >> committers about why my code did or did not meet their uncommunicated
> >> expectations.
> >>
> >>
> >> I don't want to re-hash that particular circumstance, I just want to
> >> make sure it never happens again.
> >>
> >>
> >> Hopefully the SIP thread results in clearer expectations, but there
> >> are still some ideas on the table regarding management of volunteer
> >> contributions:
> >>
> >>
> >> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
> >> the alternative of "someone cleans it up" is not sufficient right now
> >> (with apologies to Sean and all the other janitors).
> >>
> 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
I really like the idea of using jira votes (and/or watchers?) as a filter!

On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas
 wrote:
> I agree with Cody and others that we need some automation — or at least an
> adjusted process — to help us manage organic contributions better.
>
> The objections about automated closing being potentially abrasive are
> understood, but I wouldn’t accept that as a defeat for automation. Instead,
> it seems like a constraint we should impose on any proposed solution: Make
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut it,
> and I don’t think adding committers will ever be a sufficient solution to
> this particular problem.
>
> To me, it seems like we need a way to filter out viable contributions with
> community support from other contributions when it comes to deciding that
> automated action is appropriate. Our current tooling isn’t perfect, but
> perhaps we can leverage it to create such a filter.
>
> For example, consider the following strawman proposal for how to cut down on
> the number of pending but unviable proposals, and simultaneously help
> contributors organize to promote viable proposals and get the attention of
> committers:
>
> Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been updated
> in 20+ days (or D+ days, if you prefer).
> Depending on the level of community support, either close the item or ping
> specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+ votes
> (or V+ votes), ping committers for input. (For PRs, you could count comments
> from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> than V votes, close it with a gentle message asking the contributor to
> solicit support from either the community or a committer, and try again
> later.
> c. If the JIRA/PR has input from a committer or committers, ping them for an
> update.
>
> This is just a rough idea. The point is that when contributors have stale
> proposals that they don’t close, committers need to take action. A little
> automation to selectively bring contributions to the attention of committers
> can perhaps help them manage the backlog of stale contributions. The
> “selective” part is implemented in this strawman proposal by using JIRA
> votes as a crude proxy for when the community is interested in something,
> but it could be anything.
>
> Also, this doesn’t have to be used just to clear out stale proposals. Once
> the initial backlog is trimmed down, you could set D to 5 days and use this
> as a regular way to bring contributions to the attention of committers.
>
> I dunno if people think this is perhaps too complex, but at our scale I feel
> we need some kind of loose but automated system for funneling contributions
> through some kind of lifecycle. The status quo is just not that good (e.g.
> 474 open PRs against Spark as of this moment).
>
> Nick
>
>
> On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger  wrote:
>>
>> Matei asked:
>>
>>
>> > I agree about empowering people interested here to contribute, but I'm
>> > wondering, do you think there are technical things that people don't want 
>> > to
>> > work on, or is it a matter of what there's been time to do?
>>
>>
>> It's a matter of mismanagement and miscommunication.
>>
>> The structured streaming kafka jira sat with multiple unanswered
>> requests for someone who was a committer to communicate whether they
>> were working on it and what the plan was.  I could have done that
>> implementation and had it in users' hands months ago.  I didn't
>> pre-emptively do it because I didn't want to then have to argue with
>> committers about why my code did or did not meet their uncommunicated
>> expectations.
>>
>>
>> I don't want to re-hash that particular circumstance, I just want to
>> make sure it never happens again.
>>
>>
>> Hopefully the SIP thread results in clearer expectations, but there
>> are still some ideas on the table regarding management of volunteer
>> contributions:
>>
>>
>> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
>> the alternative of "someone cleans it up" is not sufficient right now
>> (with apologies to Sean and all the other janitors).
>>
>> - Clear rejection of jiras.  This isn't mean, it's respectful.
>>
>> - Clear "I'm working on this", with clear removal and reassignment if
>> they go radio silent.  This could be keyed to automated check for
>> staleness.
>>
>> - Clear expectation that if someone is working on a jira, you can work
>> on your own alternative, but you need to communicate.
>>
>>
>> I'm sure I've missed some.
>>
>


Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
I agree with Cody and others that we need some automation — or at least an
adjusted process — to help us manage organic contributions better.

The objections about automated closing being potentially abrasive are
understood, but I wouldn’t accept that as a defeat for automation. Instead,
it seems like a constraint we should impose on any proposed solution: Make
sure it doesn’t turn contributors off. Rolling as we have been won’t cut
it, and I don’t think adding committers will ever be a sufficient solution
to this particular problem.

To me, it seems like we need a way to filter out viable contributions with
community support from other contributions when it comes to deciding that
automated action is appropriate. Our current tooling isn’t perfect, but
perhaps we can leverage it to create such a filter.

For example, consider the following strawman proposal for how to cut down
on the number of pending but unviable proposals, and simultaneously help
contributors organize to promote viable proposals and get the attention of
committers:

   1. Have a bot scan for *stale* JIRA issues and PRs—i.e. they haven’t
   been updated in 20+ days (or D+ days, if you prefer).
   2. Depending on the level of community support, either close the item or
   ping specific people for action. Specifically:
   a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
   votes (or V+ votes), ping committers for input. (For PRs, you could
   count comments from different people, or thumbs up on the initial PR post.)
   b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
   than V votes, close it with a gentle message asking the contributor to
   solicit support from either the community or a committer, and try again
   later.
   c. If the JIRA/PR has input from a committer or committers, ping them
   for an update.

This is just a rough idea. The point is that when contributors have stale
proposals that they don’t close, committers need to take action. A little
automation to selectively bring contributions to the attention of
committers can perhaps help them manage the backlog of stale contributions.
The “selective” part is implemented in this strawman proposal by using JIRA
votes as a crude proxy for when the community is interested in something,
but it could be anything.
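
As a sketch of how steps 1 and 2 might look in practice (the endpoint and
field names are per the public JIRA REST API as I understand it; the
thresholds D and V, the committer list, and the actions themselves are the
strawman's parameters, not settled choices):

    import requests

    JIRA = "https://issues.apache.org/jira/rest/api/2"
    D_DAYS, V_VOTES = 20, 5

    def stale_spark_issues():
        # Step 1: scan for open issues not updated in D+ days.
        jql = f"project = SPARK AND status = Open AND updated <= -{D_DAYS}d"
        resp = requests.get(f"{JIRA}/search",
                            params={"jql": jql, "fields": "votes,comment"})
        resp.raise_for_status()
        return resp.json()["issues"]

    def triage(issue, committers):
        # Step 2: route by committer involvement and community support.
        fields = issue["fields"]
        commenters = {c["author"]["name"]
                      for c in fields["comment"]["comments"]}
        if commenters & committers:
            return "ping committers for an update"  # case (c)
        if fields["votes"]["votes"] >= V_VOTES:
            return "ping committers for input"      # case (a)
        return "close with a gentle message"        # case (b)

The same triage could be pointed at pull requests through the GitHub API,
counting distinct commenters or thumbs-up reactions in place of votes.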

Also, this doesn’t have to be used just to clear out stale proposals. Once
the initial backlog is trimmed down, you could set D to 5 days and use this
as a regular way to bring contributions to the attention of committers.

I dunno if people think this is perhaps too complex, but at our scale I
feel we need some kind of loose but automated system for funneling
contributions through some kind of lifecycle. The status quo is just not
that good (e.g. 474 open PRs <https://github.com/apache/spark/pulls>
against Spark as of this moment).

Nick

On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger  wrote:

> Matei asked:
>
>
> > I agree about empowering people interested here to contribute, but I'm
> wondering, do you think there are technical things that people don't want
> to work on, or is it a matter of what there's been time to do?
>
>
> It's a matter of mismanagement and miscommunication.
>
> The structured streaming kafka jira sat with multiple unanswered
> requests for someone who was a committer to communicate whether they
> were working on it and what the plan was.  I could have done that
> implementation and had it in users' hands months ago.  I didn't
> pre-emptively do it because I didn't want to then have to argue with
> committers about why my code did or did not meet their uncommunicated
> expectations.
>
>
> I don't want to re-hash that particular circumstance, I just want to
> make sure it never happens again.
>
>
> Hopefully the SIP thread results in clearer expectations, but there
> are still some ideas on the table regarding management of volunteer
> contributions:
>
>
> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
> the alternative of "someone cleans it up" is not sufficient right now
> (with apologies to Sean and all the other janitors).
>
> - Clear rejection of jiras.  This isn't mean, it's respectful.
>
> - Clear "I'm working on this", with clear removal and reassignment if
> they go radio silent.  This could be keyed to automated check for
> staleness.
>
> - Clear expectation that if someone is working on a jira, you can work
> on your own alternative, but you need to communicate.
>
>
> I'm sure I've missed some.
>
>


Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Yeah, in case it wasn't clear, I was talking about SIPs for major
user-facing or cross-cutting changes, not minor feature adds.

On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1 to the SIP label as long as it does not slow things down and it targets
> optimizing efforts, coordination, etc. For example, really small features
> (assuming they don't touch public interfaces) or re-factorings should not
> need to go through this process, and I hope it will be kept this way. So a
> guideline doc should be provided, like in the KIP case.
>
> IMHO, aside from tagging things and linking them elsewhere, simply having
> design docs and prototype implementations in PRs is not something that has
> worked so far. What is really a pain in many projects out there is
> discontinuity in the progress of PRs, missing features, and slow reviews,
> which is understandable to some extent... it is not only about Spark, but
> things can be improved for sure for this project in particular, as already
> stated.
>
> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger 
> wrote:
>
>> +1 to adding an SIP label and linking it from the website.  I think it
>> needs
>>
>> - template that focuses it towards soliciting user goals / non goals
>> - clear resolution as to which strategy was chosen to pursue.  I'd
>> recommend a vote.
>>
>> Matei asked me to clarify what I meant by changing interfaces, I think
>> it's directly relevant to the SIP idea so I'll clarify here, and split
>> a thread for the other discussion per Nicholas' request.
>>
>> I meant changing public user interfaces.  I think the first design is
>> unlikely to be right, because it's done at a time when you have the
>> least information.  As a user, I find it considerably more frustrating
>> to be unable to use a tool to get my job done than to have to make
>> minor changes to my code in order to take advantage of features.
>> I've seen committers be seriously reluctant to allow changes to
>> @experimental code that are needed in order for it to really work
>> right.  You need to be able to iterate, and if people on both sides of
>> the fence aren't going to respect that some newer APIs are subject to
>> change, then why even mark them as such?
>>
>> Ideally a finished SIP should give me a checklist of things that an
>> implementation must do, and things that it doesn't need to do.
>> Contributors/committers should be seriously discouraged from putting
>> out a version 0.1 that doesn't have at least a prototype
>> implementation of all those things, especially if they're then going
>> to argue against interface changes necessary to get the rest of the
>> things done in the 0.2 version.
>>
>>
>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
>> > I like the lightweight proposal to add a SIP label.
>> >
>> > During Spark 2.0 development, Tom (Graves) and I suggested using the
>> > wiki to track the list of major changes, but that never really
>> > materialized due to the overhead. Adding a SIP label on major JIRAs and
>> > then linking to them prominently on the Spark website makes a lot of
>> > sense.
>> >
>> >
>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia wrote:
>> >>
>> >> For the improvement proposals, I think one major point was to make
>> >> them really visible to users who are not contributors, so we should do
>> >> more than sending stuff to dev@. One very lightweight idea is to have a
>> >> new type of JIRA called a SIP and have a link to a filter that shows
>> >> all such JIRAs from http://spark.apache.org. I also like the idea of
>> >> SIP and design doc templates (in fact many projects have them).
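>> >>
>> >> (For illustration, such a filter could be a saved JQL query along the
>> >> lines of: project = SPARK AND labels = SIP ORDER BY created DESC --
>> >> shared publicly and linked from the site. The label name is an
>> >> assumption here, not something settled in this thread.)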
>> >>
>> >> Matei
>> >>
>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>> >>
>> >> I called Cody last night and talked about some of the topics in his
>> >> email. It became clear to me that Cody genuinely cares about the
>> >> project.
>> >>
>> >> Some of the frustrations come from the project's very success: it has
>> >> become very "hot", and it is difficult to get clarity from people who
>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>> >> similar to scaling an engineering team in a successful startup: old
>> >> processes that worked well might not work so well when it gets to a
>> >> certain size, cultures can get diluted, building culture vs building
>> >> process, etc.
>> >>
>> >> I would also really like to have a more visible process for larger
>> >> changes, especially major user-facing API changes. Historically we have
>> >> uploaded design docs for major changes, but it has not always been
>> >> consistent, and it is difficult to ensure the quality of the docs,
>> >> given the volunteer nature of the organization.
>> >>
>> >> Some of the more concrete ideas we discussed focus on building a
>> >> culture to improve clarity:
>> >>
>> >> - Process: Large changes should have design docs posted on JIRA.

Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
Matei asked:


> I agree about empowering people interested here to contribute, but I'm 
> wondering, do you think there are technical things that people don't want to 
> work on, or is it a matter of what there's been time to do?


It's a matter of mismanagement and miscommunication.

The structured streaming kafka jira sat with multiple unanswered
requests for someone who was a committer to communicate whether they
were working on it and what the plan was.  I could have done that
implementation and had it in users' hands months ago.  I didn't
pre-emptively do it because I didn't want to then have to argue with
committers about why my code did or did not meet their uncommunicated
expectations.


I don't want to re-hash that particular circumstance; I just want to
make sure it never happens again.


Hopefully the SIP thread results in clearer expectations, but there
are still some ideas on the table regarding management of volunteer
contributions:


- Closing stale jiras.  I hear the "bots are impersonal" argument, but
the alternative of "someone cleans it up" is not sufficient right now
(with apologies to Sean and all the other janitors).

- Clear rejection of jiras.  This isn't mean, it's respectful.

- Clear "I'm working on this", with clear removal and reassignment if
they go radio silent.  This could be keyed to automated check for
staleness.

- Clear expectation that if someone is working on a jira, you can work
on your own alternative, but you need to communicate.


I'm sure I've missed some.
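
(To make the staleness idea concrete: a minimal sketch of such an
automated check, in Python, assuming the third-party "jira" client
library and a 180-day threshold -- both illustrative choices, not
anything agreed in this thread:

    # pip install jira
    from jira import JIRA

    # Anonymous, read-only access to the ASF JIRA instance.
    jira = JIRA(server="https://issues.apache.org/jira")

    # JQL: open SPARK issues with no updates in roughly six months.
    stale = jira.search_issues(
        'project = SPARK AND status in (Open, "In Progress") '
        'AND updated <= "-180d" ORDER BY updated ASC',
        maxResults=50,
    )

    for issue in stale:
        # A real bot would comment first, wait out a grace period, and
        # only then close or unassign; here we just list the candidates.
        print(issue.key, issue.fields.updated, issue.fields.summary)

Something this simple is what makes "clear removal and reassignment"
cheap enough to actually happen.)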




  1   2   >