Re: Packages to release in 3.0.0-preview

2019-10-31 Thread Cody Koeninger
On Thu, Oct 31, 2019 at 4:30 PM Sean Owen  wrote:
>
> . But it'd be cooler to call these major
> releases!


Maybe this is just semantics, but my point is that the Scala project
already does call 2.12 to 2.13 a major release,

e.g. from https://www.scala-lang.org/download/

"Note that different *major* releases of Scala (e.g. Scala 2.11.x and
Scala 2.12.x) are not binary compatible with each other."

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Packages to release in 3.0.0-preview

2019-10-31 Thread Cody Koeninger
On Wed, Oct 30, 2019 at 5:57 PM Sean Owen  wrote:

> Or, frankly, maybe Scala should reconsider the mutual incompatibility
> between minor releases. These are basically major releases, and
> indeed, it causes exactly this kind of headache.
>


Not saying binary incompatibility is fun, but 2.12 to 2.13 is a major
release, not a minor one.  Scala pre-dates semantic versioning;
the second digit denotes major releases.

scala 2.13.0 Jun 7, 2019
scala 2.12.0 Nov 2, 2016

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Structured streaming from Kafka by timestamp

2019-02-05 Thread Cody Koeninger
To be more explicit, the easiest thing to do in the short term is use
your own instance of KafkaConsumer to get the offsets for the
timestamps you're interested in, using offsetsForTimes, and use those
for the start / end offsets.  See
https://kafka.apache.org/10/javadoc/?org/apache/kafka/clients/consumer/KafkaConsumer.html

Even if you are interested in implementing timestamp filter pushdown,
you need to get that basic usage working first, so I'd start there.
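
A rough sketch of that first step (untested; the broker list and topic are placeholders, and it assumes kafka-clients 0.10.1+ on the classpath for offsetsForTimes):

import java.{util => ju}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new ju.Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val fiveMinutesAgo = System.currentTimeMillis() - 5 * 60 * 1000L

// ask the broker for the earliest offset at or after the timestamp, per partition
val partitions = consumer.partitionsFor("my_topic").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val query = partitions.map(tp => tp -> java.lang.Long.valueOf(fiveMinutesAgo)).toMap.asJava
val offsets = consumer.offsetsForTimes(query).asScala.collect {
  case (tp, offsetAndTs) if offsetAndTs != null => tp.partition -> offsetAndTs.offset
}
consumer.close()

// feed the result to the kafka source as the startingOffsets JSON option
val startingOffsets =
  "{\"my_topic\": {" +
    offsets.map { case (p, o) => "\"" + p + "\": " + o }.mkString(", ") +
    "}}"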

On Fri, Feb 1, 2019 at 11:08 AM Tomas Bartalos  wrote:
>
> Hello,
>
> sorry for my late answer.
> You're right, what I'm doing is a one-time query, not structured streaming.
> Probably it will be best to describe my use case:
> I'd like to expose live data (via JDBC/ODBC) residing in Kafka with the power
> of Spark's distributed SQL engine. As the JDBC server I use the Spark Thrift Server.
> Since timestamp pushdown is not possible :-(, this is a very cumbersome task.
> Let's say I want to inspect the last 5 minutes of Kafka. First I have to find out
> the offsetFrom for each partition that corresponds to now() - 5 minutes.
> Then I can register a kafka table:
>
> CREATE TABLE ticket_kafka_x USING kafka OPTIONS (kafka.bootstrap.servers 
> 'server1,server2,...',
>
> subscribe 'my_topic',
>
> startingOffsets '{"my_topic" : {"0" : 48532124, "1" : 49029703, "2" : 
> 49456213, "3" : 48400521}}');
>
>
> Then I can issue queries against this table (Data in Kafka is stored in Avro 
> format but I've created custom genericUDF to deserialize the data).
>
> select event.id as id, explode(event.picks) as picks from (
>
> select from_avro(value) as event from ticket_kafka_x where timestamp >
> from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
>
> ) limit 100;
>
>
> What's even more irritating is that after a few minutes I have to re-create this
> table to reflect the last-5-minute interval; otherwise query performance would
> suffer from the increasing amount of data to filter.
>
> A colleague of mine was able to make direct queries with timestamp pushdown in
> the latest Hive.
> How difficult would it be to implement this feature in Spark? Could you point me
> to the code where I could have a look?
>
> Thank you,
>
>
> On Fri, Jan 25, 2019 at 0:32, Shixiong(Ryan) Zhu wrote:
>>
>> Hey Tomas,
>>
>> From your description, you just ran a batch query rather than a Structured 
>> Streaming query. The Kafka data source doesn't support filter push down 
>> right now. But that's definitely doable. One workaround here is setting 
>> proper  "startingOffsets" and "endingOffsets" options when loading from 
>> Kafka.
>>
>> Best Regards,
>>
>> Ryan
>>
>>
>> On Thu, Jan 24, 2019 at 10:15 AM Gabor Somogyi  
>> wrote:
>>>
>>> Hi Tomas,
>>>
>>> As a general note, I don't fully understand your use case. You've mentioned
>>> structured streaming, but your query is more like a one-time SQL statement.
>>> Kafka doesn't support predicates in the way it's integrated with Spark. What can
>>> be done from the Spark perspective is to look up the offset for a specific
>>> lowest timestamp and start reading from there.
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Jan 24, 2019 at 6:38 PM Tomas Bartalos  
>>> wrote:

 Hello,

 I'm trying to read Kafka via Spark Structured Streaming, and I'm trying to
 read data within a specific time range:

 select count(*) from kafka_table where timestamp > cast('2019-01-23 1:00' 
 as TIMESTAMP) and timestamp < cast('2019-01-23 1:01' as TIMESTAMP);


 The problem is that the timestamp filter is not pushed down to Kafka, so Spark
 tries to read the whole topic from the beginning.


 explain query:

 

  +- *(1) Filter ((isnotnull(timestamp#57) && (timestamp#57 > 
 15351480)) && (timestamp#57 < 15352344))


 Scan KafkaRelation(strategy=Subscribe[keeper.Ticket.avro.v1---production], 
 start=EarliestOffsetRangeLimit, end=LatestOffsetRangeLimit) 
 [key#52,value#53,topic#54,partition#55,offset#56L,timestamp#57,timestampType#58]
  PushedFilters: [], ReadSchema: 
 struct<...>

 Obviously the query takes forever to complete. Is there a solution to this?

 I'm using kafka and kafka-client version 1.1.1


 BR,

 Tomas

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Cody Koeninger
I feel like I've already said my piece on
https://github.com/apache/spark/pull/22138; let me know if you have
more questions.

As for SS in general, I don't have a production SS deployment, so I'm
less comfortable with reviewing large changes to it.  But if no other
committers are working on it...

On Sun, Jan 13, 2019 at 5:19 PM Sean Owen  wrote:
>
> Yes you're preaching to the choir here. SS does seem somewhat
> abandoned by those that have worked on it. I have also been at times
> frustrated that some areas fall into this pattern.
>
> There isn't a way to make people work on it, and I personally am not
> interested in it nor have a background in SS.
>
> I did leave some comments on your PR and will see if we can get
> comfortable with merging it, as I presume you are pretty knowledgeable
> about the change.
>
> On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim  wrote:
> >
> > Sean, this is actually a fallback after pinging committers. I know who can
> > review and merge in the SS area, and I pinged them, but it didn't work. There's
> > even a PR whose approach was encouraged by a committer and whose first phase
> > was reviewed, but which now gets no review.
> >
> > That's not the first time I have faced this situation, and I used the
> > fallback approach at that time. (You can see there was no response even in
> > the mail thread.) Not sure which approach worked.
> > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
> >
> > I've observed that only (critical) bug fixes are being reviewed and merged
> > in time in the SS area. For other things like new features and improvements,
> > both discussions and PRs get little attention from committers, even though
> > there was participation and approval from the non-committer community. I don't
> > think SS should be turned into a maintenance-only module.
> >
> > I guess PMC members should try to resolve this situation, as it will
> > (slowly and quietly) cause issues like contributors leaving, the module
> > stopping growing, etc. The problem will snowball, getting bigger and bigger.
> > I wouldn't mind if there were no interest from either contributors or
> > committers in a module, but SS is not such a case. Maybe other committers
> > who aren't familiar with the area should try to get familiar and cover it,
> > or the area needs more committers.
> >
> > -Jungtaek Lim (HeartSaVioR)
> >
> > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen wrote:
> >>
> >> Jungtaek, the best strategy is to find who wrote the code you are
> >> modifying (use Github history or git blame) and ping them directly on
> >> the PR. I don't know this code well myself.
> >> It also helps if you can address why the functionality is important,
> >> and describe compatibility implications.
> >>
> >> Most PRs are not merged, note. Not commenting on this particular one,
> >> but it's not a 'bug' if it's not being merged.
> >>
> >> On Sun, Jan 13, 2019 at 12:29 AM Jungtaek Lim  wrote:
> >> >
> >> > I'm sorry, but let me bring this up again: non-SS PRs are being reviewed
> >> > in a timely manner, whereas many SS PRs (regardless of who created them) are
> >> > still not reviewed and merged in time.
> >> >
> >> > 2019년 1월 3일 (목) 오전 7:57, Jungtaek Lim 님이 작성:
> >> >>
> >> >> Spark devs, happy new year!
> >> >>
> >> >> I would like to kindly bring this up again, since there has actually been
> >> >> no review since I initiated the thread.
> >> >>
> >> >> Thanks,
> >> >> Jungtaek Lim (HeartSaVioR)
> >> >>
> >> >> On Wed, Dec 12, 2018 at 11:12 PM, Vaclav Kosar wrote:
> >> >>>
> >> >>> I am also waiting for finalization of my PR [3]. It seems that SS
> >> >>> PRs are not being reviewed much these days.
> >> >>>
> >> >>> [3] https://github.com/apache/spark/pull/21919
> >> >>>
> >> >>>
> >> >>> On 12. 12. 18 14:37, Dongjin Lee wrote:
> >> >>>
> >> >>> If it is possible, could you review my PR on Kafka's header 
> >> >>> functionality[^1] also? It was added in Kafka 0.11.0.0 but still not 
> >> >>> supported in Spark.
> >> >>>
> >> >>> Thanks,
> >> >>> Dongjin
> >> >>>
> >> >>> [^1]: https://github.com/apache/spark/pull/22282
> >> >>> [^2]: https://issues.apache.org/jira/browse/KAFKA-4208
> >> >>>
> >> >>> On Wed, Dec 12, 2018 at 6:43 PM Jungtaek Lim  wrote:
> >> 
> >>  Hi devs,
> >> 
> >>  May I kindly ask for reviews on PRs for Structured Streaming? I
> >>  have 5 open pull requests on the SS side [1] (the earliest was opened
> >>  around 4 months ago), and there look to be a couple of PRs from others
> >>  [2] which would be good to review, too.
> >> 
> >>  Thanks in advance,
> >>  Jungtaek Lim (HeartSaVioR)
> >> 
> >>  1. 
> >>  https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+author%3AHeartSaVioR+%5BSS%5D
> >>  2. 
> >>  https://github.com/apache/spark/pulls?utf8=%E2%9C%93=is%3Aopen+is%3Apr+%5BSS%5D+
> >> 
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Dongjin Lee
> >> >>>
> >> >>> A hitchhiker in the mathematical world.

Re: Automated formatting

2018-11-26 Thread Cody Koeninger
That seems like a good first step.

Opened a PR / jira ticket with that approach at

https://github.com/apache/spark/pull/23148

If anyone tests this and finds a file that doesn't format well (e.g.
fails scalastyle afterwards) just let me know, happy to tweak scalafmt
config options.
On Thu, Nov 22, 2018 at 7:32 PM Matei Zaharia  wrote:
>
> Can we start by just recommending to contributors that they do this manually? 
> Then if it seems to work fine, we can try to automate it.
>
> > On Nov 22, 2018, at 4:40 PM, Cody Koeninger  wrote:
> >
> > I believe scalafmt only works on scala sources.  There are a few
> > plugins for formatting java sources, but I'm less familiar with them.
> > On Thu, Nov 22, 2018 at 11:39 AM Mridul Muralidharan  
> > wrote:
> >>
> >> Is this handling only scala or java as well ?
> >>
> >> Regards,
> >> Mridul
> >>
> >> On Thu, Nov 22, 2018 at 9:11 AM Cody Koeninger  wrote:
> >>>
> >>> Plugin invocation is ./build/mvn mvn-scalafmt_2.12:format
> >>>
> >>> It takes about 5 seconds, and errors out on the first different file
> >>> that doesn't match formatting.
> >>>
> >>> I made a shell wrapper so that contributors can just run
> >>>
> >>> ./dev/scalafmt
> >>>
> >>> to actually format in place the files that have changed (or pass
> >>> through commandline args if they want to do something different)
> >>>
> >>> On Wed, Nov 21, 2018 at 3:36 PM Sean Owen  wrote:
> >>>>
> >>>> I know the PR builder runs SBT, but I presume this would just be a
> >>>> separate mvn job that runs. If it doesn't take long and only checks
> >>>> the right diff, seems worth a shot. What's the invocation that Shane
> >>>> could add (after this change goes in)
> >>>> On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger  
> >>>> wrote:
> >>>>>
> >>>>> There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
> >>>>> should be runnable from the PR builder
> >>>>>
> >>>>> Super basic example with a minimal config that's close to current
> >>>>> style guide here:
> >>>>>
> >>>>> https://github.com/apache/spark/compare/master...koeninger:scalafmt
> >>>>>
> >>>>> I imagine tracking down the corner cases in the config, especially
> >>>>> around interactions with scalastyle, may take a bit of work.  Happy to
> >>>>> do it, but not if there's significant concern about style related
> >>>>> changes in PRs.
> >>>>> On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
> >>>>>>
> >>>>>> Yeah fair, maybe mostly consistent in broad strokes but not in the 
> >>>>>> details.
> >>>>>> Is this something that can be just run in the PR builder? if the rules
> >>>>>> are simple and not too hard to maintain, seems like a win.
> >>>>>> On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger  
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Definitely not suggesting a mass reformat, just on a per-PR basis.
> >>>>>>>
> >>>>>>> scalafmt --diff  will reformat only the files that differ from git 
> >>>>>>> head
> >>>>>>> scalafmt --test --diff won't modify files, just throw an exception if
> >>>>>>> they don't match format
> >>>>>>>
> >>>>>>> I don't think code is consistently formatted now.
> >>>>>>> I tried scalafmt on the most recent PR I looked at, and it caught
> >>>>>>> stuff as basic as newlines before curly brace in existing code.
> >>>>>>> I've had different reviewers for PRs that were literal backports or
> >>>>>>> cut & paste of each other come up with different formatting nits.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
> >>>>>>>>
> >>>>>>>> I think reformatting the whole code base might be too much. If there
> >>>>>>>> are some more targeted cleanups, sure. We do have some links to style
> >>>>>>>> guides buried somewhere in the docs, although the conv

Re: Automated formatting

2018-11-22 Thread Cody Koeninger
I believe scalafmt only works on scala sources.  There are a few
plugins for formatting java sources, but I'm less familiar with them.
On Thu, Nov 22, 2018 at 11:39 AM Mridul Muralidharan  wrote:
>
> Is this handling only scala or java as well ?
>
> Regards,
> Mridul
>
> On Thu, Nov 22, 2018 at 9:11 AM Cody Koeninger  wrote:
>>
>> Plugin invocation is ./build/mvn mvn-scalafmt_2.12:format
>>
>> It takes about 5 seconds, and errors out on the first different file
>> that doesn't match formatting.
>>
>> I made a shell wrapper so that contributors can just run
>>
>> ./dev/scalafmt
>>
>> to actually format in place the files that have changed (or pass
>> through commandline args if they want to do something different)
>>
>> On Wed, Nov 21, 2018 at 3:36 PM Sean Owen  wrote:
>> >
>> > I know the PR builder runs SBT, but I presume this would just be a
>> > separate mvn job that runs. If it doesn't take long and only checks
>> > the right diff, seems worth a shot. What's the invocation that Shane
>> > could add (after this change goes in)
>> > On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger  wrote:
>> > >
>> > > There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
>> > > should be runnable from the PR builder
>> > >
>> > > Super basic example with a minimal config that's close to current
>> > > style guide here:
>> > >
>> > > https://github.com/apache/spark/compare/master...koeninger:scalafmt
>> > >
>> > > I imagine tracking down the corner cases in the config, especially
>> > > around interactions with scalastyle, may take a bit of work.  Happy to
>> > > do it, but not if there's significant concern about style related
>> > > changes in PRs.
>> > > On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
>> > > >
>> > > > Yeah fair, maybe mostly consistent in broad strokes but not in the 
>> > > > details.
>> > > > Is this something that can be just run in the PR builder? if the rules
>> > > > are simple and not too hard to maintain, seems like a win.
>> > > > On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger  
>> > > > wrote:
>> > > > >
>> > > > > Definitely not suggesting a mass reformat, just on a per-PR basis.
>> > > > >
>> > > > > scalafmt --diff  will reformat only the files that differ from git 
>> > > > > head
>> > > > > scalafmt --test --diff won't modify files, just throw an exception if
>> > > > > they don't match format
>> > > > >
>> > > > > I don't think code is consistently formatted now.
>> > > > > I tried scalafmt on the most recent PR I looked at, and it caught
>> > > > > stuff as basic as newlines before curly brace in existing code.
>> > > > > I've had different reviewers for PRs that were literal backports or
>> > > > > cut & paste of each other come up with different formatting nits.
>> > > > >
>> > > > >
>> > > > > On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
>> > > > > >
>> > > > > > I think reformatting the whole code base might be too much. If 
>> > > > > > there
>> > > > > > are some more targeted cleanups, sure. We do have some links to 
>> > > > > > style
>> > > > > > guides buried somewhere in the docs, although the conventions are
>> > > > > > pretty industry standard.
>> > > > > >
>> > > > > > I *think* the code is pretty consistently formatted now, and would
>> > > > > > expect contributors to follow formatting they see, so ideally the
>> > > > > > surrounding code alone is enough to give people guidance. In 
>> > > > > > practice,
>> > > > > > we're always going to have people format differently no matter 
>> > > > > > what I
>> > > > > > think so it's inevitable.
>> > > > > >
>> > > > > > Is there a way to just check style on PR changes? that's fine.
>> > > > > > On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger 
>> > > > > >  wrote:
>> > > > > > >
>> > > > > > > Is there any appetite for revisiting automating formatting?
>> > > > > > >
>> > > > > > > I know over the years various people have expressed opposition 
>> > > > > > > to it
>> > > > > > > as unnecessary churn in diffs, but having every new contributor
>> > > > > > > greeted with "nit: 4 space indentation for argument lists" isn't 
>> > > > > > > very
>> > > > > > > welcoming.
>> > > > > > >
>> > > > > > > -
>> > > > > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > > > > > >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Automated formatting

2018-11-22 Thread Cody Koeninger
Plugin invocation is ./build/mvn mvn-scalafmt_2.12:format

It takes about 5 seconds, and errors out on the first different file
that doesn't match formatting.

I made a shell wrapper so that contributors can just run

./dev/scalafmt

to actually format in place the files that have changed (or pass
through commandline args if they want to do something different)

On Wed, Nov 21, 2018 at 3:36 PM Sean Owen  wrote:
>
> I know the PR builder runs SBT, but I presume this would just be a
> separate mvn job that runs. If it doesn't take long and only checks
> the right diff, seems worth a shot. What's the invocation that Shane
> could add (after this change goes in)
> On Wed, Nov 21, 2018 at 3:27 PM Cody Koeninger  wrote:
> >
> > There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
> > should be runnable from the PR builder
> >
> > Super basic example with a minimal config that's close to current
> > style guide here:
> >
> > https://github.com/apache/spark/compare/master...koeninger:scalafmt
> >
> > I imagine tracking down the corner cases in the config, especially
> > around interactions with scalastyle, may take a bit of work.  Happy to
> > do it, but not if there's significant concern about style related
> > changes in PRs.
> > On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
> > >
> > > Yeah fair, maybe mostly consistent in broad strokes but not in the 
> > > details.
> > > Is this something that can be just run in the PR builder? if the rules
> > > are simple and not too hard to maintain, seems like a win.
> > > On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger  wrote:
> > > >
> > > > Definitely not suggesting a mass reformat, just on a per-PR basis.
> > > >
> > > > scalafmt --diff  will reformat only the files that differ from git head
> > > > scalafmt --test --diff won't modify files, just throw an exception if
> > > > they don't match format
> > > >
> > > > I don't think code is consistently formatted now.
> > > > I tried scalafmt on the most recent PR I looked at, and it caught
> > > > stuff as basic as newlines before curly brace in existing code.
> > > > I've had different reviewers for PRs that were literal backports or
> > > > cut & paste of each other come up with different formatting nits.
> > > >
> > > >
> > > > On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
> > > > >
> > > > > I think reformatting the whole code base might be too much. If there
> > > > > are some more targeted cleanups, sure. We do have some links to style
> > > > > guides buried somewhere in the docs, although the conventions are
> > > > > pretty industry standard.
> > > > >
> > > > > I *think* the code is pretty consistently formatted now, and would
> > > > > expect contributors to follow formatting they see, so ideally the
> > > > > surrounding code alone is enough to give people guidance. In practice,
> > > > > we're always going to have people format differently no matter what I
> > > > > think so it's inevitable.
> > > > >
> > > > > Is there a way to just check style on PR changes? that's fine.
> > > > > On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger  
> > > > > wrote:
> > > > > >
> > > > > > Is there any appetite for revisiting automating formatting?
> > > > > >
> > > > > > I know over the years various people have expressed opposition to it
> > > > > > as unnecessary churn in diffs, but having every new contributor
> > > > > > greeted with "nit: 4 space indentation for argument lists" isn't 
> > > > > > very
> > > > > > welcoming.
> > > > > >
> > > > > > -
> > > > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > > > >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Automated formatting

2018-11-21 Thread Cody Koeninger
There's a mvn plugin (sbt as well, but it requires sbt 1.0+) so it
should be runnable from the PR builder

Super basic example with a minimal config that's close to current
style guide here:

https://github.com/apache/spark/compare/master...koeninger:scalafmt

I imagine tracking down the corner cases in the config, especially
around interactions with scalastyle, may take a bit of work.  Happy to
do it, but not if there's significant concern about style related
changes in PRs.
On Wed, Nov 21, 2018 at 2:42 PM Sean Owen  wrote:
>
> Yeah fair, maybe mostly consistent in broad strokes but not in the details.
> Is this something that can be just run in the PR builder? if the rules
> are simple and not too hard to maintain, seems like a win.
> On Wed, Nov 21, 2018 at 2:26 PM Cody Koeninger  wrote:
> >
> > Definitely not suggesting a mass reformat, just on a per-PR basis.
> >
> > scalafmt --diff  will reformat only the files that differ from git head
> > scalafmt --test --diff won't modify files, just throw an exception if
> > they don't match format
> >
> > I don't think code is consistently formatted now.
> > I tried scalafmt on the most recent PR I looked at, and it caught
> > stuff as basic as newlines before curly brace in existing code.
> > I've had different reviewers for PRs that were literal backports or
> > cut & paste of each other come up with different formatting nits.
> >
> >
> > On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
> > >
> > > I think reformatting the whole code base might be too much. If there
> > > are some more targeted cleanups, sure. We do have some links to style
> > > guides buried somewhere in the docs, although the conventions are
> > > pretty industry standard.
> > >
> > > I *think* the code is pretty consistently formatted now, and would
> > > expect contributors to follow formatting they see, so ideally the
> > > surrounding code alone is enough to give people guidance. In practice,
> > > we're always going to have people format differently no matter what I
> > > think so it's inevitable.
> > >
> > > Is there a way to just check style on PR changes? that's fine.
> > > On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger  
> > > wrote:
> > > >
> > > > Is there any appetite for revisiting automating formatting?
> > > >
> > > > I know over the years various people have expressed opposition to it
> > > > as unnecessary churn in diffs, but having every new contributor
> > > > greeted with "nit: 4 space indentation for argument lists" isn't very
> > > > welcoming.
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Automated formatting

2018-11-21 Thread Cody Koeninger
Definitely not suggesting a mass reformat, just on a per-PR basis.

scalafmt --diff  will reformat only the files that differ from git head
scalafmt --test --diff won't modify files, just throw an exception if
they don't match format

I don't think code is consistently formatted now.
I tried scalafmt on the most recent PR I looked at, and it caught
stuff as basic as newlines before curly brace in existing code.
I've had different reviewers for PRs that were literal backports or
cut & paste of each other come up with different formatting nits.


On Wed, Nov 21, 2018 at 12:03 PM Sean Owen  wrote:
>
> I think reformatting the whole code base might be too much. If there
> are some more targeted cleanups, sure. We do have some links to style
> guides buried somewhere in the docs, although the conventions are
> pretty industry standard.
>
> I *think* the code is pretty consistently formatted now, and would
> expect contributors to follow formatting they see, so ideally the
> surrounding code alone is enough to give people guidance. In practice,
> we're always going to have people format differently no matter what I
> think so it's inevitable.
>
> Is there a way to just check style on PR changes? that's fine.
> On Wed, Nov 21, 2018 at 11:40 AM Cody Koeninger  wrote:
> >
> > Is there any appetite for revisiting automating formatting?
> >
> > I know over the years various people have expressed opposition to it
> > as unnecessary churn in diffs, but having every new contributor
> > greeted with "nit: 4 space indentation for argument lists" isn't very
> > welcoming.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Automated formatting

2018-11-21 Thread Cody Koeninger
Is there any appetite for revisiting automating formatting?

I know over the years various people have expressed opposition to it
as unnecessary churn in diffs, but having every new contributor
greeted with "nit: 4 space indentation for argument lists" isn't very
welcoming.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Structured Streaming] Kafka group.id is fixed

2018-11-19 Thread Cody Koeninger
Anastasios, it looks like you already identified the two lines that
need to change, the string interpolation that depends on
UUID.randomUUID and metadataPath.hashCode.

I'd factor that out into a function that returns the group id.  That
function would also need to take the "parameters" variable (the map of
user-provided options) and look for a prefix for the group id,
defaulting to the current behavior.

If you have questions, feel free to ping me on the jira, or get as far
as you can and submit a PR for more discussion.
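
A minimal sketch of what that helper might look like (the option name "groupidprefix" is just a placeholder, not an existing Spark option; the real change would live in KafkaSourceProvider where "parameters" and the metadata path are already in scope):

import java.util.UUID

def streamingUniqueGroupId(
    parameters: Map[String, String],
    metadataPath: String): String = {
  // hypothetical option name; defaulting keeps the current behavior
  val prefix = parameters.getOrElse("groupidprefix", "spark-kafka-source")
  s"$prefix-${UUID.randomUUID}-${metadataPath.hashCode}"
}
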
On Mon, Nov 19, 2018 at 2:38 PM Anastasios Zouzias  wrote:
>
> Hi Tom,
>
> I initiated an issue here: https://issues.apache.org/jira/browse/SPARK-26121
>
> Feel free to edit/update the ticket. If someone familiar with the codebase 
> has any suggestion on the proper way of fixing this, I could work on it.
>
> Best,
> Anastasios
>
> On Mon, Nov 19, 2018 at 4:31 PM Tom Graves  wrote:
>>
>> This makes sense to me and was going to propose something similar in order 
>> to be able to use the kafka acls more effectively as well, can you file a 
>> jira for it?
>>
>> Tom
>>
>> On Friday, November 9, 2018, 2:26:12 AM CST, Anastasios Zouzias 
>>  wrote:
>>
>>
>> Hi all,
>>
>> I ran into the following situation with Spark Structured Streaming (SS) using
>> Kafka.
>>
>> In a project that I work on, there is already a secured Kafka setup where 
>> ops can issue an SSL certificate per "group.id", which should be predefined 
>> (or hopefully its prefix to be predefined).
>>
>> On the other hand, Spark SS fixes the group.id to
>>
>> val uniqueGroupId = 
>> s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
>>
>> see, i.e.,
>>
>> https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124
>>
>> I guess Spark developers had a good reason to fix it, but is it possible to 
>> make configurable the prefix of the above uniqueGroupId 
>> ("spark-kafka-source")? If so, I could prepare a PR on it.
>>
>> The rationale is that we do not want all Spark jobs to use the same
>> certificate for group ids of the form spark-kafka-source-*.
>>
>>
>> Best regards,
>> Anastasios Zouzias
>
>
>
> --
> -- Anastasios Zouzias

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: DataSourceV2 sync tomorrow

2018-11-13 Thread Cody Koeninger
Am I the only one for whom the livestream link didn't work last time?
Would like to be able to at least watch the discussion this time
around.
On Tue, Nov 13, 2018 at 6:01 PM Ryan Blue  wrote:
>
> Hi everyone,
> I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at 
> 17:00 PST, which is 01:00 UTC.
>
> Here are some of the topics under discussion in the last couple of weeks:
>
> Read API for v2 - see Wenchen’s doc
> Capabilities API - see the dev list thread
> Using CatalogTableIdentifier to reliably separate v2 code paths - see PR 
> #21978
> A replacement for InternalRow
>
> I know that a lot of people are also interested in combining the source API 
> for micro-batch and continuous streaming. Wenchen and I have been discussing 
> a way to do that and Wenchen has added it to the Read API doc as Alternative 
> #2. I think this would be a good thing to plan on discussing.
>
> rb
>
> Here’s some additional background on combining micro-batch and continuous 
> APIs:
>
> The basic idea is to update how tasks end so that the same tasks can be used 
> in micro-batch or streaming. For tasks that are naturally limited like data 
> files, when the data is exhausted, Spark stops reading. For tasks that are 
> not limited, like a Kafka partition, Spark decides when to stop in 
> micro-batch mode by hitting a pre-determined LocalOffset or Spark can just 
> keep running in continuous mode.
>
> Note that a task deciding to stop can happen in both modes, either when a 
> task is exhausted in micro-batch or when a stream needs to be reconfigured in 
> continuous.
>
> Here’s the task reader API. The offset returned is optional so that a task 
> can avoid stopping if there isn’t a resumeable offset, like if it is in the 
> middle of an input file:
>
> interface StreamPartitionReader<T> extends InputPartitionReader<T> {
>   Optional<Offset> currentOffset();
>   boolean next()  // from InputPartitionReader
>   T get()         // from InputPartitionReader
> }
>
> The streaming code would look something like this:
>
> Stream stream = scan.toStream()
> StreamReaderFactory factory = stream.createReaderFactory()
>
> while (true) {
>   Offset start = stream.currentOffset()
>   Offset end = if (isContinuousMode) {
> None
>   } else {
> // rate limiting would happen here
> Some(stream.latestOffset())
>   }
>
>   InputPartition[] parts = stream.planInputPartitions(start)
>
>   // returns when needsReconfiguration is true or all tasks finish
>   runTasks(parts, factory, end)
>
>   // the stream's current offset has been updated at the last epoch
> }
>
> --
> Ryan Blue
> Software Engineer
> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Structured Streaming] Kafka group.id is fixed

2018-11-09 Thread Cody Koeninger
That sounds reasonable to me
On Fri, Nov 9, 2018 at 2:26 AM Anastasios Zouzias  wrote:
>
> Hi all,
>
> I ran into the following situation with Spark Structured Streaming (SS) using
> Kafka.
>
> In a project that I work on, there is already a secured Kafka setup where ops 
> can issue an SSL certificate per "group.id", which should be predefined (or 
> hopefully its prefix to be predefined).
>
> On the other hand, Spark SS fixes the group.id to
>
> val uniqueGroupId = 
> s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
>
> see, i.e.,
>
> https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124
>
> I guess Spark developers had a good reason to fix it, but is it possible to 
> make configurable the prefix of the above uniqueGroupId 
> ("spark-kafka-source")? If so, I could prepare a PR on it.
>
> The rationale is that we do not want all Spark jobs to use the same
> certificate for group ids of the form spark-kafka-source-*.
>
>
> Best regards,
> Anastasios Zouzias

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Nightly Builds in the docs (in spark-nightly/spark-master-bin/latest? Can't seem to find it)

2018-08-31 Thread Cody Koeninger
Just got a question about this on the user list as well.

Worth removing that link to pwendell's directory from the docs?

On Sun, Jan 21, 2018 at 12:13 PM, Jacek Laskowski  wrote:
> Hi,
>
> http://spark.apache.org/developer-tools.html#nightly-builds reads:
>
>> Spark nightly packages are available at:
>> Latest master build:
>> https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest
>
> but the URL gives 404. Is this intended?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-08-31 Thread Cody Koeninger
+1 to Sean's comment

On Fri, Aug 31, 2018 at 2:48 PM, Reynold Xin  wrote:
> Yup all good points. One way I've done it in the past is to have an appendix
> section for design sketch, as an expansion to the question "- What is new in
> your approach and why do you think it will be successful?"
>
> On Fri, Aug 31, 2018 at 12:47 PM Marcelo Vanzin
>  wrote:
>>
>> I like the questions (aside maybe from the cost one which perhaps does
>> not matter much here), especially since they encourage explaining
>> things in a more plain language than generally used by specs.
>>
>> But I don't think we can ignore design aspects; it's been my
>> observation that a good portion of SPIPs, when proposed, already have
>> at the very least some sort of implementation (even if it's a barely
>> working p.o.c.), so it would also be good to have that information up
>> front if it's available.
>>
>> (So I guess I'm just repeating Sean's reply.)
>>
>> On Fri, Aug 31, 2018 at 11:23 AM Reynold Xin  wrote:
>> >
>> > I helped craft the current SPIP template last year. I was recently
>> > (re-)introduced to the Heilmeier Catechism, a set of questions DARPA
>> > developed to evaluate proposals. The set of questions are:
>> >
>> > - What are you trying to do? Articulate your objectives using absolutely
>> > no jargon.
>> > - How is it done today, and what are the limits of current practice?
>> > - What is new in your approach and why do you think it will be
>> > successful?
>> > - Who cares? If you are successful, what difference will it make?
>> > - What are the risks?
>> > - How much will it cost?
>> > - How long will it take?
>> > - What are the mid-term and final “exams” to check for success?
>> >
>> > When I read the above list, it resonates really well because they are
>> > almost always the same set of questions I ask myself and others before I
>> > decide whether something is worth doing. In some ways, our SPIP template
>> > tries to capture some of these (e.g. target persona), but are not as
>> > explicit and well articulated.
>> >
>> > What do people think about replacing the current SPIP template with the
>> > above?
>> >
>> > At a high level, I think the Heilmeier's Catechism emphasizes less about
>> > the "how", and more the "why" and "what", which is what I'd argue SPIPs
>> > should be about. The hows should be left in design docs for larger 
>> > projects.
>> >
>> >
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Migrating from kafka08 client to kafka010

2018-08-02 Thread Cody Koeninger
Short answer is it isn't necessary.

Long answer is that you aren't just changing from 08 to 10, you're
changing from the receiver based implementation to the direct stream.
Read these:

https://github.com/koeninger/kafka-exactly-once
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html#approach-2-direct-approach-no-receivers
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

On Thu, Aug 2, 2018 at 2:02 AM, sandeep_katta
 wrote:
> Hi All,
>
> Recently I started migrating the code from kafka08 to kafka010.
>
> in 08  *topics * argument takes care of consuming number of partitions for
> each topic.
>
>   def createStream(
>   ssc: StreamingContext,
>   zkQuorum: String,
>   groupId: String,
>   topics: Map[String, Int],
>   storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
> ): ReceiverInputDStream[(String, String)]
>
>
> How to pass this configuration w.r.t kafka010 ?
>
>  Sample code w.r.t. kafka010 below; I find no way or API to set this parameter:
>
>  val kafkaParams = Map[String, Object]("group.id" -> groupId,
> "bootstrap.servers" -> bootstrapServer,
> "value.deserializer" -> classOf[StringDeserializer],
> "key.deserializer" -> classOf[StringDeserializer])
> val messages = KafkaUtils.createDirectStream[String, String](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, String](topicArr.toSet,
> kafkaParams))
>
> Regards
> Sandeep Katta
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-17 Thread Cody Koeninger
According to

http://spark.apache.org/improvement-proposals.html

the shepherd should be a PMC member, not necessarily the person who
proposed the SPIP

On Tue, Jul 17, 2018 at 9:13 AM, Wenchen Fan  wrote:
> I don't know an official answer, but conventionally people who propose the
> SPIP would call the vote and "shepherd" the project. Other people can jump
> in during the development. I'm interested in the new API and like to work on
> it after the vote passes.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 13, 2018 at 7:25 AM Ryan Blue  wrote:
>>
>> Thanks! I'm all for calling a vote on the SPIP. If I understand the
>> process correctly, the intent is for a "shepherd" to do it. I'm happy to
>> call a vote, or feel free if you'd like to play that role.
>>
>> Other comments:
>> * DeleteData API: I completely agree that we need to have a proposal for
>> it. I think the SQL side is easier because DELETE FROM is already a
>> statement. We just need to be able to identify v2 tables to use it. I'll
>> come up with something and send a proposal to the dev list.
>> * Table create/drop/alter/load API: I think we have agreement around the
>> proposed DataSourceV2 API, but we need to decide how the public API will
>> work and how this will fit in with ExternalCatalog (see the other thread for
>> discussion there). Do you think we need to get that entire SPIP approved
>> before we can start getting the API in? If so, what do you think needs to be
>> decided to get it ready?
>>
>> Thanks!
>>
>> rb
>>
>> On Wed, Jul 11, 2018 at 8:24 PM Wenchen Fan  wrote:
>>>
>>> Hi Ryan,
>>>
>>> Great job on this! Shall we call a vote for the plan standardization
>>> SPIP? I think this is a good idea and we should do it.
>>>
>>> Notes:
>>> We definitely need new user-facing APIs to produce these new logical
>>> plans like DeleteData. But we need a design doc for these new APIs after the
>>> SPIP passed.
>>> We definitely need the data source to provide the ability to
>>> create/drop/alter/lookup tables, but that belongs to the other SPIP and
>>> should be voted separately.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue 
>>> wrote:

 Hi everyone,

 A few weeks ago, I wrote up a proposal to standardize SQL logical plans
 and a supporting design doc for data source catalog APIs. From the comments
 on those docs, it looks like we mostly have agreement around standardizing
 plans and around the data source catalog API.

 We still need to work out details, like the transactional API extension,
 but I'd like to get started implementing those proposals so we have
 something working for the 2.4.0 release. I'm starting this thread because I
 think we're about ready to vote on the proposal and I'd like to get any
 remaining discussion going or get anyone that missed this to read through
 the docs.

 Thanks!

 rb

 --
 Ryan Blue
 Software Engineer
 Netflix
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for 2.3.1?

2018-05-11 Thread Cody Koeninger
Sounds good, I'd like to add SPARK-24067 today assuming there's no objections

On Thu, May 10, 2018 at 1:22 PM, Henry Robinson  wrote:
> +1, I'd like to get a release out with SPARK-23852 fixed. The Parquet
> community are about to release 1.8.3 - the voting period closes tomorrow -
> and I've tested it with Spark 2.3 and confirmed the bug is fixed. Hopefully
> it is released and I can post the version change to branch-2.3 before you
> start to roll the RC this weekend.
>
> Henry
>
> On 10 May 2018 at 11:09, Marcelo Vanzin  wrote:
>>
>> Hello all,
>>
>> It's been a while since we shipped 2.3.0 and lots of important bug
>> fixes have gone into the branch since then. I took a look at Jira and
>> it seems there's not a lot of things explicitly targeted at 2.3.1 -
>> the only potential blocker (a parquet issue) is being worked on since
>> a new parquet with the fix was just released.
>>
>> So I'd like to propose to release 2.3.1 soon. If there are important
>> fixes that should go into the release, please let those be known (by
>> replying here or updating the bug in Jira), otherwise I'm volunteering
>> to prepare the first RC soon-ish (around the weekend).
>>
>> Thanks!
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Process for backports?

2018-04-24 Thread Cody Koeninger
 https://issues.apache.org/jira/browse/SPARK-24067

is asking to backport a change to the 2.3 branch.

My questions

- In general are there any concerns about what qualifies for backporting?
This adds a configuration variable but shouldn't change default behavior.

- Is a separate jira + pr actually necessary?
Seems like the merge_spark_pr.py script is set up to handle cherry
picking the original merged PR in a case like this.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Cody Koeninger
Congrats!

On Mon, Apr 2, 2018 at 12:28 AM, Wenchen Fan  wrote:
> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor of the CBO project, and has been
> contributing across several areas of Spark for a while, focusing especially
> on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>
> Wenchen

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming some new committers

2018-03-02 Thread Cody Koeninger
Congrats to the new committers, and I appreciate the vote of confidence.

On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi everyone,
>
> The Spark PMC has recently voted to add several new committers to the 
> project, based on their contributions to Spark 2.3 and other past work:
>
> - Anirudh Ramanathan (contributor to Kubernetes support)
> - Bryan Cutler (contributor to PySpark and Arrow support)
> - Cody Koeninger (contributor to streaming and Kafka support)
> - Erik Erlandson (contributor to Kubernetes support)
> - Matt Cheah (contributor to Kubernetes support and other parts of Spark)
> - Seth Hendrickson (contributor to MLlib and PySpark)
>
> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as 
> committers!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Cody Koeninger
Was there any answer to my question around the effect of changes to
the sink api regarding access to underlying offsets?

On Wed, Nov 1, 2017 at 11:32 AM, Reynold Xin  wrote:
> Most of those should be answered by the attached design sketch in the JIRA
> ticket.
>
> On Wed, Nov 1, 2017 at 5:29 PM Debasish Das 
> wrote:
>>
>> +1
>>
>> Is there any design doc related to API/internal changes ? Will CP be the
>> default in structured streaming or it's a mode in conjunction with exisiting
>> behavior.
>>
>> Thanks.
>> Deb
>>
>> On Nov 1, 2017 8:37 AM, "Reynold Xin"  wrote:
>>
>> Earlier I sent out a discussion thread for CP in Structured Streaming:
>>
>> https://issues.apache.org/jira/browse/SPARK-20928
>>
>> It is meant to be a very small, surgical change to Structured Streaming to
>> enable ultra-low latency. This is great timing because we are also designing
>> and implementing data source API v2. If designed properly, we can have the
>> same data source API working for both streaming and batch.
>>
>>
>> Following the SPIP process, I'm putting this SPIP up for a vote.
>>
>> +1: Let's go ahead and design / implement the SPIP.
>> +0: Don't really care.
>> -1: I do not think this is a good idea for the following reasons.
>>
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Kafka API tries connecting to dead node for every batch, which increases the processing time

2017-10-16 Thread Cody Koeninger
Have you tried the 0.10 integration?

I'm not sure how you would know whether a broker is up or down without
attempting to connect to it.  Do you have an alternative suggestion?
Not sure how much interest there is in patches to the 0.8 integration
at this point.
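
For reference, a minimal sketch of the 0.10 direct stream (spark-streaming-kafka-0-10); the broker list, topic, and group id below are placeholders, and ssc is an existing StreamingContext:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:21005,broker2:21005",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest")

// one RDD partition per Kafka topic-partition; no receiver threads to configure
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Set("my_topic"), kafkaParams))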



On Mon, Oct 16, 2017 at 9:23 AM, Suprith T Jain <t.supr...@gmail.com> wrote:
> Yes I tried that. But it's not that effective.
>
> In fact kafka SimpleConsumer tries to reconnect in case of socket error
> (sendRequest method). So it'll always be twice the timeout for every window
> and for every node that is down.
>
>
> On 16-Oct-2017 7:34 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
>>
>> Have you tried adjusting the timeout?
>>
>> On Mon, Oct 16, 2017 at 8:08 AM, Suprith T Jain <t.supr...@gmail.com>
>> wrote:
>> > Hi guys,
>> >
>> > I have a 3 node cluster and i am running a spark streaming job. consider
>> > the
>> > below example
>> >
>> > /*spark-submit* --master yarn-cluster --class
>> > com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint --jars
>> >
>> > /opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar
>> > /opt/SparkStreamingExample-1.0.jar  /tmp/test 10 test
>> > 189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005/
>> >
>> > In this case, suppose node 10.1.1.1 is down. Then for every window
>> > batch,
>> > spark tries to send a request  to all the nodes.
>> > This code is in the class org.apache.spark.streaming.kafka.KafkaCluster
>> >
>> > Function : getPartitionMetadata()
>> > Line : val resp: TopicMetadataResponse = consumer.send(req)
>> >
>> > The function getPartitionMetadata() is called from getPartitions() and
>> > findLeaders() which gets called for every batch.
>> >
> >> > Hence, if the node is down, the connection fails and it waits for the
> >> > timeout to happen before continuing, which adds to the processing time.
>> >
>> > Question :
>> > Is there any way to avoid this ?
>> > In simple words, i do not want spark to send request to the node that is
>> > down for every batch. How can i achieve this ?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Kafka API tries connecting to dead node for every batch, which increases the processing time

2017-10-16 Thread Cody Koeninger
Have you tried adjusting the timeout?

On Mon, Oct 16, 2017 at 8:08 AM, Suprith T Jain  wrote:
> Hi guys,
>
> I have a 3 node cluster and i am running a spark streaming job. consider the
> below example
>
> /*spark-submit* --master yarn-cluster --class
> com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint --jars
> /opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar
> /opt/SparkStreamingExample-1.0.jar  /tmp/test 10 test
> 189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005/
>
> In this case, suppose node 10.1.1.1 is down. Then for every window batch,
> spark tries to send a request  to all the nodes.
> This code is in the class org.apache.spark.streaming.kafka.KafkaCluster
>
> Function : getPartitionMetadata()
> Line : val resp: TopicMetadataResponse = consumer.send(req)
>
> The function getPartitionMetadata() is called from getPartitions() and
> findLeaders() which gets called for every batch.
>
> Hence, if the node is down, the connection fails and it waits for the
> timeout to happen before continuing, which adds to the processing time.
>
> Question :
> Is there any way to avoid this ?
> In simple words, i do not want spark to send request to the node that is
> down for every batch. How can i achieve this ?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Easy way to get offset metatada with Spark Streaming API

2017-09-11 Thread Cody Koeninger
https://issues-test.apache.org/jira/browse/SPARK-18258

On Mon, Sep 11, 2017 at 7:15 AM, Dmitry Naumenko  wrote:
> Hi all,
>
> It started as a discussion in
> https://stackoverflow.com/questions/46153105/how-to-get-kafka-offsets-with-spark-structured-streaming-api.
>
> So the problem is that there is no support in the public API to obtain the Kafka
> (or Kinesis) offsets. For example, if you want to save offsets to external
> storage in a custom sink, you have to:
> 1) preserve topic, partition and offset across all transform operations of
> the Dataset (based on the hard-coded Kafka schema)
> 2) do a manual group-by on partition/offset with an aggregate max offset
>
> The Structured Streaming doc says "Every streaming source is assumed to have
> offsets", so why isn't it part of the public API? What do you think about
> supporting it?
>
> Dmitry
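
A rough sketch of the interim workaround described above, assuming a SparkSession named spark and placeholder broker/topic names (the topic, partition and offset columns come from the Kafka source's fixed schema):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.max

// 1) carry the Kafka metadata columns along with the payload through all transforms
val withMeta = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "topic", "partition", "offset")

// 2) e.g. inside a custom sink, aggregate the max offset per topic/partition
//    so it can be persisted to external storage
def maxOffsets(batch: DataFrame): DataFrame =
  batch.groupBy("topic", "partition").agg(max("offset").as("maxOffset"))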

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Putting Kafka 0.8 behind an (opt-in) profile

2017-09-06 Thread Cody Koeninger
I kind of doubt the kafka 0.10 integration is going to change much at
all before the upgrade to 0.11

On Wed, Sep 6, 2017 at 8:57 AM, Sean Owen <so...@cloudera.com> wrote:
> Thanks, I can do that. We're then in the funny position of having one
> deprecated Kafka API, and one experimental one.
>
> Is the Kafka 0.10 integration as stable as it is going to be, and worth
> marking as such for 2.3.0?
>
>
> On Tue, Sep 5, 2017 at 4:12 PM Cody Koeninger <c...@koeninger.org> wrote:
>>
>> +1 to going ahead and giving a deprecation warning now
>>
>> On Tue, Sep 5, 2017 at 6:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> > On the road to Scala 2.12, we'll need to make Kafka 0.8 support optional
>> > in
>> > the build, because it is not available for Scala 2.12.
>> >
>> > https://github.com/apache/spark/pull/19134  adds that profile. I mention
>> > it
>> > because this means that Kafka 0.8 becomes "opt-in" and has to be
>> > explicitly
>> > enabled, and that may have implications for downstream builds.
>> >
>> > Yes, we can add <activeByDefault>true</activeByDefault>. It however only
>> > has
>> > effect when no other profiles are set, which makes it more deceptive
>> > than
>> > useful IMHO. (We don't use it otherwise.)
>> >
>> > Reviewers may want to check my work especially as regards the Python
>> > test
>> > support and SBT build.
>> >
>> >
>> > Another related question is: when is 0.8 support deprecated, removed? It
>> > seems sudden to remove it in 2.3.0. Maybe deprecation is in order. The
>> > driver is that Kafka 0.11 and 1.0 will possibly require yet another
>> > variant
>> > of streaming support (not sure yet), and 3 versions is too many.
>> > Deprecating
>> > now opens more options sooner.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Putting Kafka 0.8 behind an (opt-in) profile

2017-09-05 Thread Cody Koeninger
+1 to going ahead and giving a deprecation warning now

On Tue, Sep 5, 2017 at 6:39 AM, Sean Owen  wrote:
> On the road to Scala 2.12, we'll need to make Kafka 0.8 support optional in
> the build, because it is not available for Scala 2.12.
>
> https://github.com/apache/spark/pull/19134  adds that profile. I mention it
> because this means that Kafka 0.8 becomes "opt-in" and has to be explicitly
> enabled, and that may have implications for downstream builds.
>
> Yes, we can add <activeByDefault>true</activeByDefault>. It however only has
> effect when no other profiles are set, which makes it more deceptive than
> useful IMHO. (We don't use it otherwise.)
>
> Reviewers may want to check my work especially as regards the Python test
> support and SBT build.
>
>
> Another related question is: when is 0.8 support deprecated, removed? It
> seems sudden to remove it in 2.3.0. Maybe deprecation is in order. The
> driver is that Kafka 0.11 and 1.0 will possibly require yet another variant
> of streaming support (not sure yet), and 3 versions is too many. Deprecating
> now opens more options sooner.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark streaming Kafka 0.11 integration

2017-09-05 Thread Cody Koeninger
Here's the jira for upgrading to a 0.10.x point release, which is
effectively the discussion of upgrading to 0.11 now

https://issues.apache.org/jira/browse/SPARK-18057

On Tue, Sep 5, 2017 at 1:27 AM, matus.cimerman  wrote:
> Hi guys,
>
> are there any plans to support Kafka 0.11 integration for Spark Streaming
> applications? I see it isn't supported yet. If there is any way I can
> help/contribute, I'll be happy if you point me in the right direction so that I can
> give a hand.
>
> Sincerely,
> Matus Cimerman
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Cody Koeninger
Just wanted to point out that because the jira isn't labeled SPIP, it
won't have shown up linked from

http://spark.apache.org/improvement-proposals.html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan  wrote:
> Hi all,
>
> It has been almost 2 weeks since I proposed the data source V2 for
> discussion, and we already got some feedbacks on the JIRA ticket and the
> prototype PR, so I'd like to call for a vote.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> Note that, this vote should focus on high-level design/framework, not
> specified APIs, as we can always change/improve specified APIs during
> development.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPARK-19547

2017-06-08 Thread Cody Koeninger
Can you explain in more detail what you mean by "distribute Kafka
topics among different instances of same consumer group"?

If you're trying to run multiple streams using the same consumer
group, it's already documented that you shouldn't do that.
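
In case it's useful, here's roughly what the intended usage looks like
with the 0.10 integration -- each direct stream (and each separate
application) gets its own group.id. Broker, topic and group names below
are made up, and ssc is assumed to be an existing StreamingContext:

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.kafka010._

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    // use a separate, unique group.id per stream / application,
    // don't share one consumer group across them
    "group.id" -> "my-app-myTopic_5",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("myTopic_5"), kafkaParams)
  )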

On Thu, Jun 8, 2017 at 12:43 AM, Rastogi, Pankaj
 wrote:
> Hi,
>  I have been trying to distribute Kafka topics among different instances of
> same consumer group. I am using KafkaDirectStream API for creating DStreams.
> After the second consumer group comes up, Kafka does partition rebalance and
> then Spark driver of the first consumer dies with the following exception:
>
> java.lang.IllegalStateException: No current assignment for partition
> myTopic_5-0
> at
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:264)
> at
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:336)
> at
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1236)
> at
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
> at
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
> at
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at
> org.apache.spark.streaming.dstream.TransformedDStream.createRDDWithLocalProperties(TransformedDStream.scala:65)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
> at
> 

Re: [KafkaSourceProvider] Why topic option and column without reverting to path as the least priority?

2017-05-01 Thread Cody Koeninger
Yeah, seems reasonable.
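
For anyone following along, a rough sketch of the two forms being
discussed (broker, topic and checkpoint path are made up, and df is
assumed to be a streaming DataFrame with a value column):

  // today: the topic comes from an option (or a "topic" column in df)
  df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("checkpointLocation", "/tmp/checkpoints")
    .option("topic", "events")
    .start()

  // proposed: start(path) fills in the topic, at the lowest precedence
  df.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("checkpointLocation", "/tmp/checkpoints")
    .start("events")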

On Mon, May 1, 2017 at 12:40 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> Thanks Cody and Michael! I didn't expect to get two answers so quickly and
> from THE brains behind spark - Kafka integration. #impressed
>
> Yes, Michael has nailed it. Using save's path was so natural to me after
> months with Spark that I was surprised to not have seen it instead of the
> custom and surely not very obvious topic.
>
> Imagine my day today when I'd discovered that I could use KafkaSource in
> batch queries and then suddenly found out about no support for path in save.
> I'm not faint-hearted so I survived :-)
>
> I think that change would make KafkaSource even cooler. Please add support
> if possible (and make it part of the upcoming 2.2.0, too!)
>
> Thanks.
>
> Jacek
>
> On 1 May 2017 7:26 p.m., "Michael Armbrust" <mich...@databricks.com> wrote:
>>
>> He's just suggesting that since the DataStreamWriter start() method can
>> fill in an option named "path", we should make that a synonym for "topic".
>> Then you could do something like.
>>
>> df.writeStream.format("kafka").start("topic")
>>
>> Seems reasonable if people don't think that is confusing.
>>
>> On Mon, May 1, 2017 at 8:43 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> I'm confused about what you're suggesting.  Are you saying that a
>>> Kafka sink should take a filesystem path as an option?
>>>
>>> On Mon, May 1, 2017 at 8:52 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>>> > Hi,
>>> >
>>> > I've just found out that KafkaSourceProvider supports topic option
>>> > that sets the Kafka topic to save a DataFrame to.
>>> >
>>> > You can also use topic column to assign rows to topics.
>>> >
>>> > Given the features, I've been wondering why "path" option is not
>>> > supported (even of least precedence) so when no topic column or option
>>> > are defined, save(path: String) would be the least priority.
>>> >
>>> > WDYT?
>>> >
>>> > It looks pretty trivial to support --> see KafkaSourceProvider at
>>> > lines [1] and [2] if I'm not mistaken.
>>> >
>>> > [1]
>>> > https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L145
>>> > [2]
>>> > https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L163
>>> >
>>> > Pozdrawiam,
>>> > Jacek Laskowski
>>> > 
>>> > https://medium.com/@jaceklaskowski/
>>> > Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>>> > Follow me at https://twitter.com/jaceklaskowski
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Question about upgrading Kafka client version

2017-03-10 Thread Cody Koeninger
There are existing tickets on the issues around kafka versions, e.g.
https://issues.apache.org/jira/browse/SPARK-18057 that haven't gotten
any committer weigh-in on direction.
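
In the meantime, the workaround from the docs excerpt quoted below is
just a matter of what you pass through kafkaParams to the 0.10
integration. A rough sketch with a made-up broker / group and
illustrative timeouts (the brokers' group.max.session.timeout.ms also
has to be at least as large as the session timeout used here):

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",
    "group.id" -> "my-group",
    "heartbeat.interval.ms" -> "120000",
    "session.timeout.ms" -> "600000",
    // the 0.10 consumer requires this to exceed session.timeout.ms
    "request.timeout.ms" -> "660000",
    "enable.auto.commit" -> (false: java.lang.Boolean)
    // plus the usual key / value deserializer settings
  )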

On Thu, Mar 9, 2017 at 12:52 PM, Oscar Batori  wrote:
> Guys,
>
> To change the subject from meta-voting...
>
> We are doing Spark Streaming against a Kafka setup, everything is pretty
> standard, and pretty current. In particular we are using Spark 2.1, and
> Kafka 0.10.1, with batch windows that are quite large (5-10 minutes). The
> problem we are having is pretty well described in the following excerpt from
> the Spark documentation:
> "For possible kafkaParams, see Kafka consumer config docs. If your Spark
> batch duration is larger than the default Kafka heartbeat session timeout
> (30 seconds), increase heartbeat.interval.ms and session.timeout.ms
> appropriately. For batches larger than 5 minutes, this will require changing
> group.max.session.timeout.ms on the broker. Note that the example sets
> enable.auto.commit to false, for discussion see Storing Offsets below."
>
> In our case "group.max.session.timeout.ms" is set to default value, and our
> processing time per batch easily exceeds that value. I did some further
> hunting around and found the following SO post:
> "KIP-62, decouples heartbeats from calls to poll() via a background
> heartbeat thread. This, allow for a longer processing time (ie, time between
> two consecutive poll()) than heartbeat interval."
>
> This pretty accurately describes our scenario: effectively our per batch
> processing time is 2-6 minutes, well within the batch window, but in excess
> of the max session timeout between polls, causing the consumer to be kicked
> out of the group.
>
> Are there any plans to move the Kafka client up to 0.10.1 and make this
> feature available to consumers? Or have I missed some helpful configuration
> that would ameliorate this problem? I recognize changing
> "group.max.session.timeout.ms" is one solution, though it seems doing
> heartbeat checking outside of implicitly piggy backing on polling seems more
> elegant.
>
> -Oscar
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
Can someone with filter share permissions make a filter for open
SPIPs and one for closed SPIPs, and share them?

e.g.

project = SPARK AND status in (Open, Reopened, "In Progress") AND
labels=SPIP ORDER BY createdDate DESC

and another with the status closed equivalent
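
Something like the following for the closed one, assuming the usual
Resolved / Closed statuses:

project = SPARK AND status in (Resolved, Closed) AND
labels=SPIP ORDER BY createdDate DESC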

I just made an open ticket with the SPIP label, so it should show up

On Fri, Mar 10, 2017 at 11:19 AM, Reynold Xin <r...@databricks.com> wrote:
> We can just start using spip label and link to it.
>
>
>
> On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> So to be clear, if I translate that google doc to markup and submit a
>> PR, you will merge it?
>>
>> If we're just using "spip" label, that's probably fine, but we still
>> need shared filters for open and closed SPIPs so the page can link to
>> them.
>>
>> I do not believe I have jira permissions to share filters, I just
>> attempted to edit one of mine and do not see an add shares field.
>>
>> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Sure, that seems OK to me. I can merge anything like that.
>> > I think anyone can make a new label in JIRA; I don't know if even the
>> > admins
>> > can make a new issue type unfortunately. We may just have to mention a
>> > convention involving title and label or something.
>> >
>> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> I think it ought to be its own page, linked from the more / community
>> >> menu dropdowns.
>> >>
>> >> We also need the jira tag, and for the page to clearly link to filters
>> >> that show proposed / completed SPIPs
>> >>
>> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> >> > let's
>> >> > say this document is the SPIP 1.0 process.
>> >> >
>> >> > I think the next step is just to translate the text to some suitable
>> >> > location. I suggest adding it to
>> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >> >
>> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
>> >> >>
>> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> >> hurt.
>> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> >> want to
>> >> >> declare and document consensus.
>> >> >>
>> >> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> >> think
>> >> >> it will actually do much anyway, which is why I am sanguine about
>> >> >> the
>> >> >> whole
>> >> >> thing.
>> >> >>
>> >> >> To bring this to a conclusion, I will just put the contents of the
>> >> >> doc
>> >> >> in
>> >> >> an email tomorrow for a VOTE. Raise any objections now.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
So to be clear, if I translate that google doc to markup and submit a
PR, you will merge it?

If we're just using "spip" label, that's probably fine, but we still
need shared filters for open and closed SPIPs so the page can link to
them.

I do not believe I have jira permissions to share filters, I just
attempted to edit one of mine and do not see an add shares field.

On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen <so...@cloudera.com> wrote:
> Sure, that seems OK to me. I can merge anything like that.
> I think anyone can make a new label in JIRA; I don't know if even the admins
> can make a new issue type unfortunately. We may just have to mention a
> convention involving title and label or something.
>
> On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I think it ought to be its own page, linked from the more / community
>> menu dropdowns.
>>
>> We also need the jira tag, and for the page to clearly link to filters
>> that show proposed / completed SPIPs
>>
>> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> > let's
>> > say this document is the SPIP 1.0 process.
>> >
>> > I think the next step is just to translate the text to some suitable
>> > location. I suggest adding it to
>> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >
>> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> hurt.
>> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> want to
>> >> declare and document consensus.
>> >>
>> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> think
>> >> it will actually do much anyway, which is why I am sanguine about the
>> >> whole
>> >> thing.
>> >>
>> >> To bring this to a conclusion, I will just put the contents of the doc
>> >> in
>> >> an email tomorrow for a VOTE. Raise any objections now.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-03-10 Thread Cody Koeninger
I think it ought to be its own page, linked from the more / community
menu dropdowns.

We also need the jira tag, and for the page to clearly link to filters
that show proposed / completed SPIPs

On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen  wrote:
> Alrighty, if nobody is objecting, and nobody calls for a VOTE, then, let's
> say this document is the SPIP 1.0 process.
>
> I think the next step is just to translate the text to some suitable
> location. I suggest adding it to
> https://github.com/apache/spark-website/blob/asf-site/contributing.md
>
> On Thu, Mar 9, 2017 at 4:55 PM Sean Owen  wrote:
>>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-03-09 Thread Cody Koeninger
I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)

On Tue, Mar 7, 2017 at 11:05 AM, Sean Owen <so...@cloudera.com> wrote:
> Do we need a VOTE? heck I think anyone can call one, anyway.
>
> Pre-flight vote check: anyone have objections to the text as-is?
> See
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>
> If so let's hash out specific suggest changes.
>
> If not, then I think the next step is to probably update the
> github.com/apache/spark-website repo with the text here. That's a code/doc
> change we can just review and merge as usual.
>
> On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Another week, another ping.  Anyone on the PMC willing to call a vote on
>> this?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-03-07 Thread Cody Koeninger
Another week, another ping.  Anyone on the PMC willing to call a vote on
this?

On Mon, Feb 27, 2017 at 3:08 PM, Ryan Blue <rb...@netflix.com> wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
> On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> The current draft LGTM.  I agree some of the various concerns may need to
>> be addressed in the future, depending on how SPIPs progress in practice.
>> If others agree, let's put it to a vote and revisit the proposal in a few
>> months.
>> Joseph
>>
>> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> It's been a week since any further discussion.
>>>
>>> Do PMC members think the current draft is OK to vote on?
>>>
>>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>> > I like document and happy to see SPIP draft version however i feel
>>> shepherd
>>> > role is again hurdle in process improvement ,It's like everything
>>> depends
>>> > only on shepherd .
>>> >
>>> > Also want to add point that SPIP  should be time bound with define SLA
>>> else
>>> > will defeats purpose.
>>> >
>>> >
>>> > Regards,
>>> > Vaquar khan
>>> >
>>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
>>> > wrote:
>>> >>
>>> >> > [The shepherd] can advise on technical and procedural
>>> considerations for
>>> >> > people outside the community
>>> >>
>>> >> The sentiment is good, but this doesn't justify requiring a shepherd
>>> for a
>>> >> proposal. There are plenty of people that wouldn't need this, would
>>> get
>>> >> feedback during discussion, or would ask a committer or PMC member if
>>> it
>>> >> weren't a formal requirement.
>>> >>
>>> >> > if no one is willing to be a shepherd, the proposed idea is
>>> probably not
>>> >> > going to receive much traction in the first place.
>>> >>
>>> >> This also doesn't sound like a reason for needing a shepherd. Saying
>>> that
>>> >> a shepherd probably won't hurt the process doesn't give me an idea of
>>> why a
>>> >> shepherd should be required in the first place.
>>> >>
>>> >> What was the motivation for adding a shepherd originally? It may not
>>> be
>>> >> bad and it could be helpful, but neither of those makes me think that
>>> they
>>> >> should be required or else the proposal fails.
>>> >>
>>> >> rb
>>> >>
>>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <
>>> timhun...@databricks.com>
>>> >> wrote:
>>> >>>
>>> >>> The doc looks good to me.
>>> >>>
>>> >>> Ryan, the role of the shepherd is to make sure that someone
>>> >>> knowledgeable with Spark processes is involved: this person can
>>> advise
>>> >>> on technical and procedural considerations for people outside the
>>> >>> community. Also, if no one is willing to be a shepherd, the proposed
>>> >>> idea is probably not going to receive much traction in the first
>>> >>> place.
>>> >>>
>>> >>> Tim
>>> >>>
>>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <c...@koeninger.org>
>>> >>> wrote:
>>> >>> > Reynold, thanks, LGTM.
>>> >>> >
>>> >>> > Sean, great concerns.  I agree that behavior is largely cultural
>>> and
>>> >>> > wr

Re: Spark Improvement Proposals

2017-02-24 Thread Cody Koeninger
It's been a week since any further discussion.

Do PMC members think the current draft is OK to vote on?

On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
> I like document and happy to see SPIP draft version however i feel shepherd
> role is again hurdle in process improvement ,It's like everything depends
> only on shepherd .
>
> Also want to add point that SPIP  should be time bound with define SLA else
> will defeats purpose.
>
>
> Regards,
> Vaquar khan
>
> On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>>
>> > [The shepherd] can advise on technical and procedural considerations for
>> > people outside the community
>>
>> The sentiment is good, but this doesn't justify requiring a shepherd for a
>> proposal. There are plenty of people that wouldn't need this, would get
>> feedback during discussion, or would ask a committer or PMC member if it
>> weren't a formal requirement.
>>
>> > if no one is willing to be a shepherd, the proposed idea is probably not
>> > going to receive much traction in the first place.
>>
>> This also doesn't sound like a reason for needing a shepherd. Saying that
>> a shepherd probably won't hurt the process doesn't give me an idea of why a
>> shepherd should be required in the first place.
>>
>> What was the motivation for adding a shepherd originally? It may not be
>> bad and it could be helpful, but neither of those makes me think that they
>> should be required or else the proposal fails.
>>
>> rb
>>
>> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <timhun...@databricks.com>
>> wrote:
>>>
>>> The doc looks good to me.
>>>
>>> Ryan, the role of the shepherd is to make sure that someone
>>> knowledgeable with Spark processes is involved: this person can advise
>>> on technical and procedural considerations for people outside the
>>> community. Also, if no one is willing to be a shepherd, the proposed
>>> idea is probably not going to receive much traction in the first
>>> place.
>>>
>>> Tim
>>>
>>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>> > Reynold, thanks, LGTM.
>>> >
>>> > Sean, great concerns.  I agree that behavior is largely cultural and
>>> > writing down a process won't necessarily solve any problems one way or
>>> > the other.  But one outwardly visible change I'm hoping for out of
>>> > this a way for people who have a stake in Spark, but can't follow
>>> > jiras closely, to go to the Spark website, see the list of proposed
>>> > major changes, contribute discussion on issues that are relevant to
>>> > their needs, and see a clear direction once a vote has passed.  We
>>> > don't have that now.
>>> >
>>> > Ryan, realistically speaking any PMC member can and will stop any
>>> > changes they don't like anyway, so might as well be up front about the
>>> > reality of the situation.
>>> >
>>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
>>> >> The text seems fine to me. Really, this is not describing a
>>> >> fundamentally
>>> >> new process, which is good. We've always had JIRAs, we've always been
>>> >> able
>>> >> to call a VOTE for a big question. This just writes down a sensible
>>> >> set of
>>> >> guidelines for putting those two together when a major change is
>>> >> proposed. I
>>> >> look forward to turning some big JIRAs into a request for a SPIP.
>>> >>
>>> >> My only hesitation is that this seems to be perceived by some as a new
>>> >> or
>>> >> different thing, that is supposed to solve some problems that aren't
>>> >> otherwise solvable. I see mentioned problems like: clear process for
>>> >> managing work, public communication, more committers, some sort of
>>> >> binding
>>> >> outcome and deadline.
>>> >>
>>> >> If SPIP is supposed to be a way to make people design in public and a
>>> >> way to
>>> >> force attention to a particular change, then, this doesn't do that by
>>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>>> >> detract
>>> >> from the discussion about doing what SPIP implies. It's just a process

Re: Spark Improvement Proposals

2017-02-16 Thread Cody Koeninger
Reynold, thanks, LGTM.

Sean, great concerns.  I agree that behavior is largely cultural and
writing down a process won't necessarily solve any problems one way or
the other.  But one outwardly visible change I'm hoping for out of
this is a way for people who have a stake in Spark, but can't follow
jiras closely, to go to the Spark website, see the list of proposed
major changes, contribute discussion on issues that are relevant to
their needs, and see a clear direction once a vote has passed.  We
don't have that now.

Ryan, realistically speaking any PMC member can and will stop any
changes they don't like anyway, so might as well be up front about the
reality of the situation.

On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
> The text seems fine to me. Really, this is not describing a fundamentally
> new process, which is good. We've always had JIRAs, we've always been able
> to call a VOTE for a big question. This just writes down a sensible set of
> guidelines for putting those two together when a major change is proposed. I
> look forward to turning some big JIRAs into a request for a SPIP.
>
> My only hesitation is that this seems to be perceived by some as a new or
> different thing, that is supposed to solve some problems that aren't
> otherwise solvable. I see mentioned problems like: clear process for
> managing work, public communication, more committers, some sort of binding
> outcome and deadline.
>
> If SPIP is supposed to be a way to make people design in public and a way to
> force attention to a particular change, then, this doesn't do that by
> itself. Therefore I don't want to let a detailed discussion of SPIP detract
> from the discussion about doing what SPIP implies. It's just a process
> document.
>
> Still, a fine step IMHO.
>
> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <r...@databricks.com> wrote:
>>
>> Updated. Any feedback from other community members?
>>
>>
>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>>
>>> Thanks for doing that.
>>>
>>> Given that there are at least 4 different Apache voting processes,
>>> "typical Apache vote process" isn't meaningful to me.
>>>
>>> I think the intention is that in order to pass, it needs at least 3 +1
>>> votes from PMC members *and no -1 votes from PMC members*.  But the document
>>> doesn't explicitly say that second part.
>>>
>>> There's also no mention of the duration a vote should remain open.
>>> There's a mention of a month for finding a shepherd, but that's different.
>>>
>>> Other than that, LGTM.
>>>
>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> Here's a new draft that incorporated most of the feedback:
>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>>>
>>>> I added a specific role for SPIP Author and another one for SPIP
>>>> Shepherd.
>>>>
>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>>>
>>>>> During the summit, I also had a lot of discussions over similar topics
>>>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>>>> believe Spark improvement proposals are good channels to collect the
>>>>> requirements/designs.
>>>>>
>>>>>
>>>>> IMO, we also need to consider the priority when working on these items.
>>>>> Even if the proposal is accepted, it does not mean it will be implemented
>>>>> and merged immediately. It is not a FIFO queue.
>>>>>
>>>>>
>>>>> Even if some PRs are merged, sometimes, we still have to revert them
>>>>> back, if the design and implementation are not reviewed carefully. We have
>>>>> to ensure our quality. Spark is not an application software. It is an
>>>>> infrastructure software that is being used by many many companies. We have
>>>>> to be very careful in the design and implementation, especially
>>>>> adding/changing the external APIs.
>>>>>
>>>>>
>>>>> When I developed the Mainframe infrastructure/middleware software in
>>>>> the past 6 years, I were involved in the discussions with 
>>>>> external/internal
>>>>> customers. The to-do feature list was always above 100. Sometimes, the
>>>>> customers are feeling frustrated when we are unable to deliver them on 
>>>>> time
>>>>> due to the resource limits and others. Even if they paid us billions, we
>>>>> still need to do it phase by phase or sometimes they have to accept the
>>>>> workarounds. That is the reality everyone has to face, I think.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Xiao Li
>>>>>>
>>>>>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-02-14 Thread Cody Koeninger
Thanks for doing that.

Given that there are at least 4 different Apache voting processes, "typical
Apache vote process" isn't meaningful to me.

I think the intention is that in order to pass, it needs at least 3 +1
votes from PMC members *and no -1 votes from PMC members*.  But the
document doesn't explicitly say that second part.

There's also no mention of the duration a vote should remain open.  There's
a mention of a month for finding a shepherd, but that's different.

Other than that, LGTM.

On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:

> Here's a new draft that incorporated most of the feedback:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
>
> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>
> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> During the summit, I also had a lot of discussions over similar topics
>> with multiple Committers and active users. I heard many fantastic ideas. I
>> believe Spark improvement proposals are good channels to collect the
>> requirements/designs.
>>
>>
>> IMO, we also need to consider the priority when working on these items.
>> Even if the proposal is accepted, it does not mean it will be implemented
>> and merged immediately. It is not a FIFO queue.
>>
>>
>> Even if some PRs are merged, sometimes, we still have to revert them
>> back, if the design and implementation are not reviewed carefully. We have
>> to ensure our quality. Spark is not an application software. It is an
>> infrastructure software that is being used by many many companies. We have
>> to be very careful in the design and implementation, especially
>> adding/changing the external APIs.
>>
>>
>> When I developed the Mainframe infrastructure/middleware software in the
>> past 6 years, I were involved in the discussions with external/internal
>> customers. The to-do feature list was always above 100. Sometimes, the
>> customers are feeling frustrated when we are unable to deliver them on time
>> due to the resource limits and others. Even if they paid us billions, we
>> still need to do it phase by phase or sometimes they have to accept the
>> workarounds. That is the reality everyone has to face, I think.
>>
>>
>> Thanks,
>>
>>
>> Xiao Li
>>
>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <c...@koeninger.org>:
>>
>>> At the spark summit this week, everyone from PMC members to users I had
>>> never met before were asking me about the Spark improvement proposals
>>> idea.  It's clear that it's a real community need.
>>>
>>> But it's been almost half a year, and nothing visible has been done.
>>>
>>> Reynold, are you going to do this?
>>>
>>> If so, when?
>>>
>>> If not, why?
>>>
>>> You already did the right thing by including long-deserved committers.
>>> Please keep doing the right thing for the community.
>>>
>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> +1 on all counts (consensus, time bound, define roles)
>>>>
>>>> I can update the doc in the next few days and share back. Then maybe we
>>>> can just officially vote on this. As Tim suggested, we might not get it
>>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>>
>>>>
>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com>
>>>> wrote:
>>>>
>>>>> Hi Cody,
>>>>> thank you for bringing up this topic, I agree it is very important to
>>>>> keep a cohesive community around some common, fluid goals. Here are a few
>>>>> comments about the current document:
>>>>>
>>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>>>> sounds great.
>>>>>
>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>>> technical decisions with a lasting impact. As such, the template should
>>>>> emphasize the role of the various parties during this process:
>>>>>
>>>>>  - the SPIP author is responsible for building consensus. She is the
>>>>> champion driv

Re: Spark Improvement Proposals

2017-02-11 Thread Cody Koeninger
At the spark summit this week, everyone from PMC members to users I had
never met before were asking me about the Spark improvement proposals
idea.  It's clear that it's a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserved committers.
Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <r...@databricks.com> wrote:

> +1 on all counts (consensus, time bound, define roles)
>
> I can update the doc in the next few days and share back. Then maybe we
> can just officially vote on this. As Tim suggested, we might not get it
> 100% right the first time and would need to re-iterate. But that's fine.
>
>
> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com>
> wrote:
>
>> Hi Cody,
>> thank you for bringing up this topic, I agree it is very important to
>> keep a cohesive community around some common, fluid goals. Here are a few
>> comments about the current document:
>>
>> 1. name: it should not overlap with an existing one such as SIP. Can you
>> imagine someone trying to discuss a scala spore proposal for spark?
>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>> sounds great.
>>
>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>> technical decisions with a lasting impact. As such, the template should
>> emphasize the role of the various parties during this process:
>>
>>  - the SPIP author is responsible for building consensus. She is the
>> champion driving the process forward and is responsible for ensuring that
>> the SPIP follows the general guidelines. The author should be identified in
>> the SPIP. The authorship of a SPIP can be transferred if the current author
>> is not interested and someone else wants to move the SPIP forward. There
>> should probably be 2-3 authors at most for each SPIP.
>>
>>  - someone with voting power should probably shepherd the SPIP (and be
>> recorded as such): ensuring that the final decision over the SPIP is
>> recorded (rejected, accepted, etc.), and advising about the technical
>> quality of the SPIP: this person need not be a champion for the SPIP or
>> contribute to it, but rather makes sure it stands a chance of being
>> approved when the vote happens. Also, if the author cannot find anyone who
>> would want to take this role, this proposal is likely to be rejected anyway.
>>
>>  - users, committers, contributors have the roles already outlined in the
>> document
>>
>> 3. timeline: ideally, once a SPIP has been offered for voting, it should
>> move swiftly into either being accepted or rejected, so that we do not end
>> up with a distracting long tail of half-hearted proposals.
>>
>> These rules are meant to be flexible, but the current document should be
>> clear about who is in charge of a SPIP, and the state it is currently in.
>>
>> We have had long discussions over some very important questions such as
>> approval. I do not have an opinion on these, but why not make a pick and
>> reevaluate this decision later? This is not a binding process at this point.
>>
>> Tim
>>
>>
>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't have a concern about voting vs consensus.
>>>
>>> I have a concern that whatever the decision making process is, it is
>>> explicitly announced on the ticket for the given proposal, with an explicit
>>> deadline, and an explicit outcome.
>>>
>>>
>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com>
>>> wrote:
>>>
>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>
>>>> My take on the specific issues Joseph mentioned:
>>>>
>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>> earlier for consensus:
>>>>
>>>> > Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> 2) Design doc template 

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Cody Koeninger
Congrats, glad to hear it

On Jan 24, 2017 12:47 PM, "Shixiong(Ryan) Zhu" 
wrote:

> Congrats Burak & Holden!
>
> On Tue, Jan 24, 2017 at 10:39 AM, Joseph Bradley 
> wrote:
>
>> Congratulations Burak & Holden!
>>
>> On Tue, Jan 24, 2017 at 10:33 AM, Dongjoon Hyun 
>> wrote:
>>
>>> Great! Congratulations, Burak and Holden.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On 2017-01-24 10:29 (-0800), Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>> >  
>>> >
>>> > Congratulations, Burak and Holden.
>>> >
>>> > On Tue, Jan 24, 2017 at 1:27 PM Russell Spitzer <
>>> russell.spit...@gmail.com>
>>> > wrote:
>>> >
>>> > > Great news! Congratulations!
>>> > >
>>> > > On Tue, Jan 24, 2017 at 10:25 AM Dean Wampler >> >
>>> > > wrote:
>>> > >
>>> > > Congratulations to both of you!
>>> > >
>>> > > dean
>>> > >
>>> > > *Dean Wampler, Ph.D.*
>>> > > Author: Programming Scala, 2nd Edition, Fast Data
>>> > > Architectures for Streaming Applications,
>>> > > Functional Programming for Java Developers, and Programming Hive (O'Reilly)
>>> > > Lightbend 
>>> > > @deanwampler 
>>> > > http://polyglotprogramming.com
>>> > > https://github.com/deanwampler
>>> > >
>>> > > On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li 
>>> wrote:
>>> > >
>>> > > Congratulations! Burak and Holden!
>>> > >
>>> > > 2017-01-24 10:13 GMT-08:00 Reynold Xin :
>>> > >
>>> > > Hi all,
>>> > >
>>> > > Burak and Holden have recently been elected as Apache Spark
>>> committers.
>>> > >
>>> > > Burak has been very active in a large number of areas in Spark,
>>> including
>>> > > linear algebra, stats/maths functions in DataFrames, Python/R APIs
>>> for
>>> > > DataFrames, dstream, and most recently Structured Streaming.
>>> > >
>>> > > Holden has been a long time Spark contributor and evangelist. She has
>>> > > written a few books on Spark, as well as frequent contributions to
>>> the
>>> > > Python API to improve its usability and performance.
>>> > >
>>> > > Please join me in welcoming the two!
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>
>


Re: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Cody Koeninger
Totally agree with most of what Sean said, just wanted to give an
alternate take on the "maintainers" thing

On Tue, Jan 24, 2017 at 10:23 AM, Sean Owen  wrote:
> There is no such list because there's no formal notion of ownership or
> access to subsets of the project. Tracking an informal notion would be
> process mostly for its own sake, and probably just go out of date. We sort
> of tried this with 'maintainers' and it didn't actually do anything.
>

My perception of that situation is that the Apache process is actively
antagonistic towards factoring out responsibility for particular parts
of the code into a hierarchy.  I think if Spark was under a different
open source model, with otherwise exactly the same committers, that
attempt at identifying maintainers would have worked out differently.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2017-01-03 Thread Cody Koeninger
I don't have a concern about voting vs consensus.

I have a concern that whatever the decision making process is, it is
explicitly announced on the ticket for the given proposal, with an explicit
deadline, and an explicit outcome.


On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com> wrote:

> I'm also in favor of this.  Thanks for your persistence Cody.
>
> My take on the specific issues Joseph mentioned:
>
> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
> earlier for consensus:
>
> > Majority vs consensus: My rationale is that I don't think we want to
> consider a proposal approved if it had objections serious enough that
> committers down-voted (or PMC depending on who gets a vote). If these
> proposals are like PEPs, then they represent a significant amount of
> community effort and I wouldn't want to move forward if up to half of the
> community thinks it's an untenable idea.
>
> 2) Design doc template -- agree this would be useful, but also seems
> totally orthogonal to moving forward on the SIP proposal.
>
> 3) agree w/ Joseph's proposal for updating the template.
>
> One small addition:
>
> 4) Deciding on a name -- minor, but I think its wroth disambiguating from
> Scala's SIPs, and the best proposal I've heard is "SPIP".   At least, no
> one has objected.  (don't care enough that I'd object to anything else,
> though.)
>
>
> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Hi Cody,
>>
>> Thanks for being persistent about this.  I too would like to see this
>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>> * Decide about a few issues
>> * Finalize the doc(s)
>> * Vote on this proposal
>>
>> Issues & TODOs:
>>
>> (1) The main issue I see above is voting vs. consensus.  I have little
>> preference here.  It sounds like something which could be tailored based on
>> whether we see too many or too few SIPs being approved.
>>
>> (2) Design doc template  (This would be great to have for Spark
>> regardless of this SIP discussion.)
>> * Reynold, are you still putting this together?
>>
>> (3) Template cleanups.  Listing some items mentioned above + a new one
>> w.r.t. Reynold's draft
>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>> :
>> * Reinstate the "Where" section with links to current and past SIPs
>> * Add field for stating explicit deadlines for approval
>> * Add field for stating Author & Committer shepherd
>>
>> Thanks all!
>> Joseph
>>
>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>
>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>
>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>> > On lazy consensus as opposed to voting:
>>> >
>>> > First, why lazy consensus? The proposal was for consensus, which is at
>>> least
>>> > three +1 votes and no vetos. Consensus has no losing side, it requires
>>> > getting to a point where there is agreement. Isn't that agreement what
>>> we
>>> > want to achieve with these proposals?
>>> >
>>> > Second, lazy consensus only removes the requirement for three +1
>>> votes. Why
>>> > would we not want at least three committers to think something is a
>>> good
>>> > idea before adopting the proposal?
>>> >
>>> > rb
>>> >
>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>> >>
>>> >> So there are some minor things (the Where section heading appears to
>>> >> be dropped; wherever this document is posted it needs to actually link
>>> >> to a jira filter showing current / past SIPs) but it doesn't look like
>>> >> I can comment on the google doc.
>>> >>
>>> >> The major substantive issue that I have is that this version is
>>> >> significantly less clear as to the outcome of an SIP.
>>> >>
>>> >> The apache example of lazy consensus at
>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>> >> explicit announcement of an explicit deadline, which I think are
>>> >> necessary for clarity.
>&g

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Cody Koeninger
Agree that frequent topic deletion is not a very Kafka-esque thing to do

On Fri, Dec 9, 2016 at 12:09 PM, Shixiong(Ryan) Zhu
 wrote:
> Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may
> be thrown NPE when a topic is deleted. I added some logic to retry on such
> failure, however, it may still fail when topic deletion is too frequent (the
> stress test). Just reopened
> https://issues.apache.org/jira/browse/SPARK-18588.
>
> Anyway, this is just a best effort to deal with Kafka issue, and in
> practice, people won't delete topic frequently, so this is not a release
> blocker.
>
> On Fri, Dec 9, 2016 at 2:55 AM, Sean Owen  wrote:
>>
>> As usual, the sigs / hashes are fine and licenses look fine.
>>
>> I am still seeing some test failures. A few I've seen over time and aren't
>> repeatable, but a few seem persistent. ANyone else observed these? I'm on
>> Ubuntu 16 / Java 8 building for -Pyarn -Phadoop-2.7 -Phive
>>
>> If anyone can confirm I'll investigate the cause if I can. I'd hesitate to
>> support the release yet unless the build is definitely passing for others.
>>
>>
>> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.281 sec
>> <<< ERROR!
>> java.lang.NoSuchMethodError:
>> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
>> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>>
>>
>>
>> - caching on disk *** FAILED ***
>>   java.util.concurrent.TimeoutException: Can't find 2 executors before
>> 3 milliseconds elapsed
>>   at
>> org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:584)
>>   at
>> org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$testCaching(DistributedSuite.scala:154)
>>   at
>> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
>>   at
>> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>>   at
>> org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(DistributedSuite.scala:191)
>>   at
>> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>   ...
>>
>>
>> - stress test for failOnDataLoss=false *** FAILED ***
>>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id =
>> 3b191b78-7f30-46d3-93f8-5fbeecce94a2, runId =
>> 0cab93b6-19d8-47a7-88ad-d296bea72405] terminated with exception: null
>>   at
>> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:262)
>>   at
>> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:160)
>>   ...
>>   Cause: java.lang.NullPointerException:
>>   ...
>>
>>
>>
>> On Thu, Dec 8, 2016 at 4:40 PM Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.0-rc2
>>> (080717497365b83bc202ab16812ced93eb1ea7bd)
>>>
>>> List of JIRA tickets resolved are:
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>>
>>>
>>> (Note that the docs and staging repo are still being uploaded and will be
>>> available soon)
>>>
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.1.0?
>>> 

Re: Back-pressure to Spark Kafka Streaming?

2016-12-05 Thread Cody Koeninger
If you want finer-grained max rate setting, SPARK-17510 got merged a
while ago.  There's also SPARK-18580 which might help address the
issue of starting backpressure rate for the first batch.
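
For context, the knobs that were already there are plain conf settings;
a minimal sketch (the rate value is illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // let Spark adapt the ingestion rate to batch processing times
    .set("spark.streaming.backpressure.enabled", "true")
    // hard per-partition-per-second cap, which also bounds the size of
    // the first batch after a restart
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")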

On Mon, Dec 5, 2016 at 4:18 PM, Liren Ding  wrote:
> Hey all,
>
> Does backressure actually work on spark kafka streaming? According to the
> latest spark streaming document:
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
> "In Spark 1.5, we have introduced a feature called backpressure that
> eliminate the need to set this rate limit, as Spark Streaming automatically
> figures out the rate limits and dynamically adjusts them if the processing
> conditions change. This backpressure can be enabled by setting the
> configuration parameter spark.streaming.backpressure.enabled to true."
> But I also see a few open spark jira tickets on this option:
> https://issues.apache.org/jira/browse/SPARK-7398
> https://issues.apache.org/jira/browse/SPARK-18371
>
> The case in the second ticket describes a similar issue as we have here. We
> use Kafka to send large batches (10~100M) to spark streaming, and the spark
> streaming interval is set to 1~4 minutes. With the backpressure set to true,
> the queued active batches still pile up when average batch processing time
> takes longer than default interval. After the spark driver is restarted, all
> queued batches turn to a giant batch, which block subsequent batches and
> also have a great chance to fail eventually. The only config we found that
> might help is "spark.streaming.kafka.maxRatePerPartition". It does limit the
> incoming batch size, but not a perfect solution since it depends on size of
> partition as well as the length of batch interval. For our case, hundreds of
> partitions X minutes of interval still produce a number that is too large
> for each batch. So we still want to figure out how to make the backressure
> work in spark kafka streaming, if it is supposed to work there. Thanks.
>
>
> Liren
>
>
>
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
Generating / defining an RDD is not the same thing as running the
compute() method of an RDD.  The direct stream definitely runs Kafka
consumers on the executors.

If you want more info, the blog post and video linked from
https://github.com/koeninger/kafka-exactly-once refer to the 0.8
implementation, but the general design is similar for the 0.10
version.

I think the likelihood of an official release supporting 0.9 is fairly
slim at this point, it's a year out of date and wouldn't be a drop-in
dependency change.


On Tue, Nov 15, 2016 at 5:50 PM, aakash aakash <email2aak...@gmail.com> wrote:
>
>
>> You can use the 0.8 artifact to consume from a 0.9 broker
>
> We are currently using "Camus" in production and one of the main goal to
> move to Spark is to use new Kafka Consumer API  of Kafka 0.9 and in our case
> we need the security provisions available in 0.9, that why we cannot use 0.8
> client.
>
>> Where are you reading documentation indicating that the direct stream
> only runs on the driver?
>
> I might be wrong here, but I see that new kafka+Spark stream code extend the
> InputStream and its documentation says : Input streams that can generate
> RDDs from new data by running a service/thread only on the driver node (that
> is, without running a receiver on worker nodes)
>
> Thanks and regards,
> Aakash Pradeep
>
>
> On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> It'd probably be worth no longer marking the 0.8 interface as
>> experimental.  I don't think it's likely to be subject to active
>> development at this point.
>>
>> You can use the 0.8 artifact to consume from a 0.9 broker
>>
>> Where are you reading documentation indicating that the direct stream
>> only runs on the driver?  It runs consumers on the worker nodes.
>>
>>
>> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash <email2aak...@gmail.com>
>> wrote:
>> > Re-posting it at dev group.
>> >
>> > Thanks and Regards,
>> > Aakash
>> >
>> >
>> > -- Forwarded message --
>> > From: aakash aakash <email2aak...@gmail.com>
>> > Date: Mon, Nov 14, 2016 at 4:10 PM
>> > Subject: using Spark Streaming with Kafka 0.9/0.10
>> > To: user-subscr...@spark.apache.org
>> >
>> >
>> > Hi,
>> >
>> > I am planning to use Spark Streaming to consume messages from Kafka 0.9.
>> > I
>> > have couple of questions regarding this :
>> >
>> > I see APIs are annotated with @Experimental. So can you please tell me
>> > when
>> > are we planning to make it production ready ?
>> > Currently, I see we are using Kafka 0.10 and so curious to know why not
>> > we
>> > started with 0.9 Kafka instead of 0.10 Kafka. As I see 0.10 kafka client
>> > would not be compatible with 0.9 client since there are some changes in
>> > arguments in consumer API.
>> > Current API extends InputDstream and as per document it means RDD will
>> > be
>> > generated by running a service/thread only on the driver node instead of
>> > worker node. Can you please explain to me why we are doing this and what
>> > is
>> > required to make sure that it runs on worker node.
>> >
>> >
>> > Thanks in advance !
>> >
>> > Regards,
>> > Aakash
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
It'd probably be worth no longer marking the 0.8 interface as
experimental.  I don't think it's likely to be subject to active
development at this point.

You can use the 0.8 artifact to consume from a 0.9 broker

Where are you reading documentation indicating that the direct stream
only runs on the driver?  It runs consumers on the worker nodes.


On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash  wrote:
> Re-posting it at dev group.
>
> Thanks and Regards,
> Aakash
>
>
> -- Forwarded message --
> From: aakash aakash 
> Date: Mon, Nov 14, 2016 at 4:10 PM
> Subject: using Spark Streaming with Kafka 0.9/0.10
> To: user-subscr...@spark.apache.org
>
>
> Hi,
>
> I am planning to use Spark Streaming to consume messages from Kafka 0.9. I
> have couple of questions regarding this :
>
> I see APIs are annotated with @Experimental. So can you please tell me when
> are we planning to make it production ready ?
> Currently, I see we are using Kafka 0.10 and so curious to know why not we
> started with 0.9 Kafka instead of 0.10 Kafka. As I see 0.10 kafka client
> would not be compatible with 0.9 client since there are some changes in
> arguments in consumer API.
> Current API extends InputDstream and as per document it means RDD will be
> generated by running a service/thread only on the driver node instead of
> worker node. Can you please explain to me why we are doing this and what is
> required to make sure that it runs on worker node.
>
>
> Thanks in advance !
>
> Regards,
> Aakash
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Connectors using new Kafka consumer API

2016-11-09 Thread Cody Koeninger
Ok... in general it seems to me like effort would be better spent
trying to help upstream, as opposed to us making a 5th slightly
different interface to Kafka (we currently have the 0.8 receiver, 0.8
dstream, 0.10 dstream, and 0.10 structured stream).

On Tue, Nov 8, 2016 at 10:05 PM, Mark Grover <m...@apache.org> wrote:
> I think they are open to others helping, in fact, more than one person has
> worked on the JIRA so far. And, it's been crawling really slowly and that's
> preventing adoption of Spark's new connector in secure Kafka environments.
>
> On Tue, Nov 8, 2016 at 7:59 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Have you asked the assignee on the Kafka jira whether they'd be
>> willing to accept help on it?
>>
>> On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover <m...@apache.org> wrote:
>> > Hi all,
>> > We currently have a new direct stream connector, thanks to work by Cody
>> > and
>> > others on SPARK-12177.
>> >
>> > However, that can't be used in secure clusters that require Kerberos
>> > authentication. That's because Kafka currently doesn't support
>> > delegation
>> > tokens (KAFKA-1696). Unfortunately, very little work has been done on
>> > that
>> > JIRA, so, in my opinion, folks who want to use secure Kafka (using the
>> > norm
>> > - Kerberos) can't do so because Spark Streaming can't consume from it
>> > today.
>> >
>> > The right way is, of course, to get delegation tokens in Kafka but
>> > honestly
>> > I don't know if that's happening in the near future. I am wondering if
>> > we
>> > should consider something to remedy this - for example, we could come up
>> > with a receiver based connector based on the new Kafka consumer API
>> > that'd
>> > support kerberos authentication. It won't require delegation tokens
>> > since
>> > there's only a very small number of executors talking to Kafka. Of
>> > course,
>> > for anyone who cares about high throughput and other direct connector
>> > benefits would have to use direct connector. Another thing we could do
>> > is
>> > ship the keytab to the executors in the direct connector, so delegation
>> > tokens are not required but the latter would be a pretty compromising
>> > solution, and I'd prefer not doing that.
>> >
>> > What do folks think? Would love to hear your thoughts, especially about
>> > the
>> > receiver.
>> >
>> > Thanks!
>> > Mark
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Connectors using new Kafka consumer API

2016-11-08 Thread Cody Koeninger
Have you asked the assignee on the Kafka jira whether they'd be
willing to accept help on it?

On Tue, Nov 8, 2016 at 5:26 PM, Mark Grover  wrote:
> Hi all,
> We currently have a new direct stream connector, thanks to work by Cody and
> others on SPARK-12177.
>
> However, that can't be used in secure clusters that require Kerberos
> authentication. That's because Kafka currently doesn't support delegation
> tokens (KAFKA-1696). Unfortunately, very little work has been done on that
> JIRA, so, in my opinion, folks who want to use secure Kafka (using the norm
> - Kerberos) can't do so because Spark Streaming can't consume from it today.
>
> The right way is, of course, to get delegation tokens in Kafka but honestly
> I don't know if that's happening in the near future. I am wondering if we
> should consider something to remedy this - for example, we could come up
> with a receiver based connector based on the new Kafka consumer API that'd
> support kerberos authentication. It won't require delegation tokens since
> there's only a very small number of executors talking to Kafka. Of course,
> for anyone who cares about high throughput and other direct connector
> benefits would have to use direct connector. Another thing we could do is
> ship the keytab to the executors in the direct connector, so delegation
> tokens are not required but the latter would be a pretty compromising
> solution, and I'd prefer not doing that.
>
> What do folks think? Would love to hear your thoughts, especially about the
> receiver.
>
> Thanks!
> Mark

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-11-08 Thread Cody Koeninger
So there are some minor things (the Where section heading appears to
be dropped; wherever this document is posted it needs to actually link
to a jira filter showing current / past SIPs) but it doesn't look like
I can comment on the google doc.

The major substantive issue that I have is that this version is
significantly less clear as to the outcome of an SIP.

The Apache example of lazy consensus at
http://apache.org/foundation/voting.html#LazyConsensus involves an
explicit announcement and an explicit deadline, both of which I think
are necessary for clarity.



On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
> It turned out suggested edits (trackable) don't show up for non-owners, so
> I've just merged all the edits in place. It should be visible now.
>
> On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Oops. Let me try figure that out.
>>
>>
>> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Thanks for picking up on this.
>>>
>>> Maybe I fail at google docs, but I can't see any edits on the document
>>> you linked.
>>>
>>> Regarding lazy consensus, if the board in general has less of an issue
>>> with that, sure.  As long as it is clearly announced, lasts at least
>>> 72 hours, and has a clear outcome.
>>>
>>> The other points are hard to comment on without being able to see the
>>> text in question.
>>>
>>>
>>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>> > I just looked through the entire thread again tonight - there are a lot
>>> > of
>>> > great ideas being discussed. Thanks Cody for taking the first crack at
>>> > the
>>> > proposal.
>>> >
>>> > I want to first comment on the context. Spark is one of the most
>>> > innovative
>>> > and important projects in (big) data -- overall technical decisions
>>> > made in
>>> > Apache Spark are sound. But of course, a project as large and active as
>>> > Spark always have room for improvement, and we as a community should
>>> > strive
>>> > to take it to the next level.
>>> >
>>> > To that end, the two biggest areas for improvements in my opinion are:
>>> >
>>> > 1. Visibility: There are so much happening that it is difficult to know
>>> > what
>>> > really is going on. For people that don't follow closely, it is
>>> > difficult to
>>> > know what the important initiatives are. Even for people that do
>>> > follow, it
>>> > is difficult to know what specific things require their attention,
>>> > since the
>>> > number of pull requests and JIRA tickets are high and it's difficult to
>>> > extract signal from noise.
>>> >
>>> > 2. Solicit user (broadly defined, including developers themselves)
>>> > input
>>> > more proactively: At the end of the day the project provides value
>>> > because
>>> > users use it. Users can't tell us exactly what to build, but it is
>>> > important
>>> > to get their inputs.
>>> >
>>> >
>>> > I've taken Cody's doc and edited it:
>>> >
>>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> > (I've made all my modifications trackable)
>>> >
>>> > There are couple high level changes I made:
>>> >
>>> > 1. I've consulted a board member and he recommended lazy consensus as
>>> > opposed to voting. The reason being in voting there can easily be a
>>> > "loser'
>>> > that gets outvoted.
>>> >
>>> > 2. I made it lighter weight, and renamed "strategy" to "optional design
>>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
>>> > tagging
>>> > things and linking them elsewhere simply having design docs and
>>> > prototypes
>>> > implementations in PRs is not something that has not worked so far".
>>> >
>>> > 3. I made some the language tweaks to focus more on visibility. For
>>> > example,
>>> > "The purpose of an SIP is to inform and involve", rather than just
>>> > "involve". SIPs should also have at least two emails that go to dev@.
>>> >
>>> >
>>> > Wh

Re: Odp.: Spark Improvement Proposals

2016-11-07 Thread Cody Koeninger
Thanks for picking up on this.

Maybe I fail at google docs, but I can't see any edits on the document
you linked.

Regarding lazy consensus, if the board in general has less of an issue
with that, sure.  As long as it is clearly announced, lasts at least
72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the
text in question.


On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> I just looked through the entire thread again tonight - there are a lot of
> great ideas being discussed. Thanks Cody for taking the first crack at the
> proposal.
>
> I want to first comment on the context. Spark is one of the most innovative
> and important projects in (big) data -- overall technical decisions made in
> Apache Spark are sound. But of course, a project as large and active as
> Spark always have room for improvement, and we as a community should strive
> to take it to the next level.
>
> To that end, the two biggest areas for improvements in my opinion are:
>
> 1. Visibility: There are so much happening that it is difficult to know what
> really is going on. For people that don't follow closely, it is difficult to
> know what the important initiatives are. Even for people that do follow, it
> is difficult to know what specific things require their attention, since the
> number of pull requests and JIRA tickets are high and it's difficult to
> extract signal from noise.
>
> 2. Solicit user (broadly defined, including developers themselves) input
> more proactively: At the end of the day the project provides value because
> users use it. Users can't tell us exactly what to build, but it is important
> to get their inputs.
>
>
> I've taken Cody's doc and edited it:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> (I've made all my modifications trackable)
>
> There are couple high level changes I made:
>
> 1. I've consulted a board member and he recommended lazy consensus as
> opposed to voting. The reason being in voting there can easily be a "loser'
> that gets outvoted.
>
> 2. I made it lighter weight, and renamed "strategy" to "optional design
> sketch". Echoing one of the earlier email: "IMHO so far aside from tagging
> things and linking them elsewhere simply having design docs and prototypes
> implementations in PRs is not something that has not worked so far".
>
> 3. I made some the language tweaks to focus more on visibility. For example,
> "The purpose of an SIP is to inform and involve", rather than just
> "involve". SIPs should also have at least two emails that go to dev@.
>
>
> While I was editing this, I thought we really needed a suggested template
> for design doc too. I will get to that too ...
>
>
> On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Most things looked OK to me too, although I do plan to take a closer look
>> after Nov 1st when we cut the release branch for 2.1.
>>
>>
>> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>>
>>> The proposal looks OK to me. I assume, even though it's not explicitly
>>> called, that voting would happen by e-mail? A template for the
>>> proposal document (instead of just a bullet nice) would also be nice,
>>> but that can be done at any time.
>>>
>>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> for a SIP, given the scope of the work. The document attached even
>>> somewhat matches the proposed format. So if anyone wants to try out
>>> the process...
>>>
>>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>> > Now that spark summit europe is over, are any committers interested in
>>> > moving forward with this?
>>> >
>>> >
>>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >
>>> > Or are we going to let this discussion die on the vine?
>>> >
>>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>> > <tomasz.gaw...@outlook.com> wrote:
>>> >> Maybe my mail was not clear enough.
>>> >>
>>> >>
>>> >> I didn't want to write "lets focus on Flink" or any other framework.
>>> >> The
>>> >> idea with benchmarks was to show two things:
>>> >>
>>> >> - why some people are doing bad PR for Spark
>>> >>
>>> >

Anyone want to weigh in on a Kafka DStreams api change?

2016-11-04 Thread Cody Koeninger
SPARK-17510

https://github.com/apache/spark/pull/15132

It's for allowing tweaking of rate limiting on a per-partition basis
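
For anyone deciding whether to weigh in, a sketch of the shape of the
change (names follow the open PR, so treat them as provisional): instead
of a single global spark.streaming.kafka.maxRatePerPartition, you could
supply a per-partition policy along these lines:

  import org.apache.kafka.common.TopicPartition
  import org.apache.spark.streaming.kafka010.PerPartitionConfig

  // Provisional API shape: let one known-hot partition pull more
  // records per second than everything else.
  class SkewAwareRate(hot: TopicPartition) extends PerPartitionConfig {
    override def maxRatePerPartition(tp: TopicPartition): Long =
      if (tp == hot) 10000L else 1000L
  }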

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Continuous warning while consuming using new kafka-spark010 API

2016-11-04 Thread Cody Koeninger
I answered the duplicate post on the user mailing list; I'd say keep
the discussion there.

On Fri, Nov 4, 2016 at 12:14 PM, vonnagy  wrote:
> Nitin,
>
> I am getting similar issues using Spark 2.0.1 and Kafka 0.10. I have two
> jobs, one that uses a Kafka stream and one that uses just the KafkaRDD.
>
> With the KafkaRDD, I continually get the "Failed to get records". I have
> adjusted the polling with `spark.streaming.kafka.consumer.poll.ms` and the
> size of records with Kafka's `max.poll.records`. Even when it gets records
> it is extremely slow.
>
> When working with multiple KafkaRDDs in parallel I get the dreaded
> `ConcurrentModificationException`. The Spark logic is supposed to use a
> CachedKafkaConsumer based on the topic and partition. This is supposed to
> guarantee thread safety, but I continually get this error along with the
> polling timeout.
>
> Has anyone else tried to use Spark 2 with Kafka 0.10 and had any success. At
> this point it is completely useless in my experience. With Spark 1.6 and
> Kafka 0.8.x, I never had these problems.
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Continuous-warning-while-consuming-using-new-kafka-spark010-API-tp18987p19736.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
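
For reference, the two settings mentioned above live in different places;
the poll timeout is a Spark conf, while max.poll.records is a Kafka
consumer param (values here are illustrative only):

  val conf = new org.apache.spark.SparkConf()
    // how long a cached consumer waits on poll() before giving up
    .set("spark.streaming.kafka.consumer.poll.ms", "10000")

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",   // placeholder
    // cap on how many records a single poll() may return
    "max.poll.records" -> "500"
    // group.id, deserializers etc. omitted for brevity
  )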

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Handling questions in the mailing lists

2016-11-02 Thread Cody Koeninger
So, concrete things people could do:

- users could tag subject lines appropriately to the component they're
asking about

- contributors could monitor user@ for tags relating to components
they've worked on.
I'd be surprised if my miss rate for any mailing list questions
well-labeled as Kafka was higher than 5%

- committers could be more aggressive about soliciting and merging PRs
to improve documentation.
It's a lot easier to answer even poorly-asked questions with a link to
relevant docs.

On Wed, Nov 2, 2016 at 7:39 AM, Sean Owen  wrote:
> There's already reviews@ and issues@. dev@ is for project development itself
> and I think is OK. You're suggesting splitting up user@ and I sympathize
> with the motivation. Experience tells me that we'll have a beginner@ that's
> then totally ignored, and people will quickly learn to post to advanced@ to
> get attention, and we'll be back where we started. Putting it in JIRA
> doesn't help. I don't think this a problem that is merely down to lack of
> process. It actually requires cultivating a culture change on the community
> list.
>
> On Wed, Nov 2, 2016 at 12:11 PM Mendelson, Assaf 
> wrote:
>>
>> What I am suggesting is basically to fix that.
>>
>> For example, we might say that mailing list A is only for voting, mailing
>> list B is only for PR and have something like stack overflow for developer
>> questions (I would even go as far as to have beginner, intermediate and
>> advanced mailing list for users and beginner/advanced for dev).
>>
>>
>>
>> This can easily be done using stack overflow tags, however, that would
>> probably be harder to manage.
>>
>> Maybe using special jira tags and manage it in jira?
>>
>>
>>
>> Anyway as I said, the main issue is not user questions (except maybe
>> advanced ones) but more for dev questions. It is so easy to get lost in the
>> chatter that it makes it very hard for people to learn spark internals…
>>
>> Assaf.
>>
>>
>>
>> From: Sean Owen [mailto:so...@cloudera.com]
>> Sent: Wednesday, November 02, 2016 2:07 PM
>> To: Mendelson, Assaf; dev@spark.apache.org
>> Subject: Re: Handling questions in the mailing lists
>>
>>
>>
>> I think that unfortunately mailing lists don't scale well. This one has
>> thousands of subscribers with different interests and levels of experience.
>> For any given person, most messages will be irrelevant. I also find that a
>> lot of questions on user@ are not well-asked, aren't an SSCCE
>> (http://sscce.org/), not something most people are going to bother replying
>> to even if they could answer. I almost entirely ignore user@ because there
>> are higher-priority channels like PRs to deal with, that already have
>> hundreds of messages per day. This is why little of it gets an answer -- too
>> noisy.
>>
>>
>>
>> We have to have official mailing lists, in any event, to have some
>> official channel for things like votes and announcements. It's not wrong to
>> ask questions on user@ of course, but a lot of the questions I see could
>> have been answered with research of existing docs or looking at the code. I
>> think that given the scale of the list, it's not wrong to assert that this
>> is sort of a prerequisite for asking thousands of people to answer one's
>> question. But we can't enforce that.
>>
>>
>>
>> The situation will get better to the extent people ask better questions,
>> help other people ask better questions, and answer good questions. I'd
>> encourage anyone feeling this way to try to help along those dimensions.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Nov 2, 2016 at 11:32 AM assaf.mendelson 
>> wrote:
>>
>> Hi,
>>
>> I know this is a little off topic but I wanted to raise an issue about
>> handling questions in the mailing list (this is true both for the user
>> mailing list and the dev but since there are other options such as stack
>> overflow for user questions, this is more problematic in dev).
>>
>> Let’s say I ask a question (as I recently did). Unfortunately this was
>> during spark summit in Europe so probably people were busy. In any case no
>> one answered.
>>
>> The problem is, that if no one answers very soon, the question will almost
>> certainly remain unanswered because new messages will simply drown it.
>>
>>
>>
>> This is a common issue not just for questions but for any comment or idea
>> which is not immediately picked up.
>>
>>
>>
>> I believe we should have a method of handling this.
>>
>> Generally, I would say these types of things belong in stack overflow,
>> after all, the way it is built is perfect for this. More seasoned spark
>> contributors and committers can periodically check out unanswered questions
>> and answer them.
>>
>> The problem is that stack overflow (as well as other targets such as the
>> databricks forums) tend to have a more user based orientation. This means
>> that any spark internal question will almost certainly remain unanswered.
>>
>>
>>
>> I was wondering 

Re: JIRA Components for Streaming

2016-10-31 Thread Cody Koeninger
Makes sense to me.

I do wonder if e.g.

[SPARK-12345][STRUCTUREDSTREAMING][KAFKA]

is going to leave any room in the Github PR form for actual title content?

On Mon, Oct 31, 2016 at 1:37 PM, Michael Armbrust
 wrote:
> I'm planning to do a little maintenance on JIRA to hopefully improve the
> visibility into the progress / gaps in Structured Streaming.  In particular,
> while we share a lot of optimization / execution logic with SQL, the set of
> desired features and bugs is fairly different.
>
> Proposal:
>   - Structured Streaming (new component, move existing tickets here)
>   - Streaming -> DStreams
>
> Thoughts, objections?
>
> Michael

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Odp.: Spark Improvement Proposals

2016-10-31 Thread Cody Koeninger
Now that spark summit europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
<tomasz.gaw...@outlook.com> wrote:
> Maybe my mail was not clear enough.
>
>
> I didn't want to write "lets focus on Flink" or any other framework. The
> idea with benchmarks was to show two things:
>
> - why some people are doing bad PR for Spark
>
> - how - in easy way - we can change it and show that Spark is still on the
> top
>
>
> No more, no less. Benchmarks will be helpful, but I don't think they're the
> most important thing in Spark :) On the Spark main page there is still chart
> "Spark vs Hadoop". It is important to show that framework is not the same
> Spark with other API, but much faster and optimized, comparable or even
> faster than other frameworks.
>
>
> About real-time streaming, I think it would be just good to see it in Spark.
> I very like current Spark model, but many voices that says "we need more" -
> community should listen also them and try to help them. With SIPs it would
> be easier, I've just posted this example as "thing that may be changed with
> SIP".
>
>
> I very like unification via Datasets, but there is a lot of algorithms
> inside - let's make easy API, but with strong background (articles,
> benchmarks, descriptions, etc) that shows that Spark is still modern
> framework.
>
>
> Maybe now my intention will be clearer :) As I said organizational ideas
> were already mentioned and I agree with them, my mail was just to show some
> aspects from my side, so from theside of developer and person who is trying
> to help others with Spark (via StackOverflow or other ways)
>
>
> Pozdrawiam / Best regards,
>
> Tomasz
>
>
> 
> From: Cody Koeninger <c...@koeninger.org>
> Sent: 17 October 2016 16:46
> To: Debasish Das
> Cc: Tomasz Gawęda; dev@spark.apache.org
> Subject: Re: Spark Improvement Proposals
>
> I think narrowly focusing on Flink or benchmarks is missing my point.
>
> My point is evolve or die.  Spark's governance and organization is
> hampering its ability to evolve technologically, and it needs to
> change.
>
> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as
>> soon as I looked into it since compared to writing Java map-reduce and
>> Cascading code, Spark made writing distributed code fun...But now as we
>> went
>> deeper with Spark and real-time streaming use-case gets more prominent, I
>> think it is time to bring a messaging model in conjunction with the
>> batch/micro-batch API that Spark is good at... akka-streams close
>> integration with spark micro-batching APIs looks like a great direction to
>> stay in the game with Apache Flink...Spark 2.0 integrated streaming with
>> batch with the assumption is that micro-batching is sufficient to run SQL
>> commands on stream but do we really have time to do SQL processing at
>> streaming data within 1-2 seconds ?
>>
>> After reading the email chain, I started to look into Flink documentation
>> and if you compare it with Spark documentation, I think we have major work
>> to do detailing out Spark internals so that more people from community
>> start
>> to take active role in improving the issues so that Spark stays strong
>> compared to Flink.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>
>> Spark is no longer an engine that works for micro-batch and batch...We
>> (and
>> I am sure many others) are pushing spark as an engine for stream and query
>> processing.we need to make it a state-of-the-art engine for high speed
>> streaming data and user queries as well !
>>
>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com>
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm quite late with my answer, but I think my suggestions may help a
>>> little bit. :) Many technical and organizational topics were mentioned,
>>> but I want to focus on these negative posts about Spark and about
>>> "haters"
>>>
>>> I really like Spark. Easy of use, speed, very good community - it's
>>> everything here. But every project has to "fight" on the "framework market"

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Cody Koeninger
I think only supporting one version of Scala at any given time is not
sufficient; two probably is OK.

I.e. don't drop 2.10 before 2.12 is out and supported.

On Tue, Oct 25, 2016 at 10:56 AM, Sean Owen  wrote:
> The general forces are that new versions of things to support emerge, and
> are valuable to support, but have some cost to support in addition to old
> versions. And the old versions become less used and therefore less valuable
> to support, and at some point it tips to being more cost than value. It's
> hard to judge these costs and benefits.
>
> Scala is perhaps the trickiest one because of the general mutual
> incompatibilities across minor versions. The cost of supporting multiple
> versions is high, and a third version is about to arrive. That's probably
> the most pressing question. It's actually biting with some regularity now,
> with compile errors on 2.10.
>
> (Python I confess I don't have an informed opinion about.)
>
> Java, Hadoop are not as urgent because they're more backwards-compatible.
> Anecdotally, I'd be surprised if anyone today would "upgrade" to Java 7 or
> an old Hadoop version. And I think that's really the question. Even if one
> decided to drop support for all this in 2.1.0, it would not mean people
> can't use Spark with these things. It merely means they can't necessarily
> use Spark 2.1.x. This is why we have maintenance branches for 1.6.x, 2.0.x.
>
> Tying Scala 2.11/12 support to Java 8 might make sense.
>
> In fact, I think that's part of the reason that an update in master, perhaps
> 2.1.x, could be overdue, because it actually is just the beginning of the
> end of the support burden. If you want to stop dealing with these in ~6
> months they need to stop being supported in minor branches by right about
> now.
>
>
>
>
> On Tue, Oct 25, 2016 at 4:47 PM Mark Hamstra 
> wrote:
>>
>> What's changed since the last time we discussed these issues, about 7
>> months ago?  Or, another way to formulate the question: What are the
>> threshold criteria that we should use to decide when to end Scala 2.10
>> and/or Java 7 support?
>>
>> On Tue, Oct 25, 2016 at 8:36 AM, Sean Owen  wrote:
>>>
>>> I'd like to gauge where people stand on the issue of dropping support for
>>> a few things that were considered for 2.0.
>>>
>>> First: Scala 2.10. We've seen a number of build breakages this week
>>> because the PR builder only tests 2.11. No big deal at this stage, but, it
>>> did cause me to wonder whether it's time to plan to drop 2.10 support,
>>> especially with 2.12 coming soon.
>>>
>>> Next, Java 7. It's reasonably old and out of public updates at this
>>> stage. It's not that painful to keep supporting, to be honest. It would
>>> simplify some bits of code, some scripts, some testing.
>>>
>>> Hadoop versions: I think the the general argument is that most anyone
>>> would be using, at the least, 2.6, and it would simplify some code that has
>>> to reflect to use not-even-that-new APIs. It would remove some moderate
>>> complexity in the build.
>>>
>>>
>>> "When" is a tricky question. Although it's a little aggressive for minor
>>> releases, I think these will all happen before 3.x regardless. 2.1.0 is not
>>> out of the question, though coming soon. What about ... 2.2.0?
>>>
>>>
>>> Although I tend to favor dropping support, I'm mostly asking for current
>>> opinions.
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Yep, I had submitted a PR that included it way back in the original
direct stream for Kafka, but it got nixed in favor of
TaskContext.partitionId ;)  The concern then was about too many
xWithBlah APIs on RDD.

If we do want to deprecate TaskContext.partitionId and add
foreachPartitionWithIndex, I think that makes sense; I can start a
ticket.

On Thu, Oct 20, 2016 at 1:16 PM, Reynold Xin <r...@databricks.com> wrote:
> Seems like a good new API to add?
>
>
> On Thu, Oct 20, 2016 at 11:14 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Access to the partition ID is necessary for basically every single one
>> of my jobs, and there isn't a foreachPartitionWithIndex equivalent.
>> You can kind of work around it with empty foreach after the map, but
>> it's really awkward to explain to people.
>>
>> On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin <r...@databricks.com> wrote:
>> > FYI - Xiangrui submitted an amazing pull request to fix a long standing
>> > issue with a lot of the nondeterministic expressions (rand, randn,
>> > monotonically_increasing_id): https://github.com/apache/spark/pull/15567
>> >
>> > Prior to this PR, we were using TaskContext.partitionId as the partition
>> > index in initializing expressions. However, that is actually not a good
>> > index to use in most cases, because it is the physical task's partition
>> > id
>> > and does not always reflect the partition index at the time the RDD is
>> > created (or in the Spark SQL physical plan). This makes a big difference
>> > once there is a union or coalesce operation.
>> >
>> > The "index" given by mapPartitionsWithIndex, on the other hand, does not
>> > have this problem because it actually reflects the logical partition
>> > index
>> > at the time the RDD is created.
>> >
>> > When is it safe to use TaskContext.partitionId? It is safe at the very
>> > end
>> > of a query plan (the root node), because there partitionId is guaranteed
>> > based on the current implementation to be the same as the physical task
>> > partition id.
>> >
>> >
>> > For example, prior to Xiangrui's PR, the following query would return 2
>> > rows, whereas the correct behavior should be 1 entry:
>> >
>> >
>> > spark.range(1).selectExpr("rand(1)").union(spark.range(1).selectExpr("rand(1)")).distinct.show()
>> >
>> > The reason it'd return 2 rows is because rand was using
>> > TaskContext.partitionId as the per-partition seed, and as a result the
>> > two
>> > sides of the union are using different seeds.
>> >
>> >
>> > I'm starting to think we should deprecate the API and ban the use of it
>> > within the project to be safe ...
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [PSA] TaskContext.partitionId != the actual logical partition index

2016-10-20 Thread Cody Koeninger
Access to the partition ID is necessary for basically every single one
of my jobs, and there isn't a foreachPartitionWithIndex equivalent.
You can kind of work around it with an empty foreach after the map, but
it's really awkward to explain to people.
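
Concretely, the workaround looks something like this (a sketch;
doSomething is a hypothetical stand-in for whatever per-partition work
needs the index):

  // mapPartitionsWithIndex exposes the logical partition index; the
  // empty foreach at the end just forces the lazy transformation to run.
  rdd.mapPartitionsWithIndex { (index, records) =>
    doSomething(index, records)  // hypothetical side-effecting work
    Iterator.empty
  }.foreach(_ => ())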

On Thu, Oct 20, 2016 at 12:52 PM, Reynold Xin  wrote:
> FYI - Xiangrui submitted an amazing pull request to fix a long standing
> issue with a lot of the nondeterministic expressions (rand, randn,
> monotonically_increasing_id): https://github.com/apache/spark/pull/15567
>
> Prior to this PR, we were using TaskContext.partitionId as the partition
> index in initializing expressions. However, that is actually not a good
> index to use in most cases, because it is the physical task's partition id
> and does not always reflect the partition index at the time the RDD is
> created (or in the Spark SQL physical plan). This makes a big difference
> once there is a union or coalesce operation.
>
> The "index" given by mapPartitionsWithIndex, on the other hand, does not
> have this problem because it actually reflects the logical partition index
> at the time the RDD is created.
>
> When is it safe to use TaskContext.partitionId? It is safe at the very end
> of a query plan (the root node), because there partitionId is guaranteed
> based on the current implementation to be the same as the physical task
> partition id.
>
>
> For example, prior to Xiangrui's PR, the following query would return 2
> rows, whereas the correct behavior should be 1 entry:
>
> spark.range(1).selectExpr("rand(1)").union(spark.range(1).selectExpr("rand(1)")).distinct.show()
>
> The reason it'd return 2 rows is because rand was using
> TaskContext.partitionId as the per-partition seed, and as a result the two
> sides of the union are using different seeds.
>
>
> I'm starting to think we should deprecate the API and ban the use of it
> within the project to be safe ...
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
I don't think it's just about what to target - if you could target 1ms
batches without harming 1 second or 1 minute batches, why wouldn't you?
I think it's about having a clear strategy and dedicating resources to it.
If scheduling batches at an order of magnitude or two lower latency is the
strategy, and that's actually feasible, that's great. But I haven't seen
that clear direction, and this is by no means a recent issue.

On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
>
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figure out all the parts in integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>>
>> On a related note - are there other problems users face with
>> micro-batch other than latency ?
>>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >> <mich...@databricks.com> wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it
>> seems
>> >> > like you found most of it.  In general, release windows can be found
>> on
>> >> > the
>> >> > wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the kafka support you
>> >> > mentioned.
>> >> > It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key /
>> eventTime)
>> >> >  - Improvements to the query planner (remove some of the
>> restrictions on
>> >> > what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit
>> the
>> >> > most.
>> >> > Would love more feedback on what is blocking real use cases.
>> >> >
>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io>
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >> I hope it is the right forum.
>> >> >> I am looking for some information of what to expect from
>> >> >> StructuredStreaming in its next releases to help me choose when /
>> where
>> >> >> to
>> >> >> start using it more seriously (or where to invest in workarounds and
>> >> >> where
>> >> >> to wait). I couldn't find a good place where such planning discussed
>> >> >> for 2.1
>> >> >> (like, for example ML and SPARK-15581).
>> >> >> I'm aware of the 2.0 documented limits
>> >> >

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
Is anyone seriously thinking about alternatives to microbatches?

On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
 wrote:
> Anything that is actively being designed should be in JIRA, and it seems
> like you found most of it.  In general, release windows can be found on the
> wiki.
>
> 2.1 has a lot of stability fixes as well as the kafka support you mentioned.
> It may also include some of the following.
>
> The items I'd like to start thinking about next are:
>  - Evicting state from the store based on event time watermarks
>  - Sessionization (grouping together related events by key / eventTime)
>  - Improvements to the query planner (remove some of the restrictions on
> what queries can be run).
>
> This is roughly in order based on what I've been hearing users hit the most.
> Would love more feedback on what is blocking real use cases.
>
> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor  wrote:
>>
>> Hi,
>> I hope it is the right forum.
>> I am looking for some information of what to expect from
>> StructuredStreaming in its next releases to help me choose when / where to
>> start using it more seriously (or where to invest in workarounds and where
>> to wait). I couldn't find a good place where such planning discussed for 2.1
>> (like, for example ML and SPARK-15581).
>> I'm aware of the 2.0 documented limits
>> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> like no support for multiple aggregations levels, joins are strictly to a
>> static dataset (no SCD or stream-stream) etc, limited sources / sinks (like
>> no sink for interactive queries) etc etc
>> I'm also aware of some changes that have landed in master, like the new
>> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> metrics in SPARK-17731, and some improvements for the file source.
>> If I remember correctly, the discussion on Spark release cadence concluded
>> with a preference to a four-month cycles, with likely code freeze pretty
>> soon (end of October). So I believe the scope for 2.1 should likely quite
>> clear to some, and that 2.2 planning should likely be starting about now.
>> Any visibility / sharing will be highly appreciated!
>> thanks in advance,
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-18 Thread Cody Koeninger
+1 to putting docs in one clear place.

On Oct 18, 2016 6:40 AM, "Sean Owen"  wrote:

> I'm OK with that. The upside to the wiki is that it can be edited directly
> outside of a release cycle. However, in practice I find that the wiki is
> rarely changed. To me it also serves as a place for information that isn't
> exactly project documentation like "powered by" listings.
>
> In a way I'd like to get rid of the wiki to have one less place for docs,
> that doesn't have the same accessibility (I don't know who can give edit
> access), and doesn't have a review process.
>
> For now I'd settle for bringing over a few key docs like the one you
> mention. I spent a little time a while ago removing some duplication across
> the wiki and project docs and think there's a bit more than could be done.
>
>
> On Tue, Oct 18, 2016 at 12:25 PM Holden Karau 
> wrote:
>
>> Right now the wiki isn't particularly accessible to updates by external
>> contributors. We've already got a contributing to spark page which just
>> links to the wiki - how about if we just move the wiki contents over? This
>> way contributors can contribute to our documentation about how to
>> contribute probably helping clear up points of confusion for new
>> contributors which the rest of us may be blind to.
>>
>> If we do this we would probably want to update the wiki page to point to
>> the documentation generated from markdown. It would also mean that the
>> results of any update to the contributing guide take a full release cycle
>> to be visible. Another alternative would be opening up the wiki to a
>> broader set of people.
>>
>> I know a lot of people are probably getting ready for Spark Summit EU
>> (and I hope to catch up with some of y'all there) but I figured this a
>> relatively minor proposal.
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: cutting 2.0.2?

2016-10-17 Thread Cody Koeninger
SPARK-17841  a three-line bugfix that has a week-old PR
SPARK-17812  being able to specify starting offsets is a must-have for
a Kafka MVP in my opinion; it already has a PR
SPARK-17813  I can put in a PR for this tonight if it'll be considered
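
For SPARK-17812, the usage being proposed looks roughly like the
following (option name and JSON shape are from the open PR, so treat the
details as provisional; -2 / -1 stand for earliest / latest):

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
    .option("subscribe", "topicA")                      // placeholder
    // explicit per-partition starting offsets
    .option("startingOffsets", """{"topicA":{"0":1234,"1":-2}}""")
    .load()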

On Mon, Oct 17, 2016 at 12:28 AM, Reynold Xin  wrote:
> Since 2.0.1, there have been a number of correctness fixes as well as some
> nice improvements to the experimental structured streaming (notably basic
> Kafka support). I'm thinking about cutting 2.0.2 later this week, before
> Spark Summit Europe. Let me know if there are specific things (bug fixes)
> you really want to merge into branch-2.0.
>
> Cheers.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-10-17 Thread Cody Koeninger
ery Spark developer
>>
>>
>> Second: real-time streaming. I've written some time ago about real-time
>> streaming support in Spark Structured Streaming. Some work should be
>> done to make SSS more low-latency, but I think it's possible. Maybe
>> Spark may look at Gearpump, which is also built on top of Akka? I don't
>> know yet, it is good topic for SIP. However I think that Spark should
>> have real-time streaming support. Currently I see many posts/comments
>> that "Spark has too big latency". Spark Streaming is doing very good
>> jobs with micro-batches, however I think it is possible to add also more
>> real-time processing.
>>
>> Other people said much more and I agree with proposal of SIP. I'm also
>> happy that PMC's are not saying that they will not listen to users, but
>> they really want to make Spark better for every user.
>>
>>
>> What do you think about these two topics? Especially I'm looking at Cody
>> (who has started this topic) and PMCs :)
>>
>> Pozdrawiam / Best regards,
>>
>> Tomasz
>>
>>
>> W dniu 2016-10-07 o 04:51, Cody Koeninger pisze:
>> > I love Spark.  3 or 4 years ago it was the first distributed computing
>> > environment that felt usable, and the community was welcoming.
>> >
>> > But I just got back from the Reactive Summit, and this is what I
>> > observed:
>> >
>> > - Industry leaders on stage making fun of Spark's streaming model
>> > - Open source project leaders saying they looked at Spark's governance
>> > as a model to avoid
>> > - Users saying they chose Flink because it was technically superior
>> > and they couldn't get any answers on the Spark mailing lists
>> >
>> > Whether you agree with the substance of any of this, when this stuff
>> > gets repeated enough people will believe it.
>> >
>> > Right now Spark is suffering from its own success, and I think
>> > something needs to change.
>> >
>> > - We need a clear process for planning significant changes to the
>> > codebase.
>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>> > but you need a documented process with a clear outcome (e.g. a vote).
>> > Passing around google docs after an implementation has largely been
>> > decided on doesn't cut it.
>> >
>> > - All technical communication needs to be public.
>> > Things getting decided in private chat, or when 1/3 of the committers
>> > work for the same company and can just talk to each other...
>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>> > the project.
>> > The way structured streaming has played out has shown that there are
>> > significant technical blind spots (myself included).
>> > One way to address that is to get the people who have domain knowledge
>> > involved, and listen to them.
>> >
>> > - We need more committers, and more committer diversity.
>> > Per committer there are, what, more than 20 contributors and 10 new
>> > jira tickets a month?  It's too much.
>> > There are people (I am _not_ referring to myself) who have been around
>> > for years, contributed thousands of lines of code, helped educate the
>> > public around Spark... and yet are never going to be voted in.
>> >
>> > - We need a clear process for managing volunteer work.
>> > Too many tickets sit around unowned, unclosed, uncertain.
>> > If someone proposed something and it isn't up to snuff, tell them and
>> > close it.  It may be blunt, but it's clearer than "silent no".
>> > If someone wants to work on something, let them own the ticket and set
>> > a deadline. If they don't meet it, close it or reassign it.
>> >
>> > This is not me putting on an Apache Bureaucracy hat.  This is me
>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>> > with the culture and process.
>> >
>> > Please, let's change it.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Cody Koeninger
I've always been confused as to why it would ever be a good idea to
put any streaming query system on the critical path for synchronous
< 100 msec requests.  It seems to make a lot more sense to have a
streaming system do async updates of a store that has better latency
and quality of service characteristics for multiple users.  Then your
only latency concerns are event to update, not request to response.
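
As a sketch of that shape (KeyValueStoreClient below is a hypothetical
stand-in for whatever low-latency store actually serves the synchronous
reads; request/response traffic only ever touches that store, never
Spark):

  import org.apache.spark.sql.{ForeachWriter, Row}

  // Hypothetical client for the serving store, not a real library.
  class KeyValueStoreClient {
    def put(key: String, value: String): Unit = ???  // placeholder
    def close(): Unit = ???                          // placeholder
  }

  val writer = new ForeachWriter[Row] {
    var client: KeyValueStoreClient = _
    def open(partitionId: Long, version: Long): Boolean = {
      client = new KeyValueStoreClient
      true
    }
    def process(row: Row): Unit =
      client.put(row.getString(0), row.getString(1))
    def close(errorOrNull: Throwable): Unit = client.close()
  }

  // assuming an existing streaming DataFrame `streamingDF`:
  // streamingDF.writeStream.foreach(writer).start()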

On Thu, Oct 13, 2016 at 10:39 AM, Fred Reiss  wrote:
> On Tue, Oct 11, 2016 at 11:02 AM, Shivaram Venkataraman
>  wrote:
>>
>> >
>> Could you expand a little bit more on stability ? Is it just bursty
>> workloads in terms of peak vs. average throughput ? Also what level of
>> latencies do you find users care about ? Is it on the order of 2-3
>> seconds vs. 1 second vs. 100s of milliseconds ?
>> >
>
>
> Regarding stability, I've seen two levels of concrete requirements.
>
> The first is "don't bring down my Spark cluster". That is to say, regardless
> of the input data rate, Spark shouldn't thrash or crash outright. Processing
> may lag behind the data arrival rate, but the cluster should stay up and
> remain fully functional.
>
> The second level is "don't bring down my application". A common use for
> streaming systems is to handle heavyweight computations that are part of a
> larger application, like a web application, a mobile app, or a plant control
> system. For example, an online application for car insurance might need to
> do some pretty involved machine learning to produce an accurate quote and
> suggest good upsells to the customer. If the heavyweight portion times out,
> the whole application times out, and you lose a customer.
>
> In terms of bursty vs. non-bursty, the "don't bring down my Spark cluster
> case" is more about handling bursts, while the "don't bring down my
> application" case is more about delivering acceptable end-to-end response
> times under typical load.
>
> Regarding latency: One group I talked to mentioned requirements in the
> 100-200 msec range, driven by the need to display a web page on a browser or
> mobile device. Another group in the Internet of Things space mentioned times
> ranging from 5 seconds to 30 seconds throughout the conversation. But most
> people I've talked to have been pretty vague about specific numbers.
>
> My impression is that these groups are not motivated by anxiety about
> meeting a particular latency target for a particular application. Rather,
> they want to make low latency the norm so that they can stop having to think
> about latency. Today, low latency is a special requirement of special
> applications. But that policy imposes a lot of hidden costs. IT architects
> have to spend time estimating the latency requirements of every application
> and lobbying for special treatment when those requirements are strict.
> Managers have to spend time engineering business processes around latency.
> Data scientists have to spend time packaging up models and negotiating how
> those models will be shipped over to the low-latency serving tier. And
> customers who are accustomed to Google and smartphones end up with an
> experience that is functional but unsatisfying.
>
> It's best to think of latency as a sliding scale. A given level of latency
> imposes a given level of cost enterprise-wide. Someone who is making a
> decision on middleware policy will balance this cost against other costs:
> How much does it cost to deploy the middleware? How much does it cost to
> train developers to use the system? The winner will be the system that
> minimizes the overall cost.
>
> Fred

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
If someone wants to tell me that it's OK and "The Apache Way" for
Kafka and Flink to have a proposal process that ends in a lazy
majority, but it's not OK for Spark to have a proposal process that
ends in a non-lazy consensus...

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process

In practice any PMC member can stop a proposal they don't like, so I'm
not sure how much it matters.



On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> There is a larger issue to keep in mind, and that is that what you are
> proposing is a procedure that, as far as I am aware, hasn't previously been
> adopted in an Apache project, and thus is not an easy or exact fit with
> established practices that have been blessed as "The Apache Way".  As such,
> we need to be careful, because we have run into some trouble in the past
> with some inside the ASF but essentially outside the Spark community who
> didn't like the way we were doing things.
>
> On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Apache documents say lots of confusing stuff, including that committers are
>> in practice given a vote.
>>
>> https://www.apache.org/foundation/voting.html
>>
>> I don't care either way; if someone wants me to sub committer for PMC in
>> the voting section, fine, we just need a clear outcome.
>>
>>
>> On Oct 10, 2016 17:36, "Mark Hamstra" <m...@clearstorydata.com> wrote:
>>>
>>> If I'm correctly understanding the kind of voting that you are talking
>>> about, then to be accurate, it is only the PMC members that have a vote, not
>>> all committers:
>>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>>>
>>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>>
>>>> I think the main value is in being honest about what's going on.  No
>>>> one other than committers can cast a meaningful vote, that's the
>>>> reality.  Beyond that, if people think it's more open to allow formal
>>>> proposals from anyone, I'm not necessarily against it, but my main
>>>> question would be this:
>>>>
>>>> If anyone can submit a proposal, are committers actually going to
>>>> clearly reject and close proposals that don't meet the requirements?
>>>>
>>>> Right now we have a serious problem with lack of clarity regarding
>>>> contributions, and that cannot spill over into goal-setting.
>>>>
>>>> On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>> > +1 to votes to approve proposals. I agree that proposals should have
>>>> > an
>>>> > official mechanism to be accepted, and a vote is an established means
>>>> > of
>>>> > doing that well. I like that it includes a period to review the
>>>> > proposal and
>>>> > I think proposals should have been discussed enough ahead of a vote to
>>>> > survive the possibility of a veto.
>>>> >
>>>> > I also like the names that are short and (mostly) unique, like SEP.
>>>> >
>>>> > Where I disagree is with the requirement that a committer must
>>>> > formally
>>>> > propose an enhancement. I don't see the value of restricting this: if
>>>> > someone has the will to write up a proposal then they should be
>>>> > encouraged
>>>> > to do so and start a discussion about it. Even if there is a political
>>>> > reality as Cody says, what is the value of codifying that in our
>>>> > process? I
>>>> > think restricting who can submit proposals would only undermine them
>>>> > by
>>>> > pushing contributors out. Maybe I'm missing something here?
>>>> >
>>>> > rb
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org>
>>>> > wrote:
>>>> >>
>>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>>> >> out in the linked document under the Who? section.  Formally
>>>> >> proposing
>>>> >> them, not so much, because of the political realities.
>>>> >>
>>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>>>

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
Updated on github,
https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

I believe I've touched on all feedback with the exception of naming,
and API vs Strategy.

Do we want a straw poll on naming?

Matei, are your concerns about api vs strategy addressed if we add an
API bullet point to the template?

On Mon, Oct 10, 2016 at 2:38 PM, Steve Loughran <ste...@hortonworks.com> wrote:
> This is an interesting process proposal; I think it could work well.
>
> -It's got the flavour of the ASF incubator; maybe some of the processes 
> there: mentor, regular reporting in could help, in particular, help stop the 
> -1 at the end of the work
> -it may also aid collaboration to have a medium lived branch, so enabling 
> collaboration with multiple people submitting PRs into the ASF codebase. This 
> can reduce cost of merge and enable jenkins to keep on top of it. It also 
> fits in well with the ASF "do in apache infra" community development process.
>
>
>> On 10 Oct 2016, at 20:26, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> Agreed with this. As I said before regarding who submits: it's not a normal 
>> ASF process to require contributions to only come from committers. 
>> Committers are of course the only people who can *commit* stuff. But the 
>> whole point of an open source project is that anyone can *contribute* -- 
>> indeed, that is how people become committers. For example, in every ASF 
>> project, anyone can open JIRAs, submit design docs, submit patches, review 
>> patches, and vote on releases. This particular process is very similar to 
>> posting a JIRA or a design doc.
>>
>> I also like consensus with a deadline (e.g. someone says "here is a new SEP, 
>> we want to accept it by date X so please comment before").
>>
>> In general, with this type of stuff, it's better to start with very 
>> lightweight processes and then expand them if needed. Adding lots of rules 
>> from the beginning makes it confusing and can reduce contributions. 
>> Although, as engineers, we believe that anything can be solved using 
>> mechanical rules, in practice software development is a social process that 
>> ultimately requires humans to tackle things on a case-by-case basis.
>>
>> Matei
>>
>>
>>> On Oct 10, 2016, at 12:19 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> That seems reasonable to me.
>>>
>>> I do not want to see lazy consensus used on one of these proposals
>>> though, I want a clear outcome, i.e. call for a vote, wait at least 72
>>> hours, get three +1s and no vetos.
>>>
>>>
>>>
>>> On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>> Proposal submission: I think we should keep this as open as possible. If
>>>> there is a problem with too many open proposals, then we should tackle that
>>>> as a fix rather than excluding participation. Perhaps it will end up that
>>>> way, but I think it's worth trying a more open model first.
>>>>
>>>> Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> rb
>>>>
>>>> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <c...@koeninger.org> 
>>>> wrote:
>>>>>
>>>>> I think this is closer to a procedural issue than a code modification
>>>>> issue, hence why majority.  If everyone thinks consensus is better, I
>>>>> don't care.  Again, I don't feel strongly about the way we achieve
>>>>> clarity, just that we achieve clarity.
>>>>>
>>>>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>> Sorry, I missed that the proposal includes majority approval. Why
>>>>>> majority
>>>>>> instead of consensus? I think we want to build consensus around these
>>>>>> proposals and it makes sense to discuss until no one would veto.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
That seems reasonable to me.

I do not want to see lazy consensus used on one of these proposals
though, I want a clear outcome, i.e. call for a vote, wait at least 72
hours, get three +1s and no vetoes.



On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue <rb...@netflix.com> wrote:
> Proposal submission: I think we should keep this as open as possible. If
> there is a problem with too many open proposals, then we should tackle that
> as a fix rather than excluding participation. Perhaps it will end up that
> way, but I think it's worth trying a more open model first.
>
> Majority vs consensus: My rationale is that I don't think we want to
> consider a proposal approved if it had objections serious enough that
> committers down-voted (or PMC depending on who gets a vote). If these
> proposals are like PEPs, then they represent a significant amount of
> community effort and I wouldn't want to move forward if up to half of the
> community thinks it's an untenable idea.
>
> rb
>
> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I think this is closer to a procedural issue than a code modification
>> issue, hence why majority.  If everyone thinks consensus is better, I
>> don't care.  Again, I don't feel strongly about the way we achieve
>> clarity, just that we achieve clarity.
>>
>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <rb...@netflix.com> wrote:
>> > Sorry, I missed that the proposal includes majority approval. Why
>> > majority
>> > instead of consensus? I think we want to build consensus around these
>> > proposals and it makes sense to discuss until no one would veto.
>> >
>> > rb
>> >
>> > On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <rb...@netflix.com> wrote:
>> >>
>> >> +1 to votes to approve proposals. I agree that proposals should have an
>> >> official mechanism to be accepted, and a vote is an established means
>> >> of
>> >> doing that well. I like that it includes a period to review the
>> >> proposal and
>> >> I think proposals should have been discussed enough ahead of a vote to
>> >> survive the possibility of a veto.
>> >>
>> >> I also like the names that are short and (mostly) unique, like SEP.
>> >>
>> >> Where I disagree is with the requirement that a committer must formally
>> >> propose an enhancement. I don't see the value of restricting this: if
>> >> someone has the will to write up a proposal then they should be
>> >> encouraged
>> >> to do so and start a discussion about it. Even if there is a political
>> >> reality as Cody says, what is the value of codifying that in our
>> >> process? I
>> >> think restricting who can submit proposals would only undermine them by
>> >> pushing contributors out. Maybe I'm missing something here?
>> >>
>> >> rb
>> >>
>> >>
>> >>
>> >> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org>
>> >> wrote:
>> >>>
>> >>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> >>> out in the linked document under the Who? section.  Formally proposing
>> >>> them, not so much, because of the political realities.
>> >>>
>> >>> Yes, implementation strategy definitely affects goals.  There are all
>> >>> kinds of examples of this, I'll pick one that's my fault so as to
>> >>> avoid sounding like I'm blaming:
>> >>>
>> >>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> >>> upon by the community) goals was to make sure people could use the
>> >>> Dstream with however they were already using Kafka at work.  The lack
>> >>> of explicit agreement on that goal led to all kinds of fighting with
>> >>> committers, that could have been avoided.  The lack of explicit
>> >>> up-front strategy discussion led to the DStream not really working
>> >>> with compacted topics.  I knew about compacted topics, but don't have
>> >>> a use for them, so had a blind spot there.  If there was explicit
>> >>> up-front discussion that my strategy was "assume that batches can be
>> >>> defined on the driver solely by beginning and ending offsets", there's
>> >>> a greater chance that a user would have seen that and said, "hey, what
>> >>> about non-contiguous offsets in

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
I think this is closer to a procedural issue than a code modification
issue, hence why majority.  If everyone thinks consensus is better, I
don't care.  Again, I don't feel strongly about the way we achieve
clarity, just that we achieve clarity.

On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue <rb...@netflix.com> wrote:
> Sorry, I missed that the proposal includes majority approval. Why majority
> instead of consensus? I think we want to build consensus around these
> proposals and it makes sense to discuss until no one would veto.
>
> rb
>
> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> +1 to votes to approve proposals. I agree that proposals should have an
>> official mechanism to be accepted, and a vote is an established means of
>> doing that well. I like that it includes a period to review the proposal and
>> I think proposals should have been discussed enough ahead of a vote to
>> survive the possibility of a veto.
>>
>> I also like the names that are short and (mostly) unique, like SEP.
>>
>> Where I disagree is with the requirement that a committer must formally
>> propose an enhancement. I don't see the value of restricting this: if
>> someone has the will to write up a proposal then they should be encouraged
>> to do so and start a discussion about it. Even if there is a political
>> reality as Cody says, what is the value of codifying that in our process? I
>> think restricting who can submit proposals would only undermine them by
>> pushing contributors out. Maybe I'm missing something here?
>>
>> rb
>>
>>
>>
>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>>
>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> out in the linked document under the Who? section.  Formally proposing
>>> them, not so much, because of the political realities.
>>>
>>> Yes, implementation strategy definitely affects goals.  There are all
>>> kinds of examples of this, I'll pick one that's my fault so as to
>>> avoid sounding like I'm blaming:
>>>
>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> upon by the community) goals was to make sure people could use the
>>> Dstream with however they were already using Kafka at work.  The lack
>>> of explicit agreement on that goal led to all kinds of fighting with
>>> committers, that could have been avoided.  The lack of explicit
>>> up-front strategy discussion led to the DStream not really working
>>> with compacted topics.  I knew about compacted topics, but don't have
>>> a use for them, so had a blind spot there.  If there was explicit
>>> up-front discussion that my strategy was "assume that batches can be
>>> defined on the driver solely by beginning and ending offsets", there's
>>> a greater chance that a user would have seen that and said, "hey, what
>>> about non-contiguous offsets in a compacted topic".
>>>
>>> This kind of thing is only going to happen smoothly if we have a
>>> lightweight user-visible process with clear outcomes.
>>>
>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>> <assaf.mendel...@rsa.com> wrote:
>>> > I agree with most of what Cody said.
>>> >
>>> > Two things:
>>> >
>>> > First we can always have other people suggest SIPs but mark them as
>>> > “unreviewed” and have committers basically move them forward. The
>>> > problem is
>>> > that writing a good document takes time. This way we can leverage non
>>> > committers to do some of this work (it is just another way to
>>> > contribute).
>>> >
>>> >
>>> >
>>> > As for strategy, in many cases implementation strategy can affect the
>>> > goals.
>>> > I will give  a small example: In the current structured streaming
>>> > strategy,
>>> > we group by the time to achieve a sliding window. This is definitely an
>>> > implementation decision and not a goal. However, I can think of several
>>> > aggregation functions which have the time inside their calculation
>>> > buffer.
>>> > For example, let’s say we want to return a set of all distinct values.
>>> > One
>>> > way to implement this would be to make the set into a map and have the
>>> > value
>>> > contain the last time seen. Multiplying it across the groupby would
>>> > c

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
I think the main value is in being honest about what's going on.  No
one other than committers can cast a meaningful vote, that's the
reality.  Beyond that, if people think it's more open to allow formal
proposals from anyone, I'm not necessarily against it, but my main
question would be this:

If anyone can submit a proposal, are committers actually going to
clearly reject and close proposals that don't meet the requirements?

Right now we have a serious problem with lack of clarity regarding
contributions, and that cannot spill over into goal-setting.

On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue <rb...@netflix.com> wrote:
> +1 to votes to approve proposals. I agree that proposals should have an
> official mechanism to be accepted, and a vote is an established means of
> doing that well. I like that it includes a period to review the proposal and
> I think proposals should have been discussed enough ahead of a vote to
> survive the possibility of a veto.
>
> I also like the names that are short and (mostly) unique, like SEP.
>
> Where I disagree is with the requirement that a committer must formally
> propose an enhancement. I don't see the value of restricting this: if
> someone has the will to write up a proposal then they should be encouraged
> to do so and start a discussion about it. Even if there is a political
> reality as Cody says, what is the value of codifying that in our process? I
> think restricting who can submit proposals would only undermine them by
> pushing contributors out. Maybe I'm missing something here?
>
> rb
>
>
>
> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> out in the linked document under the Who? section.  Formally proposing
>> them, not so much, because of the political realities.
>>
>> Yes, implementation strategy definitely affects goals.  There are all
>> kinds of examples of this, I'll pick one that's my fault so as to
>> avoid sounding like I'm blaming:
>>
>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> upon by the community) goals was to make sure people could use the
>> Dstream with however they were already using Kafka at work.  The lack
>> of explicit agreement on that goal led to all kinds of fighting with
>> committers, that could have been avoided.  The lack of explicit
>> up-front strategy discussion led to the DStream not really working
>> with compacted topics.  I knew about compacted topics, but don't have
>> a use for them, so had a blind spot there.  If there was explicit
>> up-front discussion that my strategy was "assume that batches can be
>> defined on the driver solely by beginning and ending offsets", there's
>> a greater chance that a user would have seen that and said, "hey, what
>> about non-contiguous offsets in a compacted topic".
>>
>> This kind of thing is only going to happen smoothly if we have a
>> lightweight user-visible process with clear outcomes.
>>
>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> <assaf.mendel...@rsa.com> wrote:
>> > I agree with most of what Cody said.
>> >
>> > Two things:
>> >
>> > First we can always have other people suggest SIPs but mark them as
>> > “unreviewed” and have committers basically move them forward. The
>> > problem is
>> > that writing a good document takes time. This way we can leverage non
>> > committers to do some of this work (it is just another way to
>> > contribute).
>> >
>> >
>> >
>> > As for strategy, in many cases implementation strategy can affect the
>> > goals.
>> > I will give  a small example: In the current structured streaming
>> > strategy,
>> > we group by the time to achieve a sliding window. This is definitely an
>> > implementation decision and not a goal. However, I can think of several
>> > aggregation functions which have the time inside their calculation
>> > buffer.
>> > For example, let’s say we want to return a set of all distinct values.
>> > One
>> > way to implement this would be to make the set into a map and have the
>> > value
>> > contain the last time seen. Multiplying it across the groupby would cost
>> > a
>> > lot in performance. So adding such a strategy would have a great effect
>> > on
>> > the type of aggregations and their performance which does affect the
>> > goal.
>> > Without adding the strategy, it is easy for whoever goes to the design
>> > document to not 

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
Yes, users suggesting SIPs is a good thing and is explicitly called
out in the linked document under the Who? section.  Formally proposing
them, not so much, because of the political realities.

Yes, implementation strategy definitely affects goals.  There are all
kinds of examples of this, I'll pick one that's my fault so as to
avoid sounding like I'm blaming:

When I implemented the Kafka DStream, one of my (not explicitly agreed
upon by the community) goals was to make sure people could use the
DStream however they were already using Kafka at work.  The lack
of explicit agreement on that goal led to all kinds of fighting with
committers, which could have been avoided.  The lack of explicit
up-front strategy discussion led to the DStream not really working
with compacted topics.  I knew about compacted topics, but don't have
a use for them, so had a blind spot there.  If there had been explicit
up-front discussion that my strategy was "assume that batches can be
defined on the driver solely by beginning and ending offsets", there's
a greater chance that a user would have seen that and said, "hey, what
about non-contiguous offsets in a compacted topic".

This kind of thing is only going to happen smoothly if we have a
lightweight user-visible process with clear outcomes.
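
As a concrete illustration of the compacted-topic point above, here is a
minimal, hypothetical sketch (not the actual spark-streaming-kafka code) of
what "batches are defined solely by beginning and ending offsets" implies:

    // Hypothetical sketch only -- illustrates the assumption, not the real DStream internals.
    case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
      // Holds for a plain topic, but not for a compacted topic, where log compaction
      // can delete records so the offsets in [fromOffset, untilOffset) are non-contiguous.
      def assumedCount: Long = untilOffset - fromOffset
    }

    object CompactedTopicExample {
      def main(args: Array[String]): Unit = {
        val batch = OffsetRange("events", partition = 0, fromOffset = 100, untilOffset = 110)
        // A compacted partition might only contain offsets 100, 104 and 109 in this
        // range: 3 records, not the 10 the driver assumed when sizing the batch.
        println(s"assumed record count = ${batch.assumedCount}")
      }
    }

A driver that equates untilOffset - fromOffset with the number of records to
expect will mis-size or stall batches on a compacted topic, which is exactly
the kind of gap an up-front strategy discussion could have surfaced.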

On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
<assaf.mendel...@rsa.com> wrote:
> I agree with most of what Cody said.
>
> Two things:
>
> First we can always have other people suggest SIPs but mark them as
> “unreviewed” and have committers basically move them forward. The problem is
> that writing a good document takes time. This way we can leverage non
> committers to do some of this work (it is just another way to contribute).
>
>
>
> As for strategy, in many cases implementation strategy can affect the goals.
> I will give  a small example: In the current structured streaming strategy,
> we group by the time to achieve a sliding window. This is definitely an
> implementation decision and not a goal. However, I can think of several
> aggregation functions which have the time inside their calculation buffer.
> For example, let’s say we want to return a set of all distinct values. One
> way to implement this would be to make the set into a map and have the value
> contain the last time seen. Multiplying it across the groupby would cost a
> lot in performance. So adding such a strategy would have a great effect on
> the type of aggregations and their performance which does affect the goal.
> Without adding the strategy, it is easy for whoever goes to the design
> document to not think about these cases. Furthermore, it might be decided
> that these cases are rare enough so that the strategy is still good enough
> but how would we know it without user feedback?
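
For what it's worth, a minimal sketch of the kind of buffer Assaf describes
(hypothetical, not tied to any actual Structured Streaming API) might look
like this:

    import java.time.{Duration, Instant}

    // Hypothetical aggregation buffer: keep each distinct value with the last event
    // time it was seen, so a sliding window is answered by filtering the buffer
    // rather than grouping rows by time.
    final case class DistinctLastSeen(lastSeen: Map[String, Instant] = Map.empty) {
      def add(value: String, eventTime: Instant): DistinctLastSeen =
        copy(lastSeen = lastSeen.updated(value, eventTime))

      // Distinct values observed within the window ending at `now`.
      def distinctWithin(window: Duration, now: Instant): Set[String] =
        lastSeen.collect { case (v, t) if !t.isBefore(now.minus(window)) => v }.toSet
    }

Whether carrying a map like that through a groupBy is affordable is exactly
the strategy question that is hard to judge from goals alone.
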
>
> I believe this example is exactly what Cody was talking about. Since many
> times implementation strategies have a large effect on the goal, we should
> have it discussed when discussing the goals. In addition, while it is often
> easy to throw out completely infeasible goals, it is often much harder to
> figure out that the goals are unfeasible without fine tuning.
>
>
>
>
>
> Assaf.
>
>
>
> From: Cody Koeninger-2 [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]]
> Sent: Monday, October 10, 2016 2:25 AM
> To: Mendelson, Assaf
> Subject: Re: Spark Improvement Proposals
>
>
>
> Only committers should formally submit SIPs because in an apache
> project only commiters have explicit political power.  If a user can't
> find a commiter willing to sponsor an SIP idea, they have no way to
> get the idea passed in any case.  If I can't find a committer to
> sponsor this meta-SIP idea, I'm out of luck.
>
> I do not believe unrealistic goals can be found solely by inspection.
> We've managed to ignore unrealistic goals even after implementation!
> Focusing on APIs can allow people to think they've solved something,
> when there's really no way of implementing that API while meeting the
> goals.  Rapid iteration is clearly the best way to address this, but
> we've already talked about why that hasn't really worked.  If adding a
> non-binding API section to the template is important to you, I'm not
> against it, but I don't think it's sufficient.
>
> On your PRD vs design doc spectrum, I'm saying this is closer to a
> PRD.  Clear agreement on goals is the most important thing and that's
> why it's the thing I want binding agreement on.  But I cannot agree to
> goals unless I have enough minimal technical info to judge whether the
> goals are likely to actually be accomplished.
>
>
>
> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>
>
>> Well, I think there are a few things here that do

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Only committers should formally submit SIPs because in an Apache
project only committers have explicit political power.  If a user can't
find a committer willing to sponsor an SIP idea, they have no way to
get the idea passed in any case.  If I can't find a committer to
sponsor this meta-SIP idea, I'm out of luck.

I do not believe unrealistic goals can be found solely by inspection.
We've managed to ignore unrealistic goals even after implementation!
Focusing on APIs can allow people to think they've solved something,
when there's really no way of implementing that API while meeting the
goals.  Rapid iteration is clearly the best way to address this, but
we've already talked about why that hasn't really worked.  If adding a
non-binding API section to the template is important to you, I'm not
against it, but I don't think it's sufficient.

On your PRD vs design doc spectrum, I'm saying this is closer to a
PRD.  Clear agreement on goals is the most important thing and that's
why it's the thing I want binding agreement on.  But I cannot agree to
goals unless I have enough minimal technical info to judge whether the
goals are likely to actually be accomplished.



On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Well, I think there are a few things here that don't make sense. First, why
> should only committers submit SIPs? Development in the project should be
> open to all contributors, whether they're committers or not. Second, I think
> unrealistic goals can be found just by inspecting the goals, and I'm not
> super worried that we'll accept a lot of SIPs that are then infeasible -- we
> can then submit new ones. But this depends on whether you want this process
> to be a "design doc lite", where people also agree on implementation
> strategy, or just a way to agree on goals. This is what I asked earlier
> about PRDs vs design docs (and I'm open to either one but I'd just like
> clarity). Finally, both as a user and designer of software, I always want to
> give feedback on APIs, so I'd really like a culture of having those early.
> People don't argue about prettiness when they discuss APIs, they argue about
> the core concepts to expose in order to meet various goals, and then they're
> stuck maintaining those for a long time.
>
> Matei
>
> On Oct 9, 2016, at 3:10 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
> Users instead of people, sure.  Commiters and contributors are (or at least
> should be) a subset of users.
>
> Non goals, sure. I don't care what the name is, but we need to clearly say
> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>
> API, what I care most about is whether it allows me to accomplish the goals.
> Arguing about how ugly or pretty it is can be saved for design/
> implementation imho.
>
> Strategy, this is necessary because otherwise goals can be out of line with
> reality.  Don't propose goals you don't have at least some idea of how to
> implement.
>
> Rejected strategies, given that commiters are the only ones I'm saying
> should formally submit SPARKLIs or SIPs, if they put junk in a required
> section then slap them down for it and tell them to fix it.
>
>
> On Oct 9, 2016 4:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>>
>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>> but we should also clarify it in the writeup. In particular:
>>
>> - Goals needs to be about user-facing behavior ("people" is broad)
>>
>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>> one of these and say "Spark's developers have officially rejected X, which
>> our awesome system has".
>>
>> - For user-facing stuff, I think you need a section on API. Virtually all
>> other *IPs I've seen have that.
>>
>> - I'm still not sure why the strategy section is needed if the purpose is
>> to define user-facing behavior -- unless this is the strategy for setting
>> the goals or for defining the API. That sounds squarely like a design doc
>> issue. In some sense, who cares whether the proposal is technically feasible
>> right now? If it's infeasible, that will be discovered later during design
>> and implementation. Same thing with rejected strategies -- listing some of
>> those is definitely useful sometimes, but if you make this a *required*
>> section, people are just going to fill it in with bogus stuff (I've seen
>> this happen before).
>>
>> Matei
>>
>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> >
>> > So to focus the discussion on the specific strategy I'm suggesting,
>> > documented at
>> 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Yeah, I've looked at KIPs and Scala SIPs.

I'm reluctant to use the Kafka structured streaming as an example
because of the pre-existing conflict around it.  If Michael or another
committer wanted to put it forth as an example, I'd participate in
good faith though.

On Sun, Oct 9, 2016 at 5:07 PM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> This is a great discussion!
> Maybe you could have a look at Kafka's process - it also uses Rejected
> Alternatives and I personally find it very clear actually (the link also
> leads to all KIPs):
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> Cody - maybe you could take one of the open issues and write a sample
> proposal? A concrete example might make it clearer for those who see this
> for the first time. Maybe the Kafka offset discussion or some other
> Kafka/Structured Streaming open issue? Will that be helpful?
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>
> On Mon, Oct 10, 2016 at 12:36 AM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>>
>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>> but we should also clarify it in the writeup. In particular:
>>
>> - Goals needs to be about user-facing behavior ("people" is broad)
>>
>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>> one of these and say "Spark's developers have officially rejected X, which
>> our awesome system has".
>>
>> - For user-facing stuff, I think you need a section on API. Virtually all
>> other *IPs I've seen have that.
>>
>> - I'm still not sure why the strategy section is needed if the purpose is
>> to define user-facing behavior -- unless this is the strategy for setting
>> the goals or for defining the API. That sounds squarely like a design doc
>> issue. In some sense, who cares whether the proposal is technically feasible
>> right now? If it's infeasible, that will be discovered later during design
>> and implementation. Same thing with rejected strategies -- listing some of
>> those is definitely useful sometimes, but if you make this a *required*
>> section, people are just going to fill it in with bogus stuff (I've seen
>> this happen before).
>>
>> Matei
>>
>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> >
>> > So to focus the discussion on the specific strategy I'm suggesting,
>> > documented at
>> >
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > "Goals: What must this allow people to do, that they can't currently?"
>> >
>> > Is it unclear that this is focusing specifically on people-visible
>> > behavior?
>> >
>> > Rejected goals -  are important because otherwise people keep trying
>> > to argue about scope.  Of course you can change things later with a
>> > different SIP and different vote, the point is to focus.
>> >
>> > Use cases - are something that people are going to bring up in
>> > discussion.  If they aren't clearly documented as a goal ("This must
>> > allow me to connect using SSL"), they should be added.
>> >
>> > Internal architecture - if the people who need specific behavior are
>> > implementers of other parts of the system, that's fine.
>> >
>> > Rejected strategies - If you have none of these, you have no evidence
>> > that the proponent didn't just go with the first thing they had in
>> > mind (or have already implemented), which is a big problem currently.
>> > Approval isn't binding as to specifics of implementation, so these
>> > aren't handcuffs.  The goals are the contract, the strategy is
>> > evidence that contract can actually be met.
>> >
>> > Design docs - I'm not touching design docs.  The markdown file I
>> > linked specifically says of the strategy section "This is not a full
>> > design document."  Is this unclear?  Design docs can be worked on
>> > obviously, but that's not what I'm concerned with here.
>> >
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > wrote:
>> >> Hi Cody,
>> >>
>> >> I think this would be a lot more concrete if we had a more detailed
>> >> template
>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
>>

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Users instead of people, sure.  Committers and contributors are (or at least
should be) a subset of users.

Non goals, sure. I don't care what the name is, but we need to clearly say
e.g. 'no we are not maintaining compatibility with XYZ right now'.

API, what I care most about is whether it allows me to accomplish the
goals. Arguing about how ugly or pretty it is can be saved for design/
implementation imho.

Strategy, this is necessary because otherwise goals can be out of line with
reality.  Don't propose goals you don't have at least some idea of how to
implement.

Rejected strategies, given that committers are the only ones I'm saying
should formally submit SPARKLIs or SIPs, if they put junk in a required
section then slap them down for it and tell them to fix it.

On Oct 9, 2016 4:36 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:

> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
> but we should also clarify it in the writeup. In particular:
>
> - Goals needs to be about user-facing behavior ("people" is broad)
>
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
> one of these and say "Spark's developers have officially rejected X, which
> our awesome system has".
>
> - For user-facing stuff, I think you need a section on API. Virtually all
> other *IPs I've seen have that.
>
> - I'm still not sure why the strategy section is needed if the purpose is
> to define user-facing behavior -- unless this is the strategy for setting
> the goals or for defining the API. That sounds squarely like a design doc
> issue. In some sense, who cares whether the proposal is technically
> feasible right now? If it's infeasible, that will be discovered later
> during design and implementation. Same thing with rejected strategies --
> listing some of those is definitely useful sometimes, but if you make this
> a *required* section, people are just going to fill it in with bogus stuff
> (I've seen this happen before).
>
> Matei
>
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible
> behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with here.
> >
> >
> >
> >
> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
> >> Hi Cody,
> >>
> >> I think this would be a lot more concrete if we had a more detailed
> template
> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
> they
> >> a way to solicit feedback on the user-facing behavior or on the
> internals?
> >> "Goals" can cover both things. I've been thinking of SIPs more as
> Product
> >> Requirements Docs (PRDs), which focus on *what* a code change should do
> as
> >> opposed to how.
> >>
> >> In particular, here are some things that you may or may not consider in
> >> scope for SIPs:
> >>
> >> - Goals and non-goals: This is definitely in scope, and IMO should
> focus on
> >> user-visible behavior (e.g. "system 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Regarding name, if the SIP overlap is a concern, we can pick a different name.
My tongue-in-cheek suggestion would be
Spark Lightweight Improvement process (SPARKLI)

On Sun, Oct 9, 2016 at 4:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
> So to focus the discussion on the specific strategy I'm suggesting,
> documented at
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> "Goals: What must this allow people to do, that they can't currently?"
>
> Is it unclear that this is focusing specifically on people-visible behavior?
>
> Rejected goals -  are important because otherwise people keep trying
> to argue about scope.  Of course you can change things later with a
> different SIP and different vote, the point is to focus.
>
> Use cases - are something that people are going to bring up in
> discussion.  If they aren't clearly documented as a goal ("This must
> allow me to connect using SSL"), they should be added.
>
> Internal architecture - if the people who need specific behavior are
> implementers of other parts of the system, that's fine.
>
> Rejected strategies - If you have none of these, you have no evidence
> that the proponent didn't just go with the first thing they had in
> mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these
> aren't handcuffs.  The goals are the contract, the strategy is
> evidence that contract can actually be met.
>
> Design docs - I'm not touching design docs.  The markdown file I
> linked specifically says of the strategy section "This is not a full
> design document."  Is this unclear?  Design docs can be worked on
> obviously, but that's not what I'm concerned with here.
>
>
>
>
> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Hi Cody,
>>
>> I think this would be a lot more concrete if we had a more detailed template
>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are  they
>> a way to solicit feedback on the user-facing behavior or on the internals?
>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>> opposed to how.
>>
>> In particular, here are some things that you may or may not consider in
>> scope for SIPs:
>>
>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>> user-visible behavior (e.g. "system supports SQL window functions" or
>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>> goals" because some of them might become goals later, so we're not
>> definitively rejecting them.
>>
>> - Public API: Probably should be included in most SIPs unless it's too large
>> to fully specify then (e.g. "let's add an ML library").
>>
>> - Use cases: I usually find this very useful in PRDs to better communicate
>> the goals.
>>
>> - Internal architecture: This is usually *not* a thing users can easily
>> comment on and it sounds more like a design doc item. Of course it's
>> important to show that the SIP is feasible to implement. One exception,
>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>> if somebody wants to refactor Spark's query optimizer or something).
>>
>> - Rejected strategies: I personally wouldn't put this, because what's the
>> point of voting to reject a strategy before you've really begun designing
>> and implementing something? What if you discover that the strategy is
>> actually better when you start doing stuff?
>>
>> At a super high level, it depends on whether you want the SIPs to be PRDs
>> for getting some quick feedback on the goals of a feature before it is
>> designed, or something more like full-fledged design docs (just a more
>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>> actually seem to be more like design docs. This can work too but it does
>> require more work from the proposer and it can lead to the same problems you
>> mentioned with people already having a design and implementation in mind.
>>
>> Basically, the question is, are you trying to iterate faster on design by
>> adding a step for user feedback earlier? Or are you just trying to make
>> design docs for key features more visible (and their approval more formal)?
>>
>> BTW note that in either case, I'd like to have a template for design docs
>> too, which should also include goals. I think th

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
So to focus the discussion on the specific strategy I'm suggesting,
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

"Goals: What must this allow people to do, that they can't currently?"

Is it unclear that this is focusing specifically on people-visible behavior?

Rejected goals -  are important because otherwise people keep trying
to argue about scope.  Of course you can change things later with a
different SIP and different vote, the point is to focus.

Use cases - are something that people are going to bring up in
discussion.  If they aren't clearly documented as a goal ("This must
allow me to connect using SSL"), they should be added.

Internal architecture - if the people who need specific behavior are
implementers of other parts of the system, that's fine.

Rejected strategies - If you have none of these, you have no evidence
that the proponent didn't just go with the first thing they had in
mind (or have already implemented), which is a big problem currently.
Approval isn't binding as to specifics of implementation, so these
aren't handcuffs.  The goals are the contract, the strategy is
evidence that contract can actually be met.

Design docs - I'm not touching design docs.  The markdown file I
linked specifically says of the strategy section "This is not a full
design document."  Is this unclear?  Design docs can be worked on
obviously, but that's not what I'm concerned with here.




On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed template
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are  they
> a way to solicit feedback on the user-facing behavior or on the internals?
> "Goals" can cover both things. I've been thinking of SIPs more as Product
> Requirements Docs (PRDs), which focus on *what* a code change should do as
> opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus on
> user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too large
> to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you discover that the strategy is
> actually better when you start doing stuff?
>
> At a super high level, it depends on whether you want the SIPs to be PRDs
> for getting some quick feedback on the goals of a feature before it is
> designed, or something more like full-fledged design docs (just a more
> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> actually seem to be more like design docs. This can work too but it does
> require more work from the proposer and it can lead to the same problems you
> mentioned with people already having a design and implementation in mind.
>
> Basically, the question is, are you trying to iterate faster on design by
> adding a step for user feedback earlier? Or are you just trying to make
> design docs for key features more visible (and their approval more formal)?
>
> BTW note that in either case, I'd like to have a template for design docs
> too, which should also include goals. I think that would've avoided some of
> the issues you brought up.
>
> Matei
>
> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
> Here's my specific proposal (meta-proposal?)
>
> Spark Improvement Proposals (SIP)
>
>
> Background:
>
> The current problem is that design and implementation of large features are
> often done in private, before soliciting user feedback.
>
> When feedback is solicited, it is often as to detailed design specifics, not
> focused on goals.
>
> When implementation does take place aft

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
>>> continues working if one node fails"). BTW I wouldn't say "rejected
>>> goals" because some of them might become goals later, so we're not
>>> definitively rejecting them.
>>>
>>> - Public API: Probably should be included in most SIPs unless it's too
>>> large to fully specify then (e.g. "let's add an ML library").
>>>
>>> - Use cases: I usually find this very useful in PRDs to better
>>> communicate the goals.
>>>
>>> - Internal architecture: This is usually *not* a thing users can easily
>>> comment on and it sounds more like a design doc item. Of course it's
>>> important to show that the SIP is feasible to implement. One exception,
>>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>>> if somebody wants to refactor Spark's query optimizer or something).
>>>
>>> - Rejected strategies: I personally wouldn't put this, because what's
>>> the point of voting to reject a strategy before you've really begun
>>> designing and implementing something? What if you discover that the
>>> strategy is actually better when you start doing stuff?
>>>
>>> At a super high level, it depends on whether you want the SIPs to be
>>> PRDs for getting some quick feedback on the goals of a feature before it is
>>> designed, or something more like full-fledged design docs (just a more
>>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>>> actually seem to be more like design docs. This can work too but it does
>>> require more work from the proposer and it can lead to the same problems
>>> you mentioned with people already having a design and implementation in
>>> mind.
>>>
>>> Basically, the question is, are you trying to iterate faster on design
>>> by adding a step for user feedback earlier? Or are you just trying to make
>>> design docs for key features more visible (and their approval more formal)?
>>>
>>> BTW note that in either case, I'd like to have a template for design
>>> docs too, which should also include goals. I think that would've avoided
>>> some of the issues you brought up.
>>>
>>> Matei
>>>
>>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Here's my specific proposal (meta-proposal?)
>>>
>>> Spark Improvement Proposals (SIP)
>>>
>>>
>>> Background:
>>>
>>> The current problem is that design and implementation of large features
>>> are often done in private, before soliciting user feedback.
>>>
>>> When feedback is solicited, it is often as to detailed design specifics,
>>> not focused on goals.
>>>
>>> When implementation does take place after design, there is often
>>> disagreement as to what goals are or are not in scope.
>>>
>>> This results in commits that don't fully meet user needs.
>>>
>>>
>>> Goals:
>>>
>>> - Ensure user, contributor, and committer goals are clearly identified
>>> and agreed upon, before implementation takes place.
>>>
>>> - Ensure that a technically feasible strategy is chosen that is likely
>>> to meet the goals.
>>>
>>>
>>> Rejected Goals:
>>>
>>> - SIPs are not for detailed design.  Design by committee doesn't work.
>>>
>>> - SIPs are not for every change.  We dont need that much process.
>>>
>>>
>>> Strategy:
>>>
>>> My suggestion is outlined as a Spark Improvement Proposal process
>>> documented at
>>>
>>> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>
>>> Specifics of Jira manipulation are an implementation detail we can
>>> figure out.
>>>
>>> I'm suggesting voting; the need here is for a _clear_ outcome.
>>>
>>>
>>> Rejected Strategies:
>>>
>>> Having someone who understands the problem implement it first works, but
>>> only if significant iteration after user feedback is allowed.
>>>
>>> Historically this has been problematic due to pressure to limit public
>>> api changes.
>>>
>>> On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Alright looks like there are quite a bit of support. We should wait to
>>>> hear from more people too.
>>>>
>>>> To 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Here's my specific proposal (meta-proposal?)

Spark Improvement Proposals (SIP)


Background:

The current problem is that design and implementation of large features are
often done in private, before soliciting user feedback.

When feedback is solicited, it is often as to detailed design specifics,
not focused on goals.

When implementation does take place after design, there is often
disagreement as to what goals are or are not in scope.

This results in commits that don't fully meet user needs.


Goals:

- Ensure user, contributor, and committer goals are clearly identified and
agreed upon, before implementation takes place.

- Ensure that a technically feasible strategy is chosen that is likely to
meet the goals.


Rejected Goals:

- SIPs are not for detailed design.  Design by committee doesn't work.

- SIPs are not for every change.  We don't need that much process.


Strategy:

My suggestion is outlined as a Spark Improvement Proposal process
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Specifics of Jira manipulation are an implementation detail we can figure
out.

I'm suggesting voting; the need here is for a _clear_ outcome.


Rejected Strategies:

Having someone who understands the problem implement it first works, but
only if significant iteration after user feedback is allowed.

Historically this has been problematic due to pressure to limit public api
changes.

On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin <r...@databricks.com> wrote:

> Alright looks like there are quite a bit of support. We should wait to
> hear from more people too.
>
> To push this forward, Cody and I will be working together in the next
> couple of weeks to come up with a concrete, detailed proposal on what this
> entails, and then we can discuss this the specific proposal as well.
>
>
> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> user-facing or cross-cutting changes, not minor feature adds.
>>
>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> +1 to the SIP label as long as it does not slow down things and it
>>> targets optimizing efforts, coordination etc. For example really small
>>> features should not need to go through this process (assuming they dont
>>> touch public interfaces)  or re-factorings and hope it will be kept this
>>> way. So as a guideline doc should be provided, like in the KIP case.
>>>
>>> IMHO so far aside from tagging things and linking them elsewhere simply
>>> having design docs and prototypes implementations in PRs is not something
>>> that has not worked so far. What is really a pain in many projects out
>>> there is discontinuity in progress of PRs, missing features, slow reviews
>>> which is understandable to some extent... it is not only about Spark but
>>> things can be improved for sure for this project in particular as already
>>> stated.
>>>
>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> +1 to adding an SIP label and linking it from the website.  I think it
>>>> needs
>>>>
>>>> - template that focuses it towards soliciting user goals / non goals
>>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>>>> recommend a vote.
>>>>
>>>> Matei asked me to clarify what I meant by changing interfaces, I think
>>>> it's directly relevant to the SIP idea so I'll clarify here, and split
>>>> a thread for the other discussion per Nicholas' request.
>>>>
>>>> I meant changing public user interfaces.  I think the first design is
>>>> unlikely to be right, because it's done at a time when you have the
>>>> least information.  As a user, I find it considerably more frustrating
>>>> to be unable to use a tool to get my job done, than I do having to
>>>> make minor changes to my code in order to take advantage of features.
>>>> I've seen committers be seriously reluctant to allow changes to
>>>> @experimental code that are needed in order for it to really work
>>>> right.  You need to be able to iterate, and if people on both sides of
>>>> the fence aren't going to respect that some newer apis are subject to
>>>> change, then why even mark them as such?
>>>>
>>>> Ideally a finished SIP should give me a checklist of things that an
>>>> implementation must do, and things that it doesn't need to do.
>>

Re: PSA: JIRA resolutions and meanings

2016-10-09 Thread Cody Koeninger
That's awesome Sean, very clear.

One minor thing: non-committers can't change the Assignee field, as far as I know.

On Oct 9, 2016 3:40 AM, "Sean Owen"  wrote:

I added a variant on this text to https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingtoJIRAMaintenance


On Sat, Oct 8, 2016 at 10:09 AM Sean Owen  wrote:

> That flood of emails means several people (Xiao, Holden mostly AFAICT)
> have been updating the status of old JIRAs. Thank you, I think that really
> does help.
>
> I have a suggested set of conventions I've been using, just to bring some
> order to the resolutions. It helps because JIRA functions as a huge archive
> of decisions and the more accurately we can record that the better. What do
> people think of this?
>
> - Resolve as Fixed if there's a change you can point to that resolved the
> issue
> - If the issue is a proper subset of another issue, mark it a Duplicate of
> that issue (rather than the other way around)
> - If it's probably resolved, but not obvious what fixed it or when, then
> Cannot Reproduce or Not a Problem
> - Obsolete issue? Not a Problem
> - If it's a coherent issue but does not seem like there is support or
> interest in acting on it, then Won't Fix
> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
> - I tend to mark Umbrellas as "Done" when done if they're just containers
> - Try to set Fix version
> - Try to set Assignee to the person who most contributed to the
> resolution. Usually the person who opened the PR. Strong preference for
> ties going to the more 'junior' contributor
>
> The only ones I think are sort of important are getting the Duplicate
> pointers right, and possibly making sure that Fixed issues have a clear
> path to finding what change fixed it and when. The rest doesn't matter much.
>
>


Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
It's not about technical design disagreement as to matters of taste,
it's about familiarity with the domain.  To make an analogy, it's as
if a committer in MLlib was firmly intent on, I dunno, treating a
collection of categorical variables as if it were an ordered range of
continuous variables.  It's just wrong.  That kind of thing, to a
greater or lesser degree, has been going on related to the Kafka
modules, for years.

On Sat, Oct 8, 2016 at 4:11 PM, Matei Zaharia  wrote:
> This makes a lot of sense; just to comment on a few things:
>
>> - More committers
>> Just looking at the ratio of committers to open tickets, or committers
>> to contributors, I don't think you have enough human power.
>> I realize this is a touchy issue.  I don't have dog in this fight,
>> because I'm not on either coast nor in a big company that views
>> committership as a political thing.  I just think you need more people
>> to do the work, and more diversity of viewpoint.
>> It's unfortunate that the Apache governance process involves giving
>> someone all the keys or none of the keys, but until someone really
>> starts screwing up, I think it's better to err on the side of
>> accepting hard-working people.
>
> This is something the PMC is actively discussing. Historically, we've added 
> committers when people contributed a new module or feature, basically to the 
> point where other developers are asking them to review changes in that area 
> (https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-BecomingaCommitter).
>  For example, we added the original authors of GraphX when we merged in 
> GraphX, the authors of new ML algorithms, etc. However, there's a good 
> argument that some areas are simply not covered well now and we should add 
> people there. Also, as the project has grown, there are also more people who 
> focus on smaller fixes and are nonetheless contributing a lot.
>
>> - Each major area of the code needs at least one person who cares
>> about it that is empowered with a vote, otherwise decisions get made
>> that don't make technical sense.
>> I don't know if anyone with a vote is shepherding GraphX (or maybe
>> it's just dead), the Mesos relationship has always been weird, no one
>> with a vote really groks Kafka.
>> marmbrus and zsxwing are getting there quickly on the Kafka side, and
>> I appreciate it, but it's been bad for a while.
>> Because I don't have any political power, my response to seeing things
>> that I know are technically dangerous has been to yell really loud
>> until someone listens, which sucks for everyone involved.
>> I already apologized to Michael privately; Ryan, I'm sorry, it's not about 
>> you.
>> This seems pretty straightforward to fix, if politically awkward:
>> those people exist, just give them a vote.
>> Failing that, listen the first or second time they say something not
>> the third or fourth, and if it doesn't make sense, ask.
>
> Just as a note here -- it's true that some areas are not super well covered, 
> but I also hope to avoid a situation where people have to yell to be listened 
> to. I can't say anything about *all* technical discussions we've ever had, 
> but historically, people have been able to comment on the design of many 
> things without yelling. This is actually important because a culture of 
> having to yell can drive away contributors. So it's awesome that you yelled 
> about the Kafka source stuff, but at the same time, hopefully we make these 
> types of things work without yelling. This would be a problem even if there 
> were committers with more expertise in each area -- what if someone disagrees 
> with the committers?
>
> Matei
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
Cool, I'll start going through stuff as I have time.  Already closed
one, if anyone sees a problem let me know.

Still think it would be nice to have some way to make it obvious to
the people who have the will and knowledge to do it that it's ok for
them to do it :)

On Sat, Oct 8, 2016 at 2:19 PM, Reynold Xin <r...@databricks.com> wrote:
> I think so (at least I think it is socially acceptable). Of course, use good
> judgement here :)
>
>
>
> On Sat, Oct 8, 2016 at 12:06 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> So to be clear, can I go clean up the Kafka cruft?
>>
>> On Sat, Oct 8, 2016 at 1:41 PM, Reynold Xin <r...@databricks.com> wrote:
>> >
>> > On Sat, Oct 8, 2016 at 2:09 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >>
>> >> - Resolve as Fixed if there's a change you can point to that resolved
>> >> the
>> >> issue
>> >> - If the issue is a proper subset of another issue, mark it a Duplicate
>> >> of
>> >> that issue (rather than the other way around)
>> >> - If it's probably resolved, but not obvious what fixed it or when,
>> >> then
>> >> Cannot Reproduce or Not a Problem
>> >> - Obsolete issue? Not a Problem
>> >> - If it's a coherent issue but does not seem like there is support or
>> >> interest in acting on it, then Won't Fix
>> >> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
>> >> - I tend to mark Umbrellas as "Done" when done if they're just
>> >> containers
>> >> - Try to set Fix version
>> >> - Try to set Assignee to the person who most contributed to the
>> >> resolution. Usually the person who opened the PR. Strong preference for
>> >> ties
>> >> going to the more 'junior' contributor
>> >
>> >
>> > +1
>> >
>> > This is consistent with my understanding. It would be good to document
>> > these
>> > on JIRA. And I second "The only ones I think are sort of important are
>> > getting the Duplicate pointers right, and possibly making sure that
>> > Fixed
>> > issues have a clear path to finding what change fixed it and when. The
>> > rest
>> > doesn't matter much."
>> >
>> > I also think it is a good idea to give people rights to close tickets to
>> > help with JIRA maintenance. We can always revoke that if we see a
>> > malicious
>> > actor (or somebody with extremely bad judgement), but we are pretty far
>> > away
>> > from that right now.
>> >
>> >
>> >
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
So to be clear, can I go clean up the Kafka cruft?

On Sat, Oct 8, 2016 at 1:41 PM, Reynold Xin  wrote:
>
> On Sat, Oct 8, 2016 at 2:09 AM, Sean Owen  wrote:
>>
>>
>> - Resolve as Fixed if there's a change you can point to that resolved the
>> issue
>> - If the issue is a proper subset of another issue, mark it a Duplicate of
>> that issue (rather than the other way around)
>> - If it's probably resolved, but not obvious what fixed it or when, then
>> Cannot Reproduce or Not a Problem
>> - Obsolete issue? Not a Problem
>> - If it's a coherent issue but does not seem like there is support or
>> interest in acting on it, then Won't Fix
>> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
>> - I tend to mark Umbrellas as "Done" when done if they're just containers
>> - Try to set Fix version
>> - Try to set Assignee to the person who most contributed to the
>> resolution. Usually the person who opened the PR. Strong preference for ties
>> going to the more 'junior' contributor
>
>
> +1
>
> This is consistent with my understanding. It would be good to document these
> on JIRA. And I second "The only ones I think are sort of important are
> getting the Duplicate pointers right, and possibly making sure that Fixed
> issues have a clear path to finding what change fixed it and when. The rest
> doesn't matter much."
>
> I also think it is a good idea to give people rights to close tickets to
> help with JIRA maintenance. We can always revoke that if we see a malicious
> actor (or somebody with extremely bad judgement), but we are pretty far away
> from that right now.
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-08 Thread Cody Koeninger
Yeah, I've interacted with other projects that used that system and it was
pleasant.

1. "this is getting closed cause its stale, let us know if thats a problem"
2. "actually that matters to us"
3. "ok well leave it open"

I'd be fine with totally automating step 1 as long as a human was involved
at step 2 and 3


On Saturday, October 8, 2016, assaf.mendelson <assaf.mendel...@rsa.com>
wrote:

> I don’t really have much experience with large open source projects but I
> have some experience with having lots of issues with no one handling them.
> Automation proved a good solution in my experience, but one thing that I
> found which was really important is giving people a chance to say “don’t
> close this please”.
>
> Basically, because closing you can send an email to the reporter (and
> probably people who are watching the issue) and tell them this is going to
> be closed. Allow them an option to ping back saying “don’t close this
> please” which would ping committers for input (as if there were 5+ votes as
> described by Nick).
>
> The main reason for this is that many times people fine solutions and the
> issue does become stale but at other times, the issue is still important,
> it is just that no one noticed it because of the noise of other issues.
>
> Thanks,
>
> Assaf.
>
>
>
>
>
>
>
> *From:* Nicholas Chammas [via Apache Spark Developers List]
> *Sent:* Saturday, October 08, 2016 12:42 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Improving volunteer management / JIRAs (split from Spark
> Improvement Proposals thread)
>
>
>
> I agree with Cody and others that we need some automation — or at least an
> adjusted process — to help us manage organic contributions better.
>
> The objections about automated closing being potentially abrasive are
> understood, but I wouldn’t accept that as a defeat for automation. Instead,
> it seems like a constraint we should impose on any proposed solution: Make
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut
> it, and I don’t think adding committers will ever be a sufficient solution
> to this particular problem.
>
> To me, it seems like we need a way to filter out viable contributions with
> community support from other contributions when it comes to deciding that
> automated action is appropriate. Our current tooling isn’t perfect, but
> perhaps we can leverage it to create such a filter.
>
> For example, consider the following strawman proposal for how to cut down
> on the number of pending but unviable proposals, and simultaneously help
> contributors organize to promote viable proposals and get the attention of
> committers:
>
> 1.  Have a bot scan for *stale* JIRA issues and PRs—i.e. they haven’t
> been updated in 20+ days (or D+ days, if you prefer).
>
> 2.  Depending on the level of community support, either close the
> item or ping specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
> votes (or V+ votes), ping committers for input. (For PRs, you could count
> comments from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> than V votes, close it with a gentle message asking the contributor to
> solicit support from either the community or a committer, and try again
> later.
> c. If the JIRA/PR has input from a committer or committers, ping them for
> an update.
>
> This is just a rough idea. The point is that when contributors have stale
> proposals that they don’t close, committers need to take action. A little
> automation to selectively bring contributions to the attention of
> committers can perhaps help them manage the backlog of stale contributions.
> The “selective” part is implemented in this strawman proposal by using JIRA
> votes as a crude proxy for when the community is interested in something,
> but it could be anything.
>
> Also, this doesn’t have to be used just to clear out stale proposals. Once
> the initial backlog is trimmed down, you could set D to 5 days and use
> this as a regular way to bring contributions to the attention of committers.
>
> I dunno if people think this is perhaps too complex, but at our scale I
> feel we need some kind of loose but automated system for funneling
> contributions through some kind of lifecycle. The status quo is just not
> that good (e.g. 474 open PRs <https://github.com/apache/spark/pulls>
> against Spark as of this moment).
>
> Nick
>
> ​
>
>
>
> O

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Cody Koeninger
That makes sense, thanks.

One thing I've never been clear on is who should be allowed to resolve
Jiras.  Can I go clean up the backlog of Kafka Jiras that weren't created
by me?

If there's an informal policy here, can we update the wiki to reflect it?
Maybe it's there already, but I didn't see it last time I looked.

On Oct 8, 2016 4:10 AM, "Sean Owen"  wrote:

That flood of emails means several people (Xiao, Holden mostly AFAICT) have
been updating the status of old JIRAs. Thank you, I think that really does
help.

I have a suggested set of conventions I've been using, just to bring some
order to the resolutions. It helps because JIRA functions as a huge archive
of decisions and the more accurately we can record that the better. What do
people think of this?

- Resolve as Fixed if there's a change you can point to that resolved the
issue
- If the issue is a proper subset of another issue, mark it a Duplicate of
that issue (rather than the other way around)
- If it's probably resolved, but not obvious what fixed it or when, then
Cannot Reproduce or Not a Problem
- Obsolete issue? Not a Problem
- If it's a coherent issue but does not seem like there is support or
interest in acting on it, then Won't Fix
- If the issue doesn't make sense (non-Spark issue, etc) then Invalid
- I tend to mark Umbrellas as "Done" when done if they're just containers
- Try to set Fix version
- Try to set Assignee to the person who most contributed to the resolution.
Usually the person who opened the PR. Strong preference for ties going to
the more 'junior' contributor

The only ones I think are sort of important are getting the Duplicate
pointers right, and possibly making sure that Fixed issues have a clear
path to finding what change fixed it and when. The rest doesn't matter much.


Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
I really like the idea of using jira votes (and/or watchers?) as a filter!

On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> I agree with Cody and others that we need some automation — or at least an
> adjusted process — to help us manage organic contributions better.
>
> The objections about automated closing being potentially abrasive are
> understood, but I wouldn’t accept that as a defeat for automation. Instead,
> it seems like a constraint we should impose on any proposed solution: Make
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut it,
> and I don’t think adding committers will ever be a sufficient solution to
> this particular problem.
>
> To me, it seems like we need a way to filter out viable contributions with
> community support from other contributions when it comes to deciding that
> automated action is appropriate. Our current tooling isn’t perfect, but
> perhaps we can leverage it to create such a filter.
>
> For example, consider the following strawman proposal for how to cut down on
> the number of pending but unviable proposals, and simultaneously help
> contributors organize to promote viable proposals and get the attention of
> committers:
>
> Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been updated
> in 20+ days (or D+ days, if you prefer).
> Depending on the level of community support, either close the item or ping
> specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+ votes
> (or V+ votes), ping committers for input. (For PRs, you could count comments
> from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> than V votes, close it with a gentle message asking the contributor to
> solicit support from either the community or a committer, and try again
> later.
> c. If the JIRA/PR has input from a committer or committers, ping them for an
> update.
>
> This is just a rough idea. The point is that when contributors have stale
> proposals that they don’t close, committers need to take action. A little
> automation to selectively bring contributions to the attention of committers
> can perhaps help them manage the backlog of stale contributions. The
> “selective” part is implemented in this strawman proposal by using JIRA
> votes as a crude proxy for when the community is interested in something,
> but it could be anything.
>
> Also, this doesn’t have to be used just to clear out stale proposals. Once
> the initial backlog is trimmed down, you could set D to 5 days and use this
> as a regular way to bring contributions to the attention of committers.
>
> I dunno if people think this is perhaps too complex, but at our scale I feel
> we need some kind of loose but automated system for funneling contributions
> through some kind of lifecycle. The status quo is just not that good (e.g.
> 474 open PRs against Spark as of this moment).
>
> Nick
>
>
> On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Matei asked:
>>
>>
>> > I agree about empowering people interested here to contribute, but I'm
>> > wondering, do you think there are technical things that people don't want 
>> > to
>> > work on, or is it a matter of what there's been time to do?
>>
>>
>> It's a matter of mismanagement and miscommunication.
>>
>> The structured streaming kafka jira sat with multiple unanswered
>> requests for someone who was a committer to communicate whether they
>> were working on it and what the plan was.  I could have done that
>> implementation and had it in users' hands months ago.  I didn't
>> pre-emptively do it because I didn't want to then have to argue with
>> committers about why my code did or did not meet their uncommunicated
>> expectations.
>>
>>
>> I don't want to re-hash that particular circumstance, I just want to
>> make sure it never happens again.
>>
>>
>> Hopefully the SIP thread results in clearer expectations, but there
>> are still some ideas on the table regarding management of volunteer
>> contributions:
>>
>>
>> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
>> the alternative of "someone cleans it up" is not sufficient right now
>> (with apologies to Sean and all the other janitors).
>>
>> - Clear rejection of jiras.  This isn't mean, it's respectful.
>>
>> - Clear "I'm working on this", with clear removal and reassignment if
>> they go radio silent.  This could be keyed to auto

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Yeah, in case it wasn't clear, I was talking about SIPs for major
user-facing or cross-cutting changes, not minor feature adds.

On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1 to the SIP label as long as it does not slow down things and it targets
> optimizing efforts, coordination, etc. For example, really small features
> should not need to go through this process (assuming they don't touch public
> interfaces) or refactorings, and I hope it will be kept this way. So a
> guideline doc should be provided, like in the KIP case.
>
> IMHO so far, aside from tagging things and linking them elsewhere, simply
> having design docs and prototype implementations in PRs is not something
> that has not worked so far. What is really a pain in many projects out
> there is discontinuity in progress of PRs, missing features, slow reviews
> which is understandable to some extent... it is not only about Spark but
> things can be improved for sure for this project in particular as already
> stated.
>
> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> +1 to adding an SIP label and linking it from the website.  I think it
>> needs
>>
>> - template that focuses it towards soliciting user goals / non goals
>> - clear resolution as to which strategy was chosen to pursue.  I'd
>> recommend a vote.
>>
>> Matei asked me to clarify what I meant by changing interfaces, I think
>> it's directly relevant to the SIP idea so I'll clarify here, and split
>> a thread for the other discussion per Nicholas' request.
>>
>> I meant changing public user interfaces.  I think the first design is
>> unlikely to be right, because it's done at a time when you have the
>> least information.  As a user, I find it considerably more frustrating
>> to be unable to use a tool to get my job done, than I do having to
>> make minor changes to my code in order to take advantage of features.
>> I've seen committers be seriously reluctant to allow changes to
>> @experimental code that are needed in order for it to really work
>> right.  You need to be able to iterate, and if people on both sides of
>> the fence aren't going to respect that some newer apis are subject to
>> change, then why even mark them as such?
>>
>> Ideally a finished SIP should give me a checklist of things that an
>> implementation must do, and things that it doesn't need to do.
>> Contributors/committers should be seriously discouraged from putting
>> out a version 0.1 that doesn't have at least a prototype
>> implementation of all those things, especially if they're then going
>> to argue against interface changes necessary to get the rest of
>> the things done in the 0.2 version.
>>
>>
>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>> > I like the lightweight proposal to add a SIP label.
>> >
>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> > track the list of major changes, but that never really materialized due
>> to
>> > the overhead. Adding a SIP label on major JIRAs and then link to them
>> > prominently on the Spark website makes a lot of sense.
>> >
>> >
>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <matei.zaha...@gmail.com
>> >
>> > wrote:
>> >>
>> >> For the improvement proposals, I think one major point was to make them
>> >> really visible to users who are not contributors, so we should do more
>> than
>> >> sending stuff to dev@. One very lightweight idea is to have a new
>> type of
>> >> JIRA called a SIP and have a link to a filter that shows all such
>> JIRAs from
>> >> http://spark.apache.org. I also like the idea of SIP and design doc
>> >> templates (in fact many projects have them).
>> >>
>> >> Matei
>> >>
>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >> I called Cody last night and talked about some of the topics in his
>> email.
>> >> It became clear to me Cody genuinely cares about the project.
>> >>
>> >> Some of the frustrations come from the success of the project itself
>> >> becoming very "hot", and it is difficult to get clarity from people who
>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>> similar
>> >> to scaling an engineering team in a successful startup: old processes
>> that
>> >&g

Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
Matei asked:


> I agree about empowering people interested here to contribute, but I'm 
> wondering, do you think there are technical things that people don't want to 
> work on, or is it a matter of what there's been time to do?


It's a matter of mismanagement and miscommunication.

The structured streaming kafka jira sat with multiple unanswered
requests for someone who was a committer to communicate whether they
were working on it and what the plan was.  I could have done that
implementation and had it in users' hands months ago.  I didn't
pre-emptively do it because I didn't want to then have to argue with
committers about why my code did or did not meet their uncommunicated
expectations.


I don't want to re-hash that particular circumstance, I just want to
make sure it never happens again.


Hopefully the SIP thread results in clearer expectations, but there
are still some ideas on the table regarding management of volunteer
contributions:


- Closing stale jiras.  I hear the "bots are impersonal" argument, but
the alternative of "someone cleans it up" is not sufficient right now
(with apologies to Sean and all the other janitors).

- Clear rejection of jiras.  This isn't mean, it's respectful.

- Clear "I'm working on this", with clear removal and reassignment if
they go radio silent.  This could be keyed to an automated check for
staleness.

- Clear expectation that if someone is working on a jira, you can work
on your own alternative, but you need to communicate.


I'm sure I've missed some.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
Without a hell of a lot more work, Assign would be the only strategy usable.
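
For concreteness, this is roughly what Assign looks like in the kafka-0-10 DStream integration, which is the same idea; the topic, partitions, and offsets are made up, and ssc / kafkaParams are assumed to be set up already:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Explicit partition assignment: no broker-side group coordination, which is
// the only consumption model an 0.8-era simple consumer could also support.
val partitions = List(
  new TopicPartition("some_topic", 0),
  new TopicPartition("some_topic", 1))
val fromOffsets = Map(new TopicPartition("some_topic", 0) -> 0L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets))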

On Fri, Oct 7, 2016 at 3:25 PM, Michael Armbrust  wrote:
>> The implementation is totally and completely different however, in ways
>> that leak to the end user.
>
>
> Can you elaborate? Especially in the context of the interface provided by
> structured streaming.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
0.10 consumers won't work on an earlier broker.

Earlier consumers will (should?) work on a 0.10 broker.

The main things earlier consumers lack from a user perspective are
support for SSL and pre-fetching of messages.  The implementation is
totally and completely different however, in ways that leak to the end
user.
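
As a rough sketch of the SSL part, using standard 0.10 consumer config keys; the broker address, paths, and password below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  // None of this is available through the old 0.8 simple consumer API.
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/truststore.jks",
  "ssl.truststore.password" -> "changeit")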

On Fri, Oct 7, 2016 at 3:15 PM, Reynold Xin  wrote:
> Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster?
>
>
> On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith 
> wrote:
>>
>> +1
>>
>> We're on CDH, and it will probably be a while before they support Kafka
>> 0.10. At the same time, we don't use their Spark and we're looking forward
>> to upgrading to 2.0.x and using structured streaming.
>>
>> I was just going to write our own Kafka Source implementation which uses
>> the existing KafkaRDD but it would be much easier to get buy-in for an
>> official Spark module.
>>
>> Jeremy
>>
>> On Fri, Oct 7, 2016 at 12:41 PM, Michael Armbrust 
>> wrote:
>>>
>>> We recently merged support for Kafka 0.10.0 in Structured Streaming, but
>>> I've been hearing a few people tell me that they are stuck on an older
>>> version of Kafka and cannot upgrade.  I'm considering revisiting
>>> SPARK-17344, but it would be good to have more information.
>>>
>>> Could people please vote or comment on the above ticket if a lack of
>>> support for older versions of kafka would block you from trying out
>>> structured streaming?
>>>
>>> Thanks!
>>>
>>> Michael
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
+1 to adding an SIP label and linking it from the website.  I think it needs

- template that focuses it towards soliciting user goals / non goals
- clear resolution as to which strategy was chosen to pursue.  I'd
recommend a vote.

Matei asked me to clarify what I meant by changing interfaces, I think
it's directly relevant to the SIP idea so I'll clarify here, and split
a thread for the other discussion per Nicholas' request.

I meant changing public user interfaces.  I think the first design is
unlikely to be right, because it's done at a time when you have the
least information.  As a user, I find it considerably more frustrating
to be unable to use a tool to get my job done, than I do having to
make minor changes to my code in order to take advantage of features.
I've seen committers be seriously reluctant to allow changes to
@experimental code that are needed in order for it to really work
right.  You need to be able to iterate, and if people on both sides of
the fence aren't going to respect that some newer apis are subject to
change, then why even mark them as such?

Ideally a finished SIP should give me a checklist of things that an
implementation must do, and things that it doesn't need to do.
Contributors/committers should be seriously discouraged from putting
out a version 0.1 that doesn't have at least a prototype
implementation of all those things, especially if they're then going
to argue against interface changes necessary to get the rest of
the things done in the 0.2 version.


On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
> I like the lightweight proposal to add a SIP label.
>
> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
> track the list of major changes, but that never really materialized due to
> the overhead. Adding a SIP label on major JIRAs and then link to them
> prominently on the Spark website makes a lot of sense.
>
>
> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia 
> wrote:
>>
>> For the improvement proposals, I think one major point was to make them
>> really visible to users who are not contributors, so we should do more than
>> sending stuff to dev@. One very lightweight idea is to have a new type of
>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> http://spark.apache.org. I also like the idea of SIP and design doc
>> templates (in fact many projects have them).
>>
>> Matei
>>
>> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>>
>> I called Cody last night and talked about some of the topics in his email.
>> It became clear to me Cody genuinely cares about the project.
>>
>> Some of the frustrations come from the success of the project itself
>> becoming very "hot", and it is difficult to get clarity from people who
>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>> to scaling an engineering team in a successful startup: old processes that
>> worked well might not work so well when it gets to a certain size, cultures
>> can get diluted, building culture vs building process, etc.
>>
>> I would also really like to have a more visible process for larger changes,
>> especially major user-facing API changes. Historically we upload design docs
>> for major changes, but it is not always consistent and it is difficult to
>> ensure the quality of the docs, due to the volunteering nature of the organization.
>>
>> Some of the more concrete ideas we discussed focus on building a culture
>> to improve clarity:
>>
>> - Process: Large changes should have design docs posted on JIRA. One thing
>> Cody and I didn't discuss but an idea that just came to me is we should
>> create a design doc template for the project and ask everybody to follow.
>> The design doc template should also explicitly list goals and non-goals, to
>> make design doc more consistent.
>>
>> - Process: Email dev@ to solicit feedback. We have done this with some
>> changes, but again very inconsistently. Just posting something on JIRA isn't
>> sufficient, because there are simply too many JIRAs and the signal gets lost
>> in the noise. While this is generally impossible to enforce because we can't
>> force all volunteers to conform to a process (or they might not even be
>> aware of this),  those who are more familiar with the project can help by
>> emailing the dev@ when they see something that hasn't been.
>>
>> - Culture: The design doc author(s) should be open to feedback. A design
>> doc should serve as the base for discussion and is by no means the final
>> design. Of course, this does not mean the author has to accept every
>> feedback. They should also be comfortable accepting / rejecting ideas on
>> technical grounds.
>>
>> - Process / Culture: For major ongoing projects, it can be useful to have
>> some monthly Google hangouts that are open to the world. I am actually not
>> sure how well this will work, because of the volunteering nature and we need
>> to adjust 

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Sean, that was very eloquently put, and I 100% agree.  If I ever meet
you in person, I'll buy you multiple rounds of beverages of your
choice ;)
This is probably reiterating some of what you said in a less clear
manner, but I'll throw more of my 2 cents in.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
have shown towards automating their own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.


On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen  wrote:
> Suggestion actions way at the bottom.
>
> On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia 
> wrote:
>>
>> since March. But it's true that other things such as the Kafka source for
>> it didn't have as much design on JIRA. Nonetheless, this component is still
>> early on and there's still a lot of time to change it, which is happening.
>
>
> It's hard to drive design discussions in OSS. Even when diligently
> publishing design docs, the doc happens after brainstorming, and that
> happens inside someone's head or in chats.
>
> The lazy consensus model that works for small changes doesn't work well
> here. If a committer wants a change, that change will basically be made
> modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> nothing done.) However this model means it's hard to significantly change a
> design after draft 1.
>
> I've heard this complaint a few times, and it has never been down to bad
> faith. We should err further towards over-including early and often. I've
> seen some great discussions start more with a problem statement and an RFC,
> not a design doc. Keeping regular contributors enfranchised is essential, so
> that they're willing and able to participate when design time comes. (See
> below.)
>
>
>>
>> 2) About what people say at Reactive Summit -- there will always be
>> trolls, but just ignore them and build a great project. Those of us involved
>> in the project for a while have long seen similar stuff, e.g. a
>
>
> The hype cycle may be turning against Spark, as is normal for this stage of
> maturity. People idealize technologies they don't really use as greener
> grass; it's the things they use and need to work that they love to hate.
>
> I would not dismiss this as just trolling. Customer anecdotes I see suggest
> that Spark underperforms their (inflated) expectations, and generally does
> not Just Work. It takes expertise, tuning, patience, workarounds. And then
> it gets great things done. I do see a gap between how the group here talks
> about the technology, and how the users I see talk about it. The gap
> manifests in attention given to making yet more things, and attention given
> to fixing and project mechanics.
>
> I would also not dismiss criticism of governance. We can recognize some big
> problems that were resolved over even the past 3 months. Usually I hear,
> well, we do better than most projects, right? and that is true. But, Spark
> is bigger 

Spark Improvement Proposals

2016-10-06 Thread Cody Koeninger
I love Spark.  3 or 4 years ago it was the first distributed computing
environment that felt usable, and the community was welcoming.

But I just got back from the Reactive Summit, and this is what I observed:

- Industry leaders on stage making fun of Spark's streaming model
- Open source project leaders saying they looked at Spark's governance
as a model to avoid
- Users saying they chose Flink because it was technically superior
and they couldn't get any answers on the Spark mailing lists

Whether you agree with the substance of any of this, when this stuff
gets repeated enough people will believe it.

Right now Spark is suffering from its own success, and I think
something needs to change.

- We need a clear process for planning significant changes to the codebase.
I'm not saying you need to adopt Kafka Improvement Proposals exactly,
but you need a documented process with a clear outcome (e.g. a vote).
Passing around google docs after an implementation has largely been
decided on doesn't cut it.

- All technical communication needs to be public.
Things getting decided in private chat, or when 1/3 of the committers
work for the same company and can just talk to each other...
Yes, it's convenient, but it's ultimately detrimental to the health of
the project.
The way structured streaming has played out has shown that there are
significant technical blind spots (myself included).
One way to address that is to get the people who have domain knowledge
involved, and listen to them.

- We need more committers, and more committer diversity.
Per committer there are, what, more than 20 contributors and 10 new
jira tickets a month?  It's too much.
There are people (I am _not_ referring to myself) who have been around
for years, contributed thousands of lines of code, helped educate the
public around Spark... and yet are never going to be voted in.

- We need a clear process for managing volunteer work.
Too many tickets sit around unowned, unclosed, uncertain.
If someone proposed something and it isn't up to snuff, tell them and
close it.  It may be blunt, but it's clearer than "silent no".
If someone wants to work on something, let them own the ticket and set
a deadline. If they don't meet it, close it or reassign it.

This is not me putting on an Apache Bureaucracy hat.  This is me
saying, as a fellow hacker and loyal dissenter, something is wrong
with the culture and process.

Please, let's change it.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Totally agree that specifying the schema manually should be the
baseline.  LGTM, thanks for working on it.  Seems like it looks good
to others too judging by the comment on the PR that it's getting
merged to master :)
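
For anyone following along, a minimal sketch of the API under discussion; the schema and column names are invented, and df is assumed to have a string column holding one JSON object per row:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Manually specified schema, per the discussion below.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

// Parse the string column into a nested struct, then project just what's needed.
val parsed = df.select(from_json(col("json"), schema).as("event"))
val projected = parsed.select("event.id", "event.name")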

On Thu, Sep 29, 2016 at 2:13 PM, Michael Armbrust
 wrote:
>> Will this be able to handle projection pushdown if a given job doesn't
>> utilize all the columns in the schema?  Or should people have a
>>
>> per-job schema?
>
>
> As currently written, we will do a little bit of extra work to pull out
> fields that aren't needed.  I think it would be pretty straight forward to
> add a rule to the optimizer that prunes the schema passed to the
> JsonToStruct expression when there is another Project operator present.
>
>> I’m not a spark guru, but I would have hoped that DataSets and DataFrames
>> were more dynamic.
>
>
> We are dynamic in that all of these decisions can be made at runtime, and
> you can even look at the data when making them.  We do however need to know
> the schema before any single query begins executing so that we can give good
> analysis error messages and so that we can generate efficient byte code in
> our code generation.
>
>>
>> You should be doing schema inference. JSON includes the schema with each
>> record and you should take advantage of it. I guess the only issue is that
>> DataSets / DataFrames have static schemas and structures. Then if your first
>> record doesn’t include all of the columns you will have a problem.
>
>
> I agree that for ad-hoc use cases we should make it easy to infer the
> schema.  I would also argue that for a production pipeline you need the
> ability to specify it manually to avoid surprises.
>
> There are several tricky cases here.  You bring up the fact that the first
> record might be missing fields, but in many data sets there are fields that
> are only present in 1 out of 100,000s records.  Even if all fields are
> present, sometimes it can be very expensive to get even the first record
> (say you are reading from an expensive query coming from the JDBC data
> source).
>
> Another issue, is that inference means you need to read some data before the
> user explicitly starts the query.  Historically, cases where we do this have
> been pretty confusing to users of Spark (think: the surprise job that finds
> partition boundaries for RDD.sort).
>
> So, I think we should add inference, but that it should be in addition to
> the API proposed in this PR.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Will this be able to handle projection pushdown if a given job doesn't
utilize all the columns in the schema?  Or should people have a
per-job schema?

On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust
 wrote:
> Burak, you can configure what happens with corrupt records for the
> datasource using the parse mode.  The parse will still fail, so we can't get
> any data out of it, but we do leave the JSON in another column for you to
> inspect.
>
> In the case of this function, we'll just return null if it's unparseable.  You
> could filter for rows where the function returns null and inspect the input
> if you want to see whats going wrong.
>
>> When you talk about ‘user specified schema’ do you mean for the user to
>> supply an additional schema, or that you’re using the schema that’s
>> described by the JSON string?
>
>
> I mean we don't do schema inference (which we might consider adding, but
> that would be a much larger change than this PR).  You need to construct a
> StructType that says what columns you want to extract from the JSON column
> and pass that in.  I imagine in many cases the user will run schema
> inference ahead of time and then encode the inferred schema into their
> program.
>
>
> On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz  wrote:
>>
>> I would really love something like this! It would be great if it doesn't
>> throw away corrupt_records like the Data Source.
>>
>> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
>> wrote:
>>>
>>> We are currently pulling out the JSON columns, passing them through
>>> read.json, and then joining them back onto the initial DF so something like
>>> from_json would be a nice quality of life improvement for us.
>>>
>>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust
>>>  wrote:

 Spark SQL has great support for reading text files that contain JSON
 data. However, in many cases the JSON data is just one column amongst
 others. This is particularly true when reading from sources such as Kafka.
 This PR adds a new function from_json that converts a string column into a
 nested StructType with a user specified schema, using the same internal
 logic as the json Data Source.

 Would love to hear any comments / suggestions.

 Michael
>>>
>>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Cody Koeninger
Regarding documentation debt, is there a reason not to deploy
documentation updates more frequently than releases?  I recall this
used to be the case.

On Wed, Sep 28, 2016 at 3:35 PM, Joseph Bradley  wrote:
> +1 for 4 months.  With QA taking about a month, that's very reasonable.
>
> My main ask (especially for MLlib) is for contributors and committers to
> take extra care not to delay on updating the Programming Guide for new APIs.
> Documentation debt often collects and has to be paid off during QA, and a
> longer cycle will exacerbate this problem.
>
> On Wed, Sep 28, 2016 at 7:30 AM, Tom Graves 
> wrote:
>>
>> +1 to 4 months.
>>
>> Tom
>>
>>
>> On Tuesday, September 27, 2016 2:07 PM, Reynold Xin 
>> wrote:
>>
>>
>> We are 2 months past releasing Spark 2.0.0, an important milestone for the
>> project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence
>> we had for the 1.x line, and we never explicitly discussed what the release
>> cadence should look like for 2.x. Thus this email.
>>
>> During Spark 1.x, roughly every three months we make a new 1.x feature
>> release (e.g. 1.5.0 comes out three months after 1.4.0). Development
>> happened primarily in the first two months, and then a release branch was
>> cut at the end of month 2, and the last month was reserved for QA and
>> release preparation.
>>
>> During 2.0.0 development, I really enjoyed the longer release cycle
>> because there was a lot of major changes happening and the longer time was
>> critical for thinking through architectural changes as well as API design.
>> While I don't expect the same degree of drastic changes in a 2.x feature
>> release, I do think it'd make sense to increase the length of release cycle
>> so we can make better designs.
>>
>> My strawman proposal is to maintain a regular release cadence, as we did
>> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
>> effectively gives us ~50% more time to develop (in reality it'd be slightly
>> less than 50% since longer dev time also means longer QA time). As for
>> maintenance releases, I think those should still be cut on-demand, similar
>> to Spark 1.x, but more aggressively.
>>
>> To put this into perspective, 4-month cycle means we will release Spark
>> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
>> end of Oct).
>>
>> I am curious what others think.
>>
>>
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Cody Koeninger
To be clear, "safe" has very little to do with this.

It's pretty clear that there's very little risk of the spark module
for kinesis being considered a derivative work, much less all of
spark.

The use limitation in 3.3 that caused the amazon license to be put on
the apache X list also doesn't have anything to do with a legal safety
risk here.  Really, what are you going to use a kinesis connector for,
except for connecting to kinesis?


On Wed, Sep 7, 2016 at 2:41 PM, Luciano Resende  wrote:
>
>
> On Wed, Sep 7, 2016 at 12:20 PM, Mridul Muralidharan 
> wrote:
>>
>>
>> It is good to get clarification, but the way I read it, the issue is
>> whether we publish it as official Apache artifacts (in maven, etc).
>>
>> Users can of course build it directly (and we can make it easy to do so) -
>> as they are explicitly agreeing to additional licenses.
>>
>> Regards
>> Mridul
>>
>
> +1, by providing instructions on how the user would build, and attaching the
> license details on the instructions, we are then safe on the legal aspects
> of it.
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Cody Koeninger
I don't see a reason to remove the non-assembly artifact, why would
you?  You're not distributing copies of Amazon licensed code, and the
Amazon license goes out of its way not to over-reach regarding
derivative works.

This seems pretty clearly to fall in the spirit of

http://www.apache.org/legal/resolved.html#optional

I certainly think the majority of Spark users will still want to use
Spark without adding Kinesis

On Wed, Sep 7, 2016 at 3:29 AM, Sean Owen  wrote:
> It's worth calling attention to:
>
> https://issues.apache.org/jira/browse/SPARK-17418
> https://issues.apache.org/jira/browse/SPARK-17422
>
> It looks like we need to at least not publish the kinesis *assembly*
> Maven artifact because it contains Amazon Software Licensed-code
> directly.
>
> However there's a reasonably strong reason to believe that we'd have
> to remove the non-assembly Kinesis artifact too, as well as the
> Ganglia one. This doesn't mean it goes away from the project, just
> means it would no longer be published as a Maven artifact. (These have
> never been bundled in the main Spark artifacts.)
>
> I wanted to give a heads up to see if anyone a) believes this
> conclusion is wrong or b) wants to take it up with legal@? I'm
> inclined to believe we have to remove them given the interpretation
> Luciano has put forth.
>
> Sean
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-03 Thread Cody Koeninger
The Kafka commit api isn't transactional, you aren't going to get
exactly once behavior out of it even if you were committing offsets on
a per-partition basis.  This doesn't really have anything to do with
Spark; the old code you posted was already inherently broken.

Make your outputs idempotent and use commitAsync.
Or store offsets transactionally in your own data store.
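
A minimal sketch of the commitAsync approach, where stream is the InputDStream returned by the 0-10 createDirectStream and the output step is a placeholder you'd replace with something idempotent:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // Write records to an idempotent sink here.
  }

  // Queues the offsets; they are committed to Kafka on a later batch, so a
  // failure between output and commit means replay, not loss.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}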



On Fri, Sep 2, 2016 at 5:50 PM, vonnagy  wrote:
> I have upgraded to Spark 2.0 and am experimenting with Kafka 0.10.0. I
> have a stream from which I extract data and would like to update the Kafka
> offsets as each partition is handled. With Spark 1.6 or Spark 2.0 and Kafka
> 0.8.2 I was able to update the offsets, but now there seems to be no way to do so.
> Here is an example
>
> val stream = getStream
>
> stream.foreachRDD { rdd =>
> val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>
> rdd.foreachPartition { events =>
> val partId = TaskContext.get.partitionId
> val offsets = offsetRanges(partId)
>
> // Do something with the data
>
> // Update the offsets for the partition so at most, the partition's
> data would be duplicated
> }
> }
>
> With the new stream, I could call `commitAsync` with the offsets, but the
> drawback here is that it would only update the offsets after the entire RDD
> is handled. This can be a real issue for near "exactly once".
>
> With the new logic, each partition has a Kafka consumer associated with it;
> however, there is no access to it. I have looked at the
> CachedKafkaConsumer classes and there is no way to get at the cache either in
> order to call a commit on the offsets.
>
> Beyond that I have tried to use the new Kafka 0.10 APIs, but always run into
> errors as it requires one to subscribe to the topic and get assigned
> partitions. I only want to update the offsets in Kafka.
>
> Any ideas would be helpful on how I might work with the Kafka API to set the
> offsets or get Spark to add logic to allow the commitment of offsets on a
> partition basis.
>
> Thanks,
>
> Ivan
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Committing-Kafka-offsets-when-using-DirectKafkaInputDStream-tp18840.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Model abstract class in spark ml

2016-08-31 Thread Cody Koeninger
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/
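
The short version as it applies to spark.ml: Model uses an F-bounded type parameter rather than this.type, which is what lets a subclass return a fresh instance. A simplified sketch of the pattern (not the actual org.apache.spark.ml source):

abstract class MyModel[M <: MyModel[M]] {
  def copyModel(): M
}

class MyLinearModel(val weights: Array[Double]) extends MyModel[MyLinearModel] {
  // Returning a new instance compiles here; with a this.type return type it would not.
  def copyModel(): MyLinearModel = new MyLinearModel(weights.clone())
}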

On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen  wrote:
> Weird, I recompiled Spark with a similar change to Model and it seemed
> to work but maybe I missed a step in there.
>
> On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi  wrote:
>> I think I figured it out. There is indeed "something deeper in Scala" :-)
>>
>> abstract class A {
>>   def a: this.type
>> }
>>
>> class AA(i: Int) extends A {
>>   def a = this
>> }
>>
>> the above works ok. But if you return anything other than “this”, you will
>> get a compile error.
>>
>> abstract class A {
>>   def a: this.type
>> }
>>
>> class AA(i: Int) extends A {
>>   def a = new AA(1)
>> }
>>
>> Error:(33, 11) type mismatch;
>>  found   : com.dataorchard.datagears.AA
>>  required: AA.this.type
>>   def a = new AA(1)
>>   ^
>>
>> So you have to do:
>>
>> abstract class A[T <: A[T]]  {
>>   def a: T
>> }
>>
>> class AA(i: Int) extends A[AA] {
>>   def a = new AA(1)
>> }
>>
>>
>>
>> Mohit Jaggi
>> Founder,
>> Data Orchard LLC
>> www.dataorchardllc.com
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


