Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
Let's move the discussion to JIRA. Thanks!

On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) 
wrote:

> https://issues.apache.org/jira/browse/SPARK-17825
>
> Actually, I had already created a JIRA. Could you let me know your progress,
> to avoid duplicated work?
>
> Thanks!
>
> From: didi
> Date: Saturday, October 8, 2016, 12:21 AM
> To: Yanbo Liang
>
> Cc: "dev@spark.apache.org", "u...@spark.apache.org"
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> Thanks for replying.
> When could you send out the PR?
>
> From: Yanbo Liang
> Date: Friday, October 7, 2016, 11:35 PM
> To: didi
> Cc: "dev@spark.apache.org", "u...@spark.apache.org"
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> It's a good question, and I had a similar requirement in my work. I'm
> currently porting the implementation from mllib to ml, and will then expose
> the maximum log likelihood. I will send this PR soon.
>
> Thanks.
> Yanbo
>
> On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) 
> wrote:
>
>>
>> Hi,
>>
>> Do you guys sometimes need to get the log likelihood of the EM algorithm
>> in MLLIB?
>>
>> I mean the value on this line: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>>
>> Now copying the code here:
>>
>>
>> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>> // Create new distributions based on the partial assignments
>> // (often referred to as the "M" step in literature)
>> val sumWeights = sums.weights.sum
>> if (shouldDistributeGaussians) {
>>   val numPartitions = math.min(k, 1024)
>>   val tuples =
>>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>>   }.collect().unzip
>>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
>> } else {
>>   var i = 0
>>   while (i < k) {
>>     val (weight, gaussian) =
>>       updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>>     weights(i) = weight
>>     gaussians(i) = gaussian
>>     i = i + 1
>>   }
>> }
>> llhp = llh // current becomes previous
>> llh = sums.logLikelihood // this is the freshly computed log-likelihood
>> iter += 1
>> compute.destroy(blocking = false)
>>
>> In my application, I need to know the log likelihood to compare the fit for
>> different numbers of clusters, and then pick the cluster count with the
>> maximum log likelihood.
>>
>> Is it a good idea to expose this value?
>>
>>
>>
>>
>


Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
https://issues.apache.org/jira/browse/SPARK-17825

Actually, I had already created a JIRA. Could you let me know your progress,
to avoid duplicated work?

Thanks!

From: didi
Date: Saturday, October 8, 2016, 12:21 AM
To: Yanbo Liang
Cc: "dev@spark.apache.org", "u...@spark.apache.org"
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

Thanks for replying.
When could you send out the PR?

From: Yanbo Liang
Date: Friday, October 7, 2016, 11:35 PM
To: didi
Cc: "dev@spark.apache.org", "u...@spark.apache.org"
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I had a similar requirement in my work. I'm
currently porting the implementation from mllib to ml, and will then expose
the maximum log likelihood. I will send this PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) 
> wrote:

Hi,

Do you guys sometimes need to get the log likelihood of the EM algorithm in MLLIB?

I mean the value on this line:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Now copying the code here:


val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

// Create new distributions based on the partial assignments
// (often referred to as the "M" step in literature)
val sumWeights = sums.weights.sum

if (shouldDistributeGaussians) {
  val numPartitions = math.min(k, 1024)
  val tuples =
    Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
  val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
    updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
  }.collect().unzip
  Array.copy(ws.toArray, 0, weights, 0, ws.length)
  Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
} else {
  var i = 0
  while (i < k) {
    val (weight, gaussian) =
      updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
    weights(i) = weight
    gaussians(i) = gaussian
    i = i + 1
  }
}

llhp = llh // current becomes previous
llh = sums.logLikelihood // this is the freshly computed log-likelihood
iter += 1
compute.destroy(blocking = false)

In my application, I need to know the log likelihood to compare the fit for
different numbers of clusters, and then pick the cluster count with the
maximum log likelihood.

Is it a good idea to expose this value?
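
For context, a model-selection loop over k could then look roughly like the
following minimal sketch. It assumes the value surfaces as a `logLikelihood`
field on the spark.ml model summary; that accessor name is an assumption here,
and SPARK-17825 tracks the actual change.

import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("gmm-k-selection").getOrCreate()
import spark.implicits._

// Toy feature vectors; any DataFrame with a vector-typed "features" column works.
val dataset = Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.8, 10.1), Vectors.dense(10.2, 9.9)
).map(Tuple1.apply).toDF("features")

// Fit a GMM for several values of k and keep the one with the highest
// log likelihood. `model.summary.logLikelihood` is the assumed shape of
// the proposed accessor, not part of the released API at this point.
val candidates = Seq(2, 3, 4).map { k =>
  val model = new GaussianMixture().setK(k).setSeed(42L).fit(dataset)
  (k, model.summary.logLikelihood)
}
val (bestK, bestLlh) = candidates.maxBy(_._2)
println(s"best k = $bestK, log likelihood = $bestLlh")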






Issue with Spark Streaming with checkpointing in Spark 2.0

2016-10-07 Thread Arijit
In a Spark Streaming sample, I am trying to implicitly convert an RDD to a
Dataset and save it to permanent storage. Below is the snippet of the code I am
trying to run. The job runs fine the first time, when started with the
checkpoint directory empty. However, if I kill and restart the job with the
same checkpoint directory, I get the following error, resulting in job failure:


16/10/07 23:42:50 ERROR JobScheduler: Error running job streaming job 
147588355 ms.0
java.lang.NullPointerException
 at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
 at 
com.microsoft.spark.streaming.examples.workloads.EventhubsToAzureBlobAsJSON$$anonfun$createStreamingContext$2.apply(EventhubsToAzureBlobAsJSON.scala:72)
 at 
com.microsoft.spark.streaming.examples.workloads.EventhubsToAzureBlobAsJSON$$anonfun$createStreamingContext$2.apply(EventhubsToAzureBlobAsJSON.scala:72)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
 at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
 at scala.util.Try$.apply(Try.scala:192)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
 at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:245)
 at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
 at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
 at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:244)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
16/10/07 23:42:50 INFO SparkContext: Starting job: print at 
EventhubsToAzureBlobAsJSON.scala:93


Does anyone have any sample recoverable Spark Streaming code using Spark 
Session constructs of 2.0?


object EventhubsToAzureBlobAsJSON {

  def createStreamingContext(inputOptions: ArgumentMap): StreamingContext = {

.

val sparkSession : SparkSession = 
SparkSession.builder.config(sparkConfiguration).getOrCreate

import sparkSession.implicits._

val streamingContext = new StreamingContext(sparkSession.sparkContext,
  
Seconds(inputOptions(Symbol(EventhubsArgumentKeys.BatchIntervalInSeconds)).asInstanceOf[Int]))

streamingContext.checkpoint(inputOptions(Symbol(EventhubsArgumentKeys.CheckpointDirectory)).asInstanceOf[String])

val eventHubsStream = EventHubsUtils.createUnionStream(streamingContext, 
eventHubsParameters)

val eventHubsWindowedStream = eventHubsStream
  
.window(Seconds(inputOptions(Symbol(EventhubsArgumentKeys.BatchIntervalInSeconds)).asInstanceOf[Int]))

/**
  * This fails on restart
  */

eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => rdd.toDS.toJSON.write.mode(SaveMode.Overwrite)
.save(inputOptions(Symbol(EventhubsArgumentKeys.EventStoreFolder))
  .asInstanceOf[String]))

/**
  * This runs fine on restart
  */

/*
eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => 
rdd.saveAsTextFile(inputOptions(Symbol(EventhubsArgumentKeys.EventStoreFolder))
  .asInstanceOf[String], classOf[GzipCodec]))
*/

.

  }

  def main(inputArguments: Array[String]): Unit = {

val inputOptions = EventhubsArgumentParser.parseArguments(Map(), 
inputArguments.toList)


EventhubsArgumentParser.verifyEventhubsToAzureBlobAsJSONArguments(inputOptions)

//Create or recreate streaming context

val streamingContext = StreamingContext
  
.getOrCreate(inputOptions(Symbol(EventhubsArgumentKeys.CheckpointDirectory)).asInstanceOf[String],
() => createStreamingContext(inputOptions))

streamingContext.start()

if(inputOptions.contains(Symbol(EventhubsArgumentKeys.TimeoutInMinutes))) {

  
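A commonly suggested workaround for this class of failure is to obtain the
SparkSession lazily inside foreachRDD, from the RDD's own SparkContext, rather
than capturing a session (and its implicits) created outside the closure: the
captured session is not restored on recovery from the checkpoint, which is
consistent with the NullPointerException in rddToDatasetHolder above. A sketch,
untested against this exact code; EventContent and the stream come from the
snippet above, and the output path is a placeholder:

import org.apache.spark.sql.{SaveMode, SparkSession}

eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD { rdd =>
    // Re-acquire the session from the recovered RDD's SparkContext so the
    // closure does not depend on a session instance that was not restored
    // from the checkpoint.
    val session = SparkSession.builder
      .config(rdd.sparkContext.getConf)
      .getOrCreate()
    import session.implicits._
    session.createDataset(rdd).toJSON
      .write.mode(SaveMode.Overwrite)
      .save("/placeholder/output/path") // stand-in for the EventStoreFolder option
  }
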

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
Alright, it looks like there is quite a bit of support. We should wait to hear
from more people too.

To push this forward, Cody and I will be working together in the next
couple of weeks to come up with a concrete, detailed proposal on what this
entails, and then we can discuss the specific proposal as well.


On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger  wrote:

> Yeah, in case it wasn't clear, I was talking about SIPs for major
> user-facing or cross-cutting changes, not minor feature adds.
>
> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos  lightbend.com> wrote:
>
>> +1 to the SIP label, as long as it does not slow things down and it
>> targets optimizing effort, coordination, etc. For example, really small
>> features (assuming they don't touch public interfaces) or refactorings
>> should not need to go through this process, and I hope it will be kept
>> this way. A guideline doc should be provided, like in the KIP case.
>>
>> IMHO, aside from tagging things and linking them elsewhere, simply
>> having design docs and prototype implementations in PRs has not worked
>> well so far. What is really a pain in many projects out there is
>> discontinuity in the progress of PRs, missing features, and slow reviews,
>> which is understandable to some extent... it is not only about Spark, but
>> things can certainly be improved for this project in particular, as
>> already stated.
>>
>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger 
>> wrote:
>>
>>> +1 to adding an SIP label and linking it from the website.  I think it
>>> needs
>>>
>>> - template that focuses it towards soliciting user goals / non goals
>>> - clear resolution as to which strategy was chosen to pursue.  I'd
>>> recommend a vote.
>>>
>>> Matei asked me to clarify what I meant by changing interfaces, I think
>>> it's directly relevant to the SIP idea so I'll clarify here, and split
>>> a thread for the other discussion per Nicholas' request.
>>>
>>> I meant changing public user interfaces.  I think the first design is
>>> unlikely to be right, because it's done at a time when you have the
>>> least information.  As a user, I find it considerably more frustrating
>>> to be unable to use a tool to get my job done, than I do having to
>>> make minor changes to my code in order to take advantage of features.
>>> I've seen committers be seriously reluctant to allow changes to
>>> @experimental code that are needed in order for it to really work
>>> right.  You need to be able to iterate, and if people on both sides of
>>> the fence aren't going to respect that some newer apis are subject to
>>> change, then why even mark them as such?
>>>
>>> Ideally a finished SIP should give me a checklist of things that an
>>> implementation must do, and things that it doesn't need to do.
>>> Contributors/committers should be seriously discouraged from putting
>>> out a version 0.1 that doesn't have at least a prototype
>>> implementation of all those things, especially if they're then going
>>> to argue against interface changes necessary to get the rest of
>>> the things done in the 0.2 version.
>>>
>>>
>>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
>>> > I like the lightweight proposal to add a SIP label.
>>> >
>>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki
>>> to
>>> > track the list of major changes, but that never really materialized
>>> due to
>>> > the overhead. Adding a SIP label on major JIRAs and then linking to them
>>> > prominently on the Spark website makes a lot of sense.
>>> >
>>> >
>>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia <
>>> matei.zaha...@gmail.com>
>>> > wrote:
>>> >>
>>> >> For the improvement proposals, I think one major point was to make
>>> them
>>> >> really visible to users who are not contributors, so we should do
>>> more than
>>> >> sending stuff to dev@. One very lightweight idea is to have a new
>>> type of
>>> >> JIRA called a SIP and have a link to a filter that shows all such
>>> JIRAs from
>>> >> http://spark.apache.org. I also like the idea of SIP and design doc
>>> >> templates (in fact many projects have them).
>>> >>
>>> >> Matei
>>> >>
>>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>>> >>
>>> >> I called Cody last night and talked about some of the topics in his
>>> email.
>>> >> It became clear to me Cody genuinely cares about the project.
>>> >>
>>> >> Some of the frustrations come from the success of the project itself
>>> >> becoming very "hot", and it is difficult to get clarity from people
>>> who
>>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>>> similar
>>> >> to scaling an engineering team in a successful startup: old processes
>>> that
>>> >> worked well might not work so well when it gets to a certain size,
>>> cultures
>>> >> can get diluted, building culture vs building process, etc.
>>> >>
>>> >> 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
Ah yes, on a given JIRA issue the number of watchers is often a better
indicator of community interest than votes.

But yeah, it could be any metric or formula we want, as long as it yielded
a "reasonable" bar to cross for unsolicited contributions to get committer
review--or at the very least a comment from them saying yes/no/later.

On Fri, Oct 7, 2016 at 5:59 PM Cody Koeninger  wrote:

> I really like the idea of using jira votes (and/or watchers?) as a filter!
>
> On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas
>  wrote:
> > I agree with Cody and others that we need some automation — or at least
> an
> > adjusted process — to help us manage organic contributions better.
> >
> > The objections about automated closing being potentially abrasive are
> > understood, but I wouldn’t accept that as a defeat for automation.
> Instead,
> > it seems like a constraint we should impose on any proposed solution:
> Make
> > sure it doesn’t turn contributors off. Rolling as we have been won’t cut
> it,
> > and I don’t think adding committers will ever be a sufficient solution to
> > this particular problem.
> >
> > To me, it seems like we need a way to filter out viable contributions
> with
> > community support from other contributions when it comes to deciding that
> > automated action is appropriate. Our current tooling isn’t perfect, but
> > perhaps we can leverage it to create such a filter.
> >
> > For example, consider the following strawman proposal for how to cut
> down on
> > the number of pending but unviable proposals, and simultaneously help
> > contributors organize to promote viable proposals and get the attention
> of
> > committers:
> >
> > Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been
> updated
> > in 20+ days (or D+ days, if you prefer).
> > Depending on the level of community support, either close the item or
> ping
> > specific people for action. Specifically:
> > a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
> votes
> > (or V+ votes), ping committers for input. (For PRs, you could count
> comments
> > from different people, or thumbs up on the initial PR post.)
> > b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> > than V votes, close it with a gentle message asking the contributor to
> > solicit support from either the community or a committer, and try again
> > later.
> > c. If the JIRA/PR has input from a committer or committers, ping them
> for an
> > update.
> >
> > This is just a rough idea. The point is that when contributors have stale
> > proposals that they don’t close, committers need to take action. A little
> > automation to selectively bring contributions to the attention of
> committers
> > can perhaps help them manage the backlog of stale contributions. The
> > “selective” part is implemented in this strawman proposal by using JIRA
> > votes as a crude proxy for when the community is interested in something,
> > but it could be anything.
> >
> > Also, this doesn’t have to be used just to clear out stale proposals.
> Once
> > the initial backlog is trimmed down, you could set D to 5 days and use
> this
> > as a regular way to bring contributions to the attention of committers.
> >
> > I dunno if people think this is perhaps too complex, but at our scale I
> feel
> > we need some kind of loose but automated system for funneling
> contributions
> > through some kind of lifecycle. The status quo is just not that good
> (e.g.
> > 474 open PRs against Spark as of this moment).
> >
> > Nick
> >
> >
> > On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger 
> wrote:
> >>
> >> Matei asked:
> >>
> >>
> >> > I agree about empowering people interested here to contribute, but I'm
> >> > wondering, do you think there are technical things that people don't
> want to
> >> > work on, or is it a matter of what there's been time to do?
> >>
> >>
> >> It's a matter of mismanagement and miscommunication.
> >>
> >> The structured streaming kafka jira sat with multiple unanswered
> >> requests for someone who was a committer to communicate whether they
> >> were working on it and what the plan was.  I could have done that
> >> implementation and had it in users' hands months ago.  I didn't
> >> pre-emptively do it because I didn't want to then have to argue with
> >> committers about why my code did or did not meet their uncommunicated
> >> expectations.
> >>
> >>
> >> I don't want to re-hash that particular circumstance, I just want to
> >> make sure it never happens again.
> >>
> >>
> >> Hopefully the SIP thread results in clearer expectations, but there
> >> are still some ideas on the table regarding management of volunteer
> >> contributions:
> >>
> >>
>> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
> >> the alternative of "someone cleans it up" is not sufficient right now
> >> (with apologies to Sean and all the other janitors).
> >>
> 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
I really like the idea of using jira votes (and/or watchers?) as a filter!

On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas
 wrote:
> I agree with Cody and others that we need some automation — or at least an
> adjusted process — to help us manage organic contributions better.
>
> The objections about automated closing being potentially abrasive are
> understood, but I wouldn’t accept that as a defeat for automation. Instead,
> it seems like a constraint we should impose on any proposed solution: Make
> sure it doesn’t turn contributors off. Rolling as we have been won’t cut it,
> and I don’t think adding committers will ever be a sufficient solution to
> this particular problem.
>
> To me, it seems like we need a way to filter out viable contributions with
> community support from other contributions when it comes to deciding that
> automated action is appropriate. Our current tooling isn’t perfect, but
> perhaps we can leverage it to create such a filter.
>
> For example, consider the following strawman proposal for how to cut down on
> the number of pending but unviable proposals, and simultaneously help
> contributors organize to promote viable proposals and get the attention of
> committers:
>
> Have a bot scan for stale JIRA issues and PRs—i.e. they haven’t been updated
> in 20+ days (or D+ days, if you prefer).
> Depending on the level of community support, either close the item or ping
> specific people for action. Specifically:
> a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+ votes
> (or V+ votes), ping committers for input. (For PRs, you could count comments
> from different people, or thumbs up on the initial PR post.)
> b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
> than V votes, close it with a gentle message asking the contributor to
> solicit support from either the community or a committer, and try again
> later.
> c. If the JIRA/PR has input from a committer or committers, ping them for an
> update.
>
> This is just a rough idea. The point is that when contributors have stale
> proposals that they don’t close, committers need to take action. A little
> automation to selectively bring contributions to the attention of committers
> can perhaps help them manage the backlog of stale contributions. The
> “selective” part is implemented in this strawman proposal by using JIRA
> votes as a crude proxy for when the community is interested in something,
> but it could be anything.
>
> Also, this doesn’t have to be used just to clear out stale proposals. Once
> the initial backlog is trimmed down, you could set D to 5 days and use this
> as a regular way to bring contributions to the attention of committers.
>
> I dunno if people think this is perhaps too complex, but at our scale I feel
> we need some kind of loose but automated system for funneling contributions
> through some kind of lifecycle. The status quo is just not that good (e.g.
> 474 open PRs against Spark as of this moment).
>
> Nick
>
>
> On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger  wrote:
>>
>> Matei asked:
>>
>>
>> > I agree about empowering people interested here to contribute, but I'm
>> > wondering, do you think there are technical things that people don't want 
>> > to
>> > work on, or is it a matter of what there's been time to do?
>>
>>
>> It's a matter of mismanagement and miscommunication.
>>
>> The structured streaming kafka jira sat with multiple unanswered
>> requests for someone who was a committer to communicate whether they
>> were working on it and what the plan was.  I could have done that
>> implementation and had it in users' hands months ago.  I didn't
>> pre-emptively do it because I didn't want to then have to argue with
>> committers about why my code did or did not meet their uncommunicated
>> expectations.
>>
>>
>> I don't want to re-hash that particular circumstance, I just want to
>> make sure it never happens again.
>>
>>
>> Hopefully the SIP thread results in clearer expectations, but there
>> are still some ideas on the table regarding management of volunteer
>> contributions:
>>
>>
>> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
>> the alternative of "someone cleans it up" is not sufficient right now
>> (with apologies to Sean and all the other janitors).
>>
>> - Clear rejection of jiras.  This isn't mean, it's respectful.
>>
>> - Clear "I'm working on this", with clear removal and reassignment if
>> they go radio silent.  This could be keyed to automated check for
>> staleness.
>>
>> - Clear expectation that if someone is working on a jira, you can work
>> on your own alternative, but you need to communicate.
>>
>>
>> I'm sure I've missed some.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: 

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
I agree with Cody and others that we need some automation — or at least an
adjusted process — to help us manage organic contributions better.

The objections about automated closing being potentially abrasive are
understood, but I wouldn’t accept that as a defeat for automation. Instead,
it seems like a constraint we should impose on any proposed solution: Make
sure it doesn’t turn contributors off. Rolling as we have been won’t cut
it, and I don’t think adding committers will ever be a sufficient solution
to this particular problem.

To me, it seems like we need a way to filter out viable contributions with
community support from other contributions when it comes to deciding that
automated action is appropriate. Our current tooling isn’t perfect, but
perhaps we can leverage it to create such a filter.

For example, consider the following strawman proposal for how to cut down
on the number of pending but unviable proposals, and simultaneously help
contributors organize to promote viable proposals and get the attention of
committers:

   1. Have a bot scan for *stale* JIRA issues and PRs—i.e. they haven’t
   been updated in 20+ days (or D+ days, if you prefer).
   2. Depending on the level of community support, either close the item or
   ping specific people for action. Specifically:
   a. If the JIRA/PR has no input from a committer and the JIRA/PR has 5+
   votes (or V+ votes), ping committers for input. (For PRs, you could
   count comments from different people, or thumbs up on the initial PR post.)
   b. If the JIRA/PR has no input from a committer and the JIRA/PR has less
   than V votes, close it with a gentle message asking the contributor to
   solicit support from either the community or a committer, and try again
   later.
   c. If the JIRA/PR has input from a committer or committers, ping them
   for an update.
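
To make the rules above concrete, here is a minimal Scala sketch of the
decision logic. The `Item` shape, the action names, and the default thresholds
are all illustrative, not a proposed implementation:

// Hypothetical triage view of a JIRA issue or PR.
case class Item(daysSinceUpdate: Int, votes: Int, hasCommitterInput: Boolean)

sealed trait Action
case object LeaveAlone extends Action
case object PingCommittersForInput extends Action // rule (a)
case object CloseWithGentleMessage extends Action // rule (b)
case object PingInvolvedCommitters extends Action // rule (c)

// d = staleness window in days, v = vote threshold; both placeholders.
def triage(item: Item, d: Int = 20, v: Int = 5): Action =
  if (item.daysSinceUpdate < d) LeaveAlone
  else if (item.hasCommitterInput) PingInvolvedCommitters
  else if (item.votes >= v) PingCommittersForInput
  else CloseWithGentleMessage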

This is just a rough idea. The point is that when contributors have stale
proposals that they don’t close, committers need to take action. A little
automation to selectively bring contributions to the attention of
committers can perhaps help them manage the backlog of stale contributions.
The “selective” part is implemented in this strawman proposal by using JIRA
votes as a crude proxy for when the community is interested in something,
but it could be anything.

Also, this doesn’t have to be used just to clear out stale proposals. Once
the initial backlog is trimmed down, you could set D to 5 days and use this
as a regular way to bring contributions to the attention of committers.

I dunno if people think this is perhaps too complex, but at our scale I
feel we need some kind of loose but automated system for funneling
contributions through some kind of lifecycle. The status quo is just not
that good (e.g. 474 open PRs 
against Spark as of this moment).

Nick

On Fri, Oct 7, 2016 at 4:48 PM Cody Koeninger  wrote:

> Matei asked:
>
>
> > I agree about empowering people interested here to contribute, but I'm
> wondering, do you think there are technical things that people don't want
> to work on, or is it a matter of what there's been time to do?
>
>
> It's a matter of mismanagement and miscommunication.
>
> The structured streaming kafka jira sat with multiple unanswered
> requests for someone who was a committer to communicate whether they
> were working on it and what the plan was.  I could have done that
> implementation and had it in users' hands months ago.  I didn't
> pre-emptively do it because I didn't want to then have to argue with
> committers about why my code did or did not meet their uncommunicated
> expectations.
>
>
> I don't want to re-hash that particular circumstance, I just want to
> make sure it never happens again.
>
>
> Hopefully the SIP thread results in clearer expectations, but there
> are still some ideas on the table regarding management of volunteer
> contributions:
>
>
> - Closing stale jiras.  I hear the "bots are impersonal" argument, but
> the alternative of "someone cleans it up" is not sufficient right now
> (with apologies to Sean and all the other janitors).
>
> - Clear rejection of jiras.  This isn't mean, it's respectful.
>
> - Clear "I'm working on this", with clear removal and reassignment if
> they go radio silent.  This could be keyed to automated check for
> staleness.
>
> - Clear expectation that if someone is working on a jira, you can work
> on your own alternative, but you need to communicate.
>
>
> I'm sure I've missed some.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Yeah, in case it wasn't clear, I was talking about SIPs for major
user-facing or cross-cutting changes, not minor feature adds.

On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1 to the SIP label, as long as it does not slow things down and it targets
> optimizing effort, coordination, etc. For example, really small features
> (assuming they don't touch public interfaces) or refactorings should not
> need to go through this process, and I hope it will be kept this way. A
> guideline doc should be provided, like in the KIP case.
>
> IMHO, aside from tagging things and linking them elsewhere, simply
> having design docs and prototype implementations in PRs has not worked
> well so far. What is really a pain in many projects out there is
> discontinuity in the progress of PRs, missing features, and slow reviews,
> which is understandable to some extent... it is not only about Spark, but
> things can certainly be improved for this project in particular, as already
> stated.
>
> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger 
> wrote:
>
>> +1 to adding an SIP label and linking it from the website.  I think it
>> needs
>>
>> - template that focuses it towards soliciting user goals / non goals
>> - clear resolution as to which strategy was chosen to pursue.  I'd
>> recommend a vote.
>>
>> Matei asked me to clarify what I meant by changing interfaces, I think
>> it's directly relevant to the SIP idea so I'll clarify here, and split
>> a thread for the other discussion per Nicholas' request.
>>
>> I meant changing public user interfaces.  I think the first design is
>> unlikely to be right, because it's done at a time when you have the
>> least information.  As a user, I find it considerably more frustrating
>> to be unable to use a tool to get my job done, than I do having to
>> make minor changes to my code in order to take advantage of features.
>> I've seen committers be seriously reluctant to allow changes to
>> @experimental code that are needed in order for it to really work
>> right.  You need to be able to iterate, and if people on both sides of
>> the fence aren't going to respect that some newer apis are subject to
>> change, then why even mark them as such?
>>
>> Ideally a finished SIP should give me a checklist of things that an
>> implementation must do, and things that it doesn't need to do.
>> Contributors/committers should be seriously discouraged from putting
>> out a version 0.1 that doesn't have at least a prototype
>> implementation of all those things, especially if they're then going
>> to argue against interface changes necessary to get the rest of
>> the things done in the 0.2 version.
>>
>>
>> On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
>> > I like the lightweight proposal to add a SIP label.
>> >
>> > During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
>> > track the list of major changes, but that never really materialized due
>> to
>> > the overhead. Adding a SIP label on major JIRAs and then linking to them
>> > prominently on the Spark website makes a lot of sense.
>> >
>> >
>> > On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia > >
>> > wrote:
>> >>
>> >> For the improvement proposals, I think one major point was to make them
>> >> really visible to users who are not contributors, so we should do more
>> than
>> >> sending stuff to dev@. One very lightweight idea is to have a new
>> type of
>> >> JIRA called a SIP and have a link to a filter that shows all such
>> JIRAs from
>> >> http://spark.apache.org. I also like the idea of SIP and design doc
>> >> templates (in fact many projects have them).
>> >>
>> >> Matei
>> >>
>> >> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>> >>
>> >> I called Cody last night and talked about some of the topics in his
>> email.
>> >> It became clear to me Cody genuinely cares about the project.
>> >>
>> >> Some of the frustrations come from the success of the project itself
>> >> becoming very "hot", and it is difficult to get clarity from people who
>> >> don't dedicate all their time to Spark. In fact, it is in some ways
>> similar
>> >> to scaling an engineering team in a successful startup: old processes
>> that
>> >> worked well might not work so well when it gets to a certain size,
>> cultures
>> >> can get diluted, building culture vs building process, etc.
>> >>
>> >> I would also really like to have a more visible process for larger
>> >> changes, especially major user-facing API changes. Historically we have
>> >> uploaded design docs for major changes, but it is not always consistent,
>> >> and it is difficult to ensure the quality of the docs, due to the
>> >> volunteering nature of the organization.
>> >>
>> >> Some of the more concrete ideas we discussed focus on building a
>> culture
>> >> to improve clarity:
>> >>
>> >> - Process: Large changes should have design docs posted on JIRA. 

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
>
> Without a hell of a lot more work, Assign would be the only strategy
> usable.


How would the current "subscribe" break?


Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
Matei asked:


> I agree about empowering people interested here to contribute, but I'm 
> wondering, do you think there are technical things that people don't want to 
> work on, or is it a matter of what there's been time to do?


It's a matter of mismanagement and miscommunication.

The structured streaming kafka jira sat with multiple unanswered
requests for someone who was a committer to communicate whether they
were working on it and what the plan was.  I could have done that
implementation and had it in users' hands months ago.  I didn't
pre-emptively do it because I didn't want to then have to argue with
committers about why my code did or did not meet their uncommunicated
expectations.


I don't want to re-hash that particular circumstance, I just want to
make sure it never happens again.


Hopefully the SIP thread results in clearer expectations, but there
are still some ideas on the table regarding management of volunteer
contributions:


- Closing stale jiras.  I hear the "bots are impersonal" argument, but
the alternative of "someone cleans it up" is not sufficient right now
(with apologies to Sean and all the other janitors).

- Clear rejection of jiras.  This isn't mean, it's respectful.

- Clear "I'm working on this", with clear removal and reassignment if
they go radio silent.  This could be keyed to automated check for
staleness.

- Clear expectation that if someone is working on a jira, you can work
on your own alternative, but you need to communicate.


I'm sure I've missed some.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
Without a hell of a lot more work, Assign would be the only strategy usable.

On Fri, Oct 7, 2016 at 3:25 PM, Michael Armbrust  wrote:
>> The implementation is totally and completely different however, in ways
>> that leak to the end user.
>
>
> Can you elaborate? Especially in the context of the interface provided by
> structured streaming.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> 0.10 consumers won't work on an earlier broker.
> Earlier consumers will (should?) work on a 0.10 broker.
>

This lines up with my testing.  Is there a page I'm missing that describes
this?  Like, does a 0.9 client work with a 0.8 broker?  Is it always the case
that old clients can talk to new brokers, but not vice versa?


Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
>
> The implementation is totally and completely different however, in ways
> that leak to the end user.


Can you elaborate? Especially in the context of the interface provided by
structured streaming.


Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
0.10 consumers won't work on an earlier broker.

Earlier consumers will (should?) work on a 0.10 broker.

The main things earlier consumers lack from a user perspective are
support for SSL and pre-fetching of messages.  The implementation is
totally and completely different however, in ways that leak to the end
user.

On Fri, Oct 7, 2016 at 3:15 PM, Reynold Xin  wrote:
> Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster?
>
>
> On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith 
> wrote:
>>
>> +1
>>
>> We're on CDH, and it will probably be a while before they support Kafka
>> 0.10. At the same time, we don't use their Spark and we're looking forward
>> to upgrading to 2.0.x and using structured streaming.
>>
>> I was just going to write our own Kafka Source implementation which uses
>> the existing KafkaRDD but it would be much easier to get buy-in for an
>> official Spark module.
>>
>> Jeremy
>>
>> On Fri, Oct 7, 2016 at 12:41 PM, Michael Armbrust 
>> wrote:
>>>
>>> We recently merged support for Kafka 0.10.0 in Structured Streaming, but
>>> I've been hearing a few people tell me that they are stuck on an older
>>> version of Kafka and cannot upgrade.  I'm considering revisiting
>>> SPARK-17344, but it would be good to have more information.
>>>
>>> Could people please vote or comment on the above ticket if a lack of
>>> support for older versions of kafka would block you from trying out
>>> structured streaming?
>>>
>>> Thanks!
>>>
>>> Michael
>>
>>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Reynold Xin
Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster?


On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith 
wrote:

> +1
>
> We're on CDH, and it will probably be a while before they support Kafka
> 0.10. At the same time, we don't use their Spark and we're looking forward
> to upgrading to 2.0.x and using structured streaming.
>
> I was just going to write our own Kafka Source implementation which uses
> the existing KafkaRDD but it would be much easier to get buy-in for an
> official Spark module.
>
> Jeremy
>
> On Fri, Oct 7, 2016 at 12:41 PM, Michael Armbrust 
> wrote:
>
>> We recently merged support for Kafka 0.10.0 in Structured Streaming, but
>> I've been hearing a few people tell me that they are stuck on an older
>> version of Kafka and cannot upgrade.  I'm considering revisiting
>> SPARK-17344 , but it
>> would be good to have more information.
>>
>> Could people please vote or comment on the above ticket if a lack of
>> support for older versions of kafka would block you from trying out
>> structured streaming?
>>
>> Thanks!
>>
>> Michael
>>
>
>


Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Jeremy Smith
+1

We're on CDH, and it will probably be a while before they support Kafka
0.10. At the same time, we don't use their Spark and we're looking forward
to upgrading to 2.0.x and using structured streaming.

I was just going to write our own Kafka Source implementation which uses
the existing KafkaRDD but it would be much easier to get buy-in for an
official Spark module.

Jeremy

On Fri, Oct 7, 2016 at 12:41 PM, Michael Armbrust 
wrote:

> We recently merged support for Kafka 0.10.0 in Structured Streaming, but
> I've been hearing a few people tell me that they are stuck on an older
> version of Kafka and cannot upgrade.  I'm considering revisiting
> SPARK-17344 , but it
> would be good to have more information.
>
> Could people please vote or comment on the above ticket if a lack of
> support for older versions of kafka would block you from trying out
> structured streaming?
>
> Thanks!
>
> Michael
>


Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
+1 to adding an SIP label and linking it from the website.  I think it needs

- template that focuses it towards soliciting user goals / non goals
- clear resolution as to which strategy was chosen to pursue.  I'd
recommend a vote.

Matei asked me to clarify what I meant by changing interfaces, I think
it's directly relevant to the SIP idea so I'll clarify here, and split
a thread for the other discussion per Nicholas' request.

I meant changing public user interfaces.  I think the first design is
unlikely to be right, because it's done at a time when you have the
least information.  As a user, I find it considerably more frustrating
to be unable to use a tool to get my job done, than I do having to
make minor changes to my code in order to take advantage of features.
I've seen committers be seriously reluctant to allow changes to
@experimental code that are needed in order for it to really work
right.  You need to be able to iterate, and if people on both sides of
the fence aren't going to respect that some newer apis are subject to
change, then why even mark them as such?

Ideally a finished SIP should give me a checklist of things that an
implementation must do, and things that it doesn't need to do.
Contributors/committers should be seriously discouraged from putting
out a version 0.1 that doesn't have at least a prototype
implementation of all those things, especially if they're then going
to argue against interface changes necessary to get the rest of
the things done in the 0.2 version.


On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin  wrote:
> I like the lightweight proposal to add a SIP label.
>
> During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
> track the list of major changes, but that never really materialized due to
> the overhead. Adding a SIP label on major JIRAs and then linking to them
> prominently on the Spark website makes a lot of sense.
>
>
> On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia 
> wrote:
>>
>> For the improvement proposals, I think one major point was to make them
>> really visible to users who are not contributors, so we should do more than
>> sending stuff to dev@. One very lightweight idea is to have a new type of
>> JIRA called a SIP and have a link to a filter that shows all such JIRAs from
>> http://spark.apache.org. I also like the idea of SIP and design doc
>> templates (in fact many projects have them).
>>
>> Matei
>>
>> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>>
>> I called Cody last night and talked about some of the topics in his email.
>> It became clear to me Cody genuinely cares about the project.
>>
>> Some of the frustrations come from the success of the project itself
>> becoming very "hot", and it is difficult to get clarity from people who
>> don't dedicate all their time to Spark. In fact, it is in some ways similar
>> to scaling an engineering team in a successful startup: old processes that
>> worked well might not work so well when it gets to a certain size, cultures
>> can get diluted, building culture vs building process, etc.
>>
>> I would also really like to have a more visible process for larger changes,
>> especially major user-facing API changes. Historically we have uploaded
>> design docs for major changes, but it is not always consistent, and it is
>> difficult to ensure the quality of the docs, due to the volunteering nature
>> of the organization.
>>
>> Some of the more concrete ideas we discussed focus on building a culture
>> to improve clarity:
>>
>> - Process: Large changes should have design docs posted on JIRA. One thing
>> Cody and I didn't discuss but an idea that just came to me is we should
>> create a design doc template for the project and ask everybody to follow.
>> The design doc template should also explicitly list goals and non-goals, to
>> make design doc more consistent.
>>
>> - Process: Email dev@ to solicit feedback. We have done this with some
>> changes, but again very inconsistently. Just posting something on JIRA isn't
>> sufficient, because there are simply too many JIRAs and the signal gets lost
>> in the noise. While this is generally impossible to enforce because we can't
>> force all volunteers to conform to a process (or they might not even be
>> aware of this),  those who are more familiar with the project can help by
>> emailing the dev@ when they see something that hasn't been.
>>
>> - Culture: The design doc author(s) should be open to feedback. A design
>> doc should serve as the base for discussion and is by no means the final
>> design. Of course, this does not mean the author has to accept every
>> feedback. They should also be comfortable accepting / rejecting ideas on
>> technical grounds.
>>
>> - Process / Culture: For major ongoing projects, it can be useful to have
>> some monthly Google hangouts that are open to the world. I am actually not
>> sure how well this will work, because of the volunteering nature and we need
>> to adjust 

Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
We recently merged support for Kafka 0.10.0 in Structured Streaming, but
I've been hearing a few people tell me that they are stuck on an older
version of Kafka and cannot upgrade.  I'm considering revisiting SPARK-17344
, but it would be good
to have more information.

Could people please vote or comment on the above ticket if a lack of
support for older versions of kafka would block you from trying out
structured streaming?

Thanks!

Michael


Re: Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Thanks for the answer, Reynold.
Yes, I can use the Dataset API, but it won't fully solve the purpose I need
it for.
I am trying to build a solution where I need to save the case class along
with the data in HDFS.
This data will then move to different folders corresponding to different
case classes.
The Spark programs reading these files are supposed to apply the case class
directly, depending on the folder they are reading from.

Thanks
Deepak

On Oct 8, 2016 00:53, "Reynold Xin"  wrote:

> You can use the Dataset API -- it should solve this issue for case classes
> that are not very complex.
>
> On Fri, Oct 7, 2016 at 12:20 PM, Deepak Sharma 
> wrote:
>
>> Hi
>> I am saving RDD[Example] in hdfs from spark program , where Example is
>> case class.
>> Now when i am trying to read it back , it returns RDD[String] with the
>> content as below:
>> *Example(1,name,value)*
>>
>> The workaround can be to write as a string in hdfs and read it back as
>> string and perform further processing.This way the case class name wouldn't
>> appear at all in the file being written in hdfs.
>> But i am keen to know if we can read the data directly in Spark if the
>> RDD[Case_Class] is written to hdfs?
>>
>> --
>> Thanks
>> Deepak
>>
>
>


Re: Reading back hdfs files saved as case class

2016-10-07 Thread Reynold Xin
You can use the Dataset API -- it should solve this issue for case classes
that are not very complex.
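
For illustration, a minimal sketch of that round trip using Parquet (the path
and names are placeholders): writing a Dataset[Example] preserves the schema,
so it can be read back and re-typed with as[Example] instead of coming back as
RDD[String].

import org.apache.spark.sql.SparkSession

case class Example(id: Int, name: String, value: String)

val spark = SparkSession.builder.appName("case-class-roundtrip").getOrCreate()
import spark.implicits._

// Write the case-class data with its schema instead of its toString output.
val ds = Seq(Example(1, "name", "value")).toDS()
ds.write.mode("overwrite").parquet("hdfs:///tmp/examples")

// Read it back and apply the case class directly.
val readBack = spark.read.parquet("hdfs:///tmp/examples").as[Example]
readBack.show()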

On Fri, Oct 7, 2016 at 12:20 PM, Deepak Sharma 
wrote:

> Hi
> I am saving RDD[Example] in HDFS from a Spark program, where Example is a
> case class.
> Now when I am trying to read it back, it returns RDD[String] with the
> content as below:
> *Example(1,name,value)*
>
> The workaround can be to write it as a string in HDFS and read it back as a
> string for further processing. This way the case class name wouldn't
> appear at all in the file being written in HDFS.
> But I am keen to know if we can read the data directly in Spark if the
> RDD[Case_Class] is written to HDFS?
>
> --
> Thanks
> Deepak
>


Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Hi
I am saving RDD[Example] in HDFS from a Spark program, where Example is a case
class.
Now when I am trying to read it back, it returns RDD[String] with the
content as below:
*Example(1,name,value)*

The workaround can be to write it as a string in HDFS and read it back as a
string for further processing. This way the case class name wouldn't
appear at all in the file being written in HDFS.
But I am keen to know if we can read the data directly in Spark if the
RDD[Case_Class] is written to HDFS?

-- 
Thanks
Deepak


Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
I like the lightweight proposal to add a SIP label.

During Spark 2.0 development, Tom (Graves) and I suggested using wiki to
track the list of major changes, but that never really materialized due to
the overhead. Adding a SIP label on major JIRAs and then linking to them
prominently on the Spark website makes a lot of sense.


On Fri, Oct 7, 2016 at 10:50 AM, Matei Zaharia 
wrote:

> For the improvement proposals, I think one major point was to make them
> really visible to users who are not contributors, so we should do more than
> sending stuff to dev@. One very lightweight idea is to have a new type of
> JIRA called a SIP and have a link to a filter that shows all such JIRAs
> from http://spark.apache.org. I also like the idea of SIP and design doc
> templates (in fact many projects have them).
>
> Matei
>
> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
>
> I called Cody last night and talked about some of the topics in his email.
> It became clear to me Cody genuinely cares about the project.
>
> Some of the frustrations come from the success of the project itself
> becoming very "hot", and it is difficult to get clarity from people who
> don't dedicate all their time to Spark. In fact, it is in some ways similar
> to scaling an engineering team in a successful startup: old processes that
> worked well might not work so well when it gets to a certain size, cultures
> can get diluted, building culture vs building process, etc.
>
> I would also really like to have a more visible process for larger changes,
> especially major user-facing API changes. Historically we have uploaded
> design docs for major changes, but it is not always consistent, and it is
> difficult to ensure the quality of the docs, due to the volunteering nature
> of the organization.
>
> Some of the more concrete ideas we discussed focus on building a culture
> to improve clarity:
>
> - Process: Large changes should have design docs posted on JIRA. One thing
> Cody and I didn't discuss but an idea that just came to me is we should
> create a design doc template for the project and ask everybody to follow.
> The design doc template should also explicitly list goals and non-goals, to
> make design doc more consistent.
>
> - Process: Email dev@ to solicit feedback. We have done this with some
> changes, but again very inconsistently. Just posting something on JIRA isn't
> sufficient, because there are simply too many JIRAs and the signal gets lost
> in the noise. While this is generally impossible to enforce because we
> can't force all volunteers to conform to a process (or they might not even
> be aware of this),  those who are more familiar with the project can help
> by emailing the dev@ when they see something that hasn't been.
>
> - Culture: The design doc author(s) should be open to feedback. A design
> doc should serve as the base for discussion and is by no means the final
> design. Of course, this does not mean the author has to accept every
> feedback. They should also be comfortable accepting / rejecting ideas on
> technical grounds.
>
> - Process / Culture: For major ongoing projects, it can be useful to have
> some monthly Google hangouts that are open to the world. I am actually not
> sure how well this will work, because of the volunteering nature and we
> need to adjust for timezones for people across the globe, but it seems
> worth trying.
>
> - Culture: Contributors (including committers) should be more direct in
> setting expectations, including whether they are working on a specific
> issue, whether they will be working on a specific issue, and whether an
> issue or pr or jira should be rejected. Most people I know in this
> community are nice and don't enjoy telling other people no, but it is often
> more annoying to a contributor to not know anything than getting a no.
>
>
> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia 
> wrote:
>
>>
>> Love the idea of a more visible "Spark Improvement Proposal" process that
>> solicits user input on new APIs. For what it's worth, I don't think
>> committers are trying to minimize their own work -- every committer cares
>> about making the software useful for users. However, it is always hard to
>> get user input and so it helps to have this kind of process. I've certainly
>> looked at the *IPs a lot in other software I use just to see the biggest
>> things on the roadmap.
>>
>> When you're talking about "changing interfaces", are you talking about
>> public or internal APIs? I do think many people hate changing public APIs
>> and I actually think that's for the best of the project. That's a technical
>> debate, but basically, the worst thing when you're using a piece of
>> software is that the developers constantly ask you to rewrite your app to
>> update to a new version (and thus benefit from bug fixes, etc). Cue anyone
>> who's used Protobuf, or Guava. The "let's get everyone to change their code
>> this release" model works well within a 

Re: Spark Improvement Proposals

2016-10-07 Thread Hyukjin Kwon
I am glad that I was not the only one thinking this.
I also agree with Holden, Sean, and Cody. Everything I wanted to say has
already been said.



2016-10-08 1:16 GMT+09:00 Holden Karau :

> First off, thanks Cody for taking the time to put together these proposals
> - I think it has kicked off some wonderful discussion.
>
> I think dismissing people's complaints about Spark as largely trolling does
> us a disservice; it’s important for us to recognize our own shortcomings -
> otherwise we are blind to the weak spots where we need to improve and
> instead focus on new features. Parts of the Python community seem to be
> actively looking for alternatives, and I’d obviously like Spark continue to
> be the place where we come together and collaborate from different
> languages.
>
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve
> been keeping on top of the new ones but largely haven’t looked at the older
> ones) and if there is a committer (maybe Davies or Sean?) who would be able
> to help out with merging them once they are ready that would be awesome.
> I’m at PyData DC this weekend but I’ll also start going through some of the
> older Python JIRAs and seeing if they are still relevant, already fixed, or
> something we are unlikely to be interested in bringing into Spark.
>
> I’m giving a talk later on this month on how to get started contributing
> to Apache Spark at OSCON London, and when I’ve given this talk before I’ve
> had to include a fair number of warnings about the challenges that can face
> a new contributor. I’d love to be able to drop those in future versions :)
>
> P.S.
>
> As one of the non-committers who has been working on Spark for several
> years (see http://bit.ly/hkspmg ) I have strong feelings around the
> current process being used for committers - but since I’m not on the PMC
> (catch-22 style) it's difficult to have any visibility into the process, so
> someone who does will have to weigh in on that :)
>
>
> On Fri, Oct 7, 2016 at 8:00 AM, Cody Koeninger  wrote:
>
>> Sean, that was very eloquently put, and I 100% agree.  If I ever meet
>> you in person, I'll buy you multiple rounds of beverages of your
>> choice ;)
>> This is probably reiterating some of what you said in a less clear
>> manner, but I'll throw more of my 2 cents in.
>>
>> - Design.
>> Yes, design by committee doesn't work.  The best designs are when a
>> person who understands the problem builds something that works for
>> them, shares with others, and most importantly iterates when it
>> doesn't work for others.  This iteration only works if you're willing
>> to change interfaces, but committer and user goals are not aligned
>> here.  Users want something that is clearly documented and helps them
>> get their job done.  Committers (not all) want to minimize interface
>> change, even at the expense of users being able to do their jobs.  In
>> this situation, it is critical that you understand early what users
>> need to be able to do.  This is what the improvement proposal process
>> should focus on: Goals, non-goals, possible solutions, rejected
>> solutions.  Not class-level design.  Most importantly, it needs a
>> clear, unambiguous outcome that is visible to the public.
>>
>> - Trolling
>> It's not just trolling.  Event time and kafka are technically
>> important and should not be ignored.  I've been banging this drum for
>> years.  These concerns haven't been fully heard and understood by
>> committers.  This is one example of why diversity of enfranchised users
>> is important and governance concerns shouldn't be ignored.
>>
>> - Jira
>> Concretely, automate closing stale jiras after X amount of time.  It's
>> really surprising to me how much reluctance a community of programmers
>> has shown towards automating their own processes around stuff like
>> this (not to mention automatic code formatting of modified files).  I
>> understand the arguments against, but the current alternative doesn't
>> work.
>> Concretely, clearly reject and close jiras.  I have a backlog of 50+
>> kafka jiras, many of which are irrelevant at this point, but I do not
>> feel that I have the political power to close them.
>> Concretely, make it clear who is working on something.  This can be as
>> simple as just "I'm working on this", assign it to me, if I don't
>> follow up in X amount of time, close it or reassign.  That doesn't
>> mean there can't be competing work, but it does mean those people
>> should talk to each other.  Conversely, if committers currently don't
>> have time to work on something that is important, make that clear in
>> the ticket.
>>
>>
>> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen  wrote:
>> > Suggested actions are way at the bottom.
>> >
>> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia 
>> > wrote:
>> >>
>> >> since March. But it's true that other things such as the Kafka source
>> for
>> it didn't have as much design on JIRA.

Anyone interested in Spark & Cloud got time to look at the SPARK-7481 PR?

2016-10-07 Thread Steve Loughran

Some people may have noticed I've been working on adding the packaging, docs &
testing needed to get Spark working with S3, Azure and OpenStack storage into a
Spark distribution:

https://github.com/apache/spark/pull/12004

It's been a WiP, but now I have tests for all three cloud infrastructures
covering basic IO, output committing, dataframe IO and streaming. The core
test coverage is done and the packaging is working.

Which means I'd really like people who want Spark to work with S3, Azure or
their local Swift endpoint to review that PR, ideally going through the
documentation and validating it as well as the code.
It's Hadoop 2.7+ only, with a new profile, "cloud", to pull in the new module 
of the same name.
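
For a concrete picture of what the new tests exercise, here is a minimal
sketch of cloud IO through an S3A endpoint; the bucket name is hypothetical,
and it assumes the cloud module's dependencies are on the classpath and the
fs.s3a.* credential options are configured:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cloud-io-sketch").getOrCreate()
// basic IO and dataframe IO, two of the areas the PR's tests cover
val df = spark.read.text("s3a://my-bucket/input/")  // hypothetical bucket
df.write.parquet("s3a://my-bucket/output/")  // exercises the output committer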

thanks

-Steve

PS: documentation (without templated code rendering):
https://github.com/steveloughran/spark/blob/features/SPARK-7481-cloud/docs/cloud-integration.md




Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
For the improvement proposals, I think one major point was to make them really 
visible to users who are not contributors, so we should do more than sending 
stuff to dev@. One very lightweight idea is to have a new type of JIRA called a 
SIP and have a link to a filter that shows all such JIRAs from 
http://spark.apache.org. I also like the idea of SIP and design doc templates 
(in fact many projects have them).
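
The filter behind such a link could be as simple as the following JQL, where
SIP is the hypothetical new issue type suggested above:

project = SPARK AND issuetype = SIP ORDER BY created DESC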

Matei

> On Oct 7, 2016, at 10:38 AM, Reynold Xin  wrote:
> 
> I called Cody last night and talked about some of the topics in his email. It 
> became clear to me Cody genuinely cares about the project.
> 
> Some of the frustrations come from the success of the project itself becoming 
> very "hot", and it is difficult to get clarity from people who don't dedicate 
> all their time to Spark. In fact, it is in some ways similar to scaling an 
> engineering team in a successful startup: old processes that worked well 
> might not work so well when it gets to a certain size, cultures can get 
> diluted, building culture vs building process, etc.
> 
> I would also really like to have a more visible process for larger changes, 
> especially major user-facing API changes. Historically we upload design docs 
> for major changes, but it is not always consistent, and it is difficult to 
> ensure the quality of the docs, due to the volunteer nature of the organization.
> 
> Some of the more concrete ideas we discussed focus on building a culture to 
> improve clarity:
> 
> - Process: Large changes should have design docs posted on JIRA. One thing 
> Cody and I didn't discuss but an idea that just came to me is we should 
> create a design doc template for the project and ask everybody to follow. The 
> design doc template should also explicitly list goals and non-goals, to make 
> design doc more consistent.
> 
> - Process: Email dev@ to solicit feedback. We have done this for some 
> changes, but again very inconsistently. Just posting something on JIRA isn't 
> sufficient, because there are simply too many JIRAs and the signal gets lost 
> in the noise. While this is generally impossible to enforce because we can't 
> force all volunteers to conform to a process (or they might not even be aware 
> of it), those who are more familiar with the project can help by emailing 
> the dev@ list when they see something that hasn't been posted.
> 
> - Culture: The design doc author(s) should be open to feedback. A design doc 
> should serve as the base for discussion and is by no means the final design. 
> Of course, this does not mean the author has to accept every piece of feedback. They 
> should also be comfortable accepting / rejecting ideas on technical grounds.
> 
> - Process / Culture: For major ongoing projects, it can be useful to have 
> some monthly Google hangouts that are open to the world. I am actually not 
> sure how well this will work, because of the volunteer nature of the project 
> and the need to adjust for time zones for people across the globe, but it seems 
> trying.
> 
> - Culture: Contributors (including committers) should be more direct in 
> setting expectations, including whether they are working on a specific issue, 
> whether they will be working on a specific issue, and whether an issue or pr 
> or jira should be rejected. Most people I know in this community are nice and 
> don't enjoy telling other people no, but it is often more annoying to a 
> contributor to not know anything than getting a no.
> 
> 
> On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia  > wrote:
> 
> Love the idea of a more visible "Spark Improvement Proposal" process that 
> solicits user input on new APIs. For what it's worth, I don't think 
> committers are trying to minimize their own work -- every committer cares 
> about making the software useful for users. However, it is always hard to get 
> user input and so it helps to have this kind of process. I've certainly 
> looked at the *IPs a lot in other software I use just to see the biggest 
> things on the roadmap.
> 
> When you're talking about "changing interfaces", are you talking about public 
> or internal APIs? I do think many people hate changing public APIs and I 
> actually think that's for the best of the project. That's a technical debate, 
> but basically, the worst thing when you're using a piece of software is that 
> the developers constantly ask you to rewrite your app to update to a new 
> version (and thus benefit from bug fixes, etc). Cue anyone who's used 
> Protobuf, or Guava. The "let's get everyone to change their code this 
> release" model works well within a single large company, but doesn't work 
> well for a community, which is why nearly all *very* widely used programming 
> interfaces (I'm talking things like Java standard library, Windows API, etc) 
> almost *never* break backwards compatibility. All this is done within reason 
> though, e.g. we do change things in major releases (2.x, 3.x, etc).

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
I called Cody last night and talked about some of the topics in his email.
It became clear to me Cody genuinely cares about the project.

Some of the frustrations come from the success of the project itself
becoming very "hot", and it is difficult to get clarity from people who
don't dedicate all their time to Spark. In fact, it is in some ways similar
to scaling an engineering team in a successful startup: old processes that
worked well might not work so well when it gets to a certain size, cultures
can get diluted, building culture vs building process, etc.

I would also really like to have a more visible process for larger changes,
especially major user-facing API changes. Historically we upload design
docs for major changes, but it is not always consistent, and it is difficult
to ensure the quality of the docs, due to the volunteer nature of the
organization.

Some of the more concrete ideas we discussed focus on building a culture to
improve clarity:

- Process: Large changes should have design docs posted on JIRA. One thing
Cody and I didn't discuss but an idea that just came to me is we should
create a design doc template for the project and ask everybody to follow.
The design doc template should also explicitly list goals and non-goals, to
make design doc more consistent.

- Process: Email dev@ to solicit feedback. We have done this for some
changes, but again very inconsistently. Just posting something on JIRA isn't
sufficient, because there are simply too many JIRAs and the signal gets lost
in the noise. While this is generally impossible to enforce because we
can't force all volunteers to conform to a process (or they might not even
be aware of it), those who are more familiar with the project can help
by emailing the dev@ list when they see something that hasn't been posted.

- Culture: The design doc author(s) should be open to feedback. A design
doc should serve as the base for discussion and is by no means the final
design. Of course, this does not mean the author has to accept every piece
of feedback. They should also be comfortable accepting / rejecting ideas on
technical grounds.

- Process / Culture: For major ongoing projects, it can be useful to have
some monthly Google hangouts that are open to the world. I am actually not
sure how well this will work, because of the volunteer nature of the project
and the need to adjust for time zones for people across the globe, but it seems
worth trying.

- Culture: Contributors (including committers) should be more direct in
setting expectations, including whether they are working on a specific
issue, whether they will be working on a specific issue, and whether an
issue, PR or JIRA should be rejected. Most people I know in this
community are nice and don't enjoy telling other people no, but it is often
more annoying to a contributor to not know anything than getting a no.


On Fri, Oct 7, 2016 at 10:03 AM, Matei Zaharia 
wrote:

>
> Love the idea of a more visible "Spark Improvement Proposal" process that
> solicits user input on new APIs. For what it's worth, I don't think
> committers are trying to minimize their own work -- every committer cares
> about making the software useful for users. However, it is always hard to
> get user input and so it helps to have this kind of process. I've certainly
> looked at the *IPs a lot in other software I use just to see the biggest
> things on the roadmap.
>
> When you're talking about "changing interfaces", are you talking about
> public or internal APIs? I do think many people hate changing public APIs
> and I actually think that's for the best of the project. That's a technical
> debate, but basically, the worst thing when you're using a piece of
> software is that the developers constantly ask you to rewrite your app to
> update to a new version (and thus benefit from bug fixes, etc). Cue anyone
> who's used Protobuf, or Guava. The "let's get everyone to change their code
> this release" model works well within a single large company, but doesn't
> work well for a community, which is why nearly all *very* widely used
> programming interfaces (I'm talking things like Java standard library,
> Windows API, etc) almost *never* break backwards compatibility. All this is
> done within reason though, e.g. we do change things in major releases (2.x,
> 3.x, etc).
>


Re: Spark Improvement Proposals

2016-10-07 Thread Nicholas Chammas
There are several important discussions happening simultaneously. Should we
perhaps split them up into separate threads? Otherwise it’s really
difficult to follow.

It seems like the discussion about having a more formal “Spark Improvement
Proposal” process should take priority here.

Other discussions that could be fleshed out in separate threads are:

   - Better managing “organic” community contributions (i.e. PRs, JIRA
   issues, etc).
   - Adjusting Spark’s governance model / adding more committers.
   - Discussing / addressing competition to Spark coming out of the Python
   community.

Nick

On Fri, Oct 7, 2016 at 1:04 PM Matei Zaharia matei.zaha...@gmail.com
 wrote:

I think people misunderstood my comment about trolls a bit -- I'm not
> saying to just dismiss what people say, but to focus on what improves the
> project instead of being upset that people criticize stuff. This stuff
> happens all the time to any project in a "hot" area, as Sean said. I don't
> think there's anyone that wants to stop adding features to streaming for
> example, or stop listening to users, etc, or who thinks the project is
> already perfect (I certainly spend much of my time looking at how to
> improve it).
>
> Just to comment on a few things:
>
> On Oct 7, 2016, at 9:16 AM, Holden Karau  wrote:
>
> First off, thanks Cody for taking the time to put together these proposals
> - I think it has kicked off some wonderful discussion.
>
> I think dismissing people's complaints with Spark as largely trolls does
> us a disservice; it’s important for us to recognize our own shortcomings -
> otherwise we are blind to the weak spots where we need to improve and
> instead focus on new features. Parts of the Python community seem to be
> actively looking for alternatives, and I’d obviously like Spark to continue to
> be the place where we come together and collaborate from different
> languages.
>
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve
> been keeping on top of the new ones but largely haven’t looked at the older
> ones) and if there is a committer (maybe Davies or Sean?) who would be able
> to help out with merging them once they are ready that would be awesome.
> I’m at PyData DC this weekend but I’ll also start going through some of the
> older Python JIRAs and seeing if they are still relevant, already fixed, or
> something we are unlikely to be interested in bringing into Spark.
>
>
> It would be great to also hear why people are looking for other stuff at a
> high level -- are there just many small issues in Python, or are there some
> bigger things missing? For example, one thing I'd like to see is easy
> installation of PySpark using pip install pyspark. Another idea would be
> making startup time and initialization easy enough that people use Spark
> regularly on a single machine, as a replacement for multiprocessing.
>
> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.
>
>
> Love the idea of a more visible "Spark Improvement Proposal" process that
> solicits user input on new APIs. For what it's worth, I don't think
> committers are trying to minimize their own work -- every committer cares
> about making the software useful for users. However, it is always hard to
> get user input and so it helps to have this kind of process. I've certainly
> looked at the *IPs a lot in other software I use just to see the biggest
> things on the roadmap.
>
> When you're talking about "changing interfaces", are you talking about
> public or internal APIs? I do think many people hate changing public APIs
> and I actually think that's for the best of the project. That's a technical
> debate, but basically, the worst thing when you're using a piece of
> software is that the developers constantly ask you to rewrite your app to
> update to a new version (and thus benefit from bug fixes, etc). Cue anyone
> who's used Protobuf, or Guava. The "let's get everyone to change their code
> this release" model works well within a single large company, but doesn't
> work well for a community, which is why nearly all *very* widely used
> programming interfaces (I'm talking things like Java standard library,
> Windows API, etc) almost *never* break backwards compatibility.

Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
I think people misunderstood my comment about trolls a bit -- I'm not saying to 
just dismiss what people say, but to focus on what improves the project instead 
of being upset that people criticize stuff. This stuff happens all the time to 
any project in a "hot" area, as Sean said. I don't think there's anyone that 
wants to stop adding features to streaming for example, or stop listening to 
users, etc, or who thinks the project is already perfect (I certainly spend 
much of my time looking at how to improve it).

Just to comment on a few things:

> On Oct 7, 2016, at 9:16 AM, Holden Karau  wrote:
> 
> First off, thanks Cody for taking the time to put together these proposals - 
> I think it has kicked off some wonderful discussion.
> 
> I think dismissing people's complaints with Spark as largely trolls does us a 
> disservice; it’s important for us to recognize our own shortcomings - 
> otherwise we are blind to the weak spots where we need to improve and instead 
> focus on new features. Parts of the Python community seem to be actively 
> looking for alternatives, and I’d obviously like Spark to continue to be the 
> place where we come together and collaborate from different languages.
> 
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve 
> been keeping on top of the new ones but largely haven’t looked at the older 
> ones) and if there is a committer (maybe Davies or Sean?) who would be able 
> to help out with merging them once they are ready that would be awesome. I’m 
> at PyData DC this weekend but I’ll also start going through some of the older 
> Python JIRAs and seeing if they are still relevant, already fixed, or 
> something we are unlikely to be interested in bringing into Spark.

It would be great to also hear why people are looking for other stuff at a high 
level -- are there just many small issues in Python, or are there some bigger 
things missing? For example, one thing I'd like to see is easy installation of 
PySpark using pip install pyspark. Another idea would be making startup time 
and initialization easy enough that people use Spark regularly on a single 
machine, as a replacement for multiprocessing.
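
A minimal sketch of that single-machine usage, assuming Spark 2.0's
SparkSession API (the friction discussed above is in installation and
startup, not in the code itself):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")  // use all cores of the one machine, no cluster needed
  .appName("single-machine")
  .getOrCreate()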

> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.

Love the idea of a more visible "Spark Improvement Proposal" process that 
solicits user input on new APIs. For what it's worth, I don't think committers 
are trying to minimize their own work -- every committer cares about making the 
software useful for users. However, it is always hard to get user input and so 
it helps to have this kind of process. I've certainly looked at the *IPs a lot 
in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public 
or internal APIs? I do think many people hate changing public APIs and I 
actually think that's for the best of the project. That's a technical debate, 
but basically, the worst thing when you're using a piece of software is that 
the developers constantly ask you to rewrite your app to update to a new 
version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, 
or Guava. The "let's get everyone to change their code this release" model 
works well within a single large company, but doesn't work well for a 
community, which is why nearly all *very* widely used programming interfaces 
(I'm talking things like Java standard library, Windows API, etc) almost 
*never* break backwards compatibility. All this is done within reason though, 
e.g. we do change things in major releases (2.x, 3.x, etc).

> - Trolling
> It's not just trolling.  Event time and kafka are technically
> important and should not be ignored.  I've been banging this drum for
> years.  These concerns haven't been fully heard and understood by
> committers.  This one example of why diversity of enfranchised users
> is important and governance concerns shouldn't be ignored.

I agree about empowering people interested here to contribute, but I'm 
wondering, do you think there are technical things that people don't want to 

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
Thanks for replying.
When could you send out the PR?

From: Yanbo Liang
Date: Friday, October 7, 2016, 11:35 PM
To: didi
Cc: "dev@spark.apache.org", "u...@spark.apache.org"
Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?

It's a good question, and I have had a similar requirement in my work. I'm
currently porting the implementation from mllib to ml, and will then expose
the maximum log likelihood. I will send this PR soon.

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote:

Hi,

Do you guys sometimes need to get the log likelihood of EM algorithm in MLLIB?

I mean the value in this line 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Now copying the code here:


val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

// Create new distributions based on the partial assignments
// (often referred to as the "M" step in literature)
val sumWeights = sums.weights.sum

if (shouldDistributeGaussians) {
  val numPartitions = math.min(k, 1024)
  val tuples =
    Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
  val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
    updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
  }.collect().unzip
  Array.copy(ws.toArray, 0, weights, 0, ws.length)
  Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
} else {
  var i = 0
  while (i < k) {
    val (weight, gaussian) =
      updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
    weights(i) = weight
    gaussians(i) = gaussian
    i = i + 1
  }
}

llhp = llh                // current becomes previous
llh = sums.logLikelihood  // this is the freshly computed log-likelihood
iter += 1

compute.destroy(blocking = false)
In my application, I need to know the log likelihood to compare the effect of
different numbers of clusters.
And then I use the cluster number with the maximum log likelihood.

Is it a good idea to expose this value?






Re: Spark Improvement Proposals

2016-10-07 Thread Holden Karau
First off, thanks Cody for taking the time to put together these proposals
- I think it has kicked off some wonderful discussion.

I think dismissing people's complaints with Spark as largely trolls does us
a disservice; it’s important for us to recognize our own shortcomings -
otherwise we are blind to the weak spots where we need to improve and
instead focus on new features. Parts of the Python community seem to be
actively looking for alternatives, and I’d obviously like Spark to continue to
be the place where we come together and collaborate from different
languages.

I’d be more than happy to do a review of the outstanding Python PRs (I’ve
been keeping on top of the new ones but largely haven’t looked at the older
ones) and if there is a committer (maybe Davies or Sean?) who would be able
to help out with merging them once they are ready that would be awesome.
I’m at PyData DC this weekend but I’ll also start going through some of the
older Python JIRAs and seeing if they are still relevant, already fixed, or
something we are unlikely to be interested in bringing into Spark.

I’m giving a talk later on this month on how to get started contributing to
Apache Spark at OSCON London, and when I’ve given this talk before I’ve had
to include a fair number of warnings about the challenges that can face a
new contributor. I’d love to be able to drop those in future versions :)

P.S.

As one of the non-committers who has been working on Spark for several
years (see http://bit.ly/hkspmg ) I have strong feelings around the current
process being used for committers - but since I’m not on the PMC (catch-22
style) it's difficult to have any visibility into the process, so someone
who does will have to weigh in on that :)


On Fri, Oct 7, 2016 at 8:00 AM, Cody Koeninger  wrote:

> Sean, that was very eloquently put, and I 100% agree.  If I ever meet
> you in person, I'll buy you multiple rounds of beverages of your
> choice ;)
> This is probably reiterating some of what you said in a less clear
> manner, but I'll throw more of my 2 cents in.
>
> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.
>
> - Trolling
> It's not just trolling.  Event time and kafka are technically
> important and should not be ignored.  I've been banging this drum for
> years.  These concerns haven't been fully heard and understood by
> committers.  This is one example of why diversity of enfranchised users
> is important and governance concerns shouldn't be ignored.
>
> - Jira
> Concretely, automate closing stale jiras after X amount of time.  It's
> really surprising to me how much reluctance a community of programmers
> has shown towards automating their own processes around stuff like
> this (not to mention automatic code formatting of modified files).  I
> understand the arguments against, but the current alternative doesn't
> work.
> Concretely, clearly reject and close jiras.  I have a backlog of 50+
> kafka jiras, many of which are irrelevant at this point, but I do not
> feel that I have the political power to close them.
> Concretely, make it clear who is working on something.  This can be as
> simple as just "I'm working on this", assign it to me, if I don't
> follow up in X amount of time, close it or reassign.  That doesn't
> mean there can't be competing work, but it does mean those people
> should talk to each other.  Conversely, if committers currently don't
> have time to work on something that is important, make that clear in
> the ticket.
>
>
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen  wrote:
> > Suggested actions are way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia 
> > wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source
> for
> >> it didn't have as much design on JIRA. Nonetheless, this component is
> still
> >> early on and there's still a lot of time to change it, which is
> happening.
> >
> >
> > It's hard to drive design discussions in OSS. Even when diligently
> > publishing design docs, the doc happens after brainstorming, and that
> > happens inside someone's head or in chats.

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question, and I have had a similar requirement in my work. I'm
currently porting the implementation from mllib to ml, and will then expose
the maximum log likelihood. I will send this PR soon.
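
For readers following along, the spark.ml side might end up looking roughly
like the sketch below; summary.logLikelihood stands for the hypothetical
accessor such a PR would add, and the final name could differ:

import org.apache.spark.ml.clustering.GaussianMixture

val gmm = new GaussianMixture().setK(3).setFeaturesCol("features")
val model = gmm.fit(dataset)  // dataset: a DataFrame with a vector "features" column
val ll = model.summary.logLikelihood  // hypothetical accessor proposed here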

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) 
wrote:

>
> Hi,
>
> Do you guys sometimes need to get the log likelihood of EM algorithm in
> MLLIB?
>
> I mean the value in this line:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>
> Now copying the code here:
>
>
> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>
> // Create new distributions based on the partial assignments
> // (often referred to as the "M" step in literature)
> val sumWeights = sums.weights.sum
>
> if (shouldDistributeGaussians) {
>   val numPartitions = math.min(k, 1024)
>   val tuples =
>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>   }.collect().unzip
>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
> } else {
>   var i = 0
>   while (i < k) {
>     val (weight, gaussian) =
>       updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>     weights(i) = weight
>     gaussians(i) = gaussian
>     i = i + 1
>   }
> }
>
> llhp = llh                // current becomes previous
> llh = sums.logLikelihood  // this is the freshly computed log-likelihood
> iter += 1
>
> compute.destroy(blocking = false)
>
> In my application, I need to know the log likelihood to compare the effect of
> different numbers of clusters.
> And then I use the cluster number with the maximum log likelihood.
>
> Is it a good idea to expose this value?
>
>
>
>


Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Sean, that was very eloquently put, and I 100% agree.  If I ever meet
you in person, I'll buy you multiple rounds of beverages of your
choice ;)
This is probably reiterating some of what you said in a less clear
manner, but I'll throw more of my 2 cents in.

- Design.
Yes, design by committee doesn't work.  The best designs are when a
person who understands the problem builds something that works for
them, shares with others, and most importantly iterates when it
doesn't work for others.  This iteration only works if you're willing
to change interfaces, but committer and user goals are not aligned
here.  Users want something that is clearly documented and helps them
get their job done.  Committers (not all) want to minimize interface
change, even at the expense of users being able to do their jobs.  In
this situation, it is critical that you understand early what users
need to be able to do.  This is what the improvement proposal process
should focus on: Goals, non-goals, possible solutions, rejected
solutions.  Not class-level design.  Most importantly, it needs a
clear, unambiguous outcome that is visible to the public.
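
A sketch of what such a proposal outline might contain, drawn directly from
the list above (the section names are illustrative, not an agreed format):

SIP-NNN: short title
Goals:               what users must be able to do once this ships
Non-goals:           what is explicitly out of scope
Possible solutions:  problem-level options, not class-level design
Rejected solutions:  and why they were rejected
Outcome:             a clear, unambiguous resolution, visible to the public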

- Trolling
It's not just trolling.  Event time and kafka are technically
important and should not be ignored.  I've been banging this drum for
years.  These concerns haven't been fully heard and understood by
committers.  This is one example of why diversity of enfranchised users
is important and governance concerns shouldn't be ignored.

- Jira
Concretely, automate closing stale jiras after X amount of time.  It's
really surprising to me how much reluctance a community of programmers
has shown towards automating their own processes around stuff like
this (not to mention automatic code formatting of modified files).  I
understand the arguments against, but the current alternative doesn't
work.
Concretely, clearly reject and close jiras.  I have a backlog of 50+
kafka jiras, many of which are irrelevant at this point, but I do not
feel that I have the political power to close them.
Concretely, make it clear who is working on something.  This can be as
simple as just "I'm working on this", assign it to me, if I don't
follow up in X amount of time, close it or reassign.  That doesn't
mean there can't be competing work, but it does mean those people
should talk to each other.  Conversely, if committers currently don't
have time to work on something that is important, make that clear in
the ticket.
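
To make the automation idea concrete, here is a minimal sketch of the first
half (finding the stale issues) against JIRA's standard REST search endpoint;
the endpoint shape and the one-year JQL cut-off are assumptions to verify
against the ASF JIRA instance:

import java.net.URLEncoder
import scala.io.Source

object StaleJiraReport {
  def main(args: Array[String]): Unit = {
    // Open SPARK issues untouched for a year: the "X amount of time" above.
    val jql = "project = SPARK AND status = Open AND updated <= -365d"
    val url = "https://issues.apache.org/jira/rest/api/2/search" +
      "?maxResults=50&fields=key,summary&jql=" + URLEncoder.encode(jql, "UTF-8")
    // A real bot would parse the JSON and apply a "close" transition via an
    // authenticated POST for each returned key; here we just print the hits.
    println(Source.fromURL(url).mkString)
  }
}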


On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen  wrote:
> Suggested actions are way at the bottom.
>
> On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia 
> wrote:
>>
>> since March. But it's true that other things such as the Kafka source for
>> it didn't have as much design on JIRA. Nonetheless, this component is still
>> early on and there's still a lot of time to change it, which is happening.
>
>
> It's hard to drive design discussions in OSS. Even when diligently
> publishing design docs, the doc happens after brainstorming, and that
> happens inside someone's head or in chats.
>
> The lazy consensus model that works for small changes doesn't work well
> here. If a committer wants a change, that change will basically be made
> modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> nothing done.) However this model means it's hard to significantly change a
> design after draft 1.
>
> I've heard this complaint a few times, and it has never been down to bad
> faith. We should err further towards over-including early and often. I've
> seen some great discussions start more with a problem statement and an RFC,
> not a design doc. Keeping regular contributors enfranchised is essential, so
> that they're willing and able to participate when design time comes. (See
> below.)
>
>
>>
>> 2) About what people say at Reactive Summit -- there will always be
>> trolls, but just ignore them and build a great project. Those of us involved
>> in the project for a while have long seen similar stuff, e.g. a
>
>
> The hype cycle may be turning against Spark, as is normal for this stage of
> maturity. People idealize technologies they don't really use as greener
> grass; it's the things they use and need to work that they love to hate.
>
> I would not dismiss this as just trolling. Customer anecdotes I see suggest
> that Spark underperforms their (inflated) expectations, and generally does
> not Just Work. It takes expertise, tuning, patience, workarounds. And then
> it gets great things done. I do see a gap between how the group here talks
> about the technology, and how the users I see talk about it. The gap
> manifests in attention given to making yet more things, and attention given
> to fixing and project mechanics.
>
> I would also not dismiss criticism of governance. We can recognize some big
> problems that were resolved over even the past 3 months. Usually I hear,
> well, we do better than most projects, right? and that is true. But, Spark
> is bigger and busier than most any other project.

Re: Spark Improvement Proposals

2016-10-07 Thread Sean Owen
Suggested actions are way at the bottom.

On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia 
wrote:

since March. But it's true that other things such as the Kafka source for
it didn't have as much design on JIRA. Nonetheless, this component is still
early on and there's still a lot of time to change it, which is happening.


It's hard to drive design discussions in OSS. Even when diligently
publishing design docs, the doc happens after brainstorming, and that
happens inside someone's head or in chats.

The lazy consensus model that works for small changes doesn't work well
here. If a committer wants a change, that change will basically be made
modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
nothing done.) However this model means it's hard to significantly change a
design after draft 1.

I've heard this complaint a few times, and it has never been down to bad
faith. We should err further towards over-including early and often. I've
seen some great discussions start more with a problem statement and an RFC,
not a design doc. Keeping regular contributors enfranchised is essential,
so that they're willing and able to participate when design time comes.
(See below.)



2) About what people say at Reactive Summit -- there will always be trolls,
but just ignore them and build a great project. Those of us involved in the
project for a while have long seen similar stuff, e.g. a


The hype cycle may be turning against Spark, as is normal for this stage of
maturity. People idealize technologies they don't really use as greener
grass; it's the things they use and need to work that they love to hate.

I would not dismiss this as just trolling. Customer anecdotes I see suggest
that Spark underperforms their (inflated) expectations, and generally does
not Just Work. It takes expertise, tuning, patience, workarounds. And then
it gets great things done. I do see a gap between how the group here talks
about the technology, and how the users I see talk about it. The gap
manifests in attention given to making yet more things, and attention given
to fixing and project mechanics.

I would also not dismiss criticism of governance. We can recognize some big
problems that were resolved over even the past 3 months. Usually I hear,
well, we do better than most projects, right? and that is true. But, Spark
is bigger and busier than most any other project. Exceptional projects need
exceptional governance and we have merely "good". See next.


3) About number and diversity of committers -- the PMC is always working to
expand these, and you should email people on the PMC (or even the whole
list) if you have people you'd like to propose. In


If you're suggesting that it's mostly a matter of asking, then this doesn't
match my experience. I have seen a few people consistently soft-reject most
proposals. The reasons given usually sound like "concerns about quality",
which is probably the right answer to a somewhat wrong question.

We should probably be asking primarily who will net-net add efficiency to
some part of the project's mechanics. Per above, it wouldn't hurt to ask
who would expand coverage and add diversity of perspective too.

I disagree that committers are being added at a sufficient rate. The
overall committer-attention hours is dropping as the project grows -- am I
the only one that perceives many regular committers aren't working nearly
as much as before on the project?

I call it a problem because we have IMHO people who 'qualify', and not
giving them some stake is going to cost the project down the road. Always
Be Recruiting. This is what I would worry about, since the governance and
enfranchisement issues above kind of stem from this.



4) Finally, about better organizing JIRA, marking dead issues, etc, this
would be great and I think we just need a concrete proposal for how to do
it. It would be best to point to an existing process that someone else has
used here BTW so that we can see it in action.


I don't think we're wanting for proposals. I went on and on about it last
year, and don't think anyone disagreed about actions. I wouldn't suggest
that clearing out dead issues is more complex than just putting in time to
do it. It's just grunt work and understandably not appealing. (Thank you
Xiao for your recent run at SQL JIRAs.)

It requires saying 'no', which is hard, because it requires some
conviction. I have encountered reluctance to do this in Spark and think
that culture should change. Is it weird to say that a broader group of
gatekeepers can actually tackle the triage issue with more confidence and
efficiency? That pushing back on 'bad' contributions actually increases
the rate of 'good' ones?

FWIW I also find the project unpleasant to deal with day to day, mostly
because of the scale of the triage, and think we could use all the
qualified help we can get. I am looking to do less with the project over
time, which is no big deal in itself, but is a big deal if these 

Re: Monitoring system extensibility

2016-10-07 Thread Pete Robbins
Which has happened. The last comment was in August, with someone saying it
was important to them. The PR has been around since March and, despite a
request to be reviewed, has not got any committer's attention. Without that,
it is going nowhere. The historic JIRAs requesting other sinks such as
Kafka, OpenTSDB etc have also been ignored.

So for now we continue creating classes in the o.a.s package.
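
For anyone wanting to try the same workaround, here is a minimal sketch of a
custom sink living inside the Spark package namespace. The three-argument
constructor matches what MetricsSystem invokes reflectively in the 2.0.x code
(an assumption worth re-checking against the Spark version in use), and the
reporting period is hard-coded for brevity:

package org.apache.spark.metrics.sink  // workaround: the Sink trait is private[spark]

import java.util.Properties
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}
import org.apache.spark.SecurityManager

private[spark] class StdoutSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  private val reporter = ConsoleReporter.forRegistry(registry).build()

  override def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
  override def stop(): Unit = reporter.stop()
  override def report(): Unit = reporter.report()
}

It would then be wired up from metrics.properties with a line such as
*.sink.stdout.class=org.apache.spark.metrics.sink.StdoutSink.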

On Fri, 7 Oct 2016 at 09:50 Reynold Xin  wrote:

> So to be constructive and in order to actually open up these APIs, it
> would be useful for users to comment on the JIRA ticket on their use cases
> (rather than "I want this to be public"), and then we can design an API
> that would address those use cases. In some cases the solution is to just
> make the existing internal API public. But turning some internal API public
> without thinking about whether those APIs are sufficiently expressive and
> maintainable is not a great way to design APIs in general.
>
> On Friday, October 7, 2016, Pete Robbins  wrote:
>
> I brought this up last year and there was a Jira raised:
> https://issues.apache.org/jira/browse/SPARK-14151
>
> For now I just have my Sink and Source in an o.a.s package name, which is
> not ideal but the only way round this.
>
> On Fri, 7 Oct 2016 at 08:30 Reynold Xin  wrote:
>
> They have always been private, haven't they?
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala
>
>
>
> On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov wrote:
>
> Hi.
>
> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
> `org.apache.spark.metrics.sink.Sink` are defined as private to ‘spark’
> package, so it becomes troublesome to create a new implementation in the
> user’s code (but still possible in a hacky way).
> This seems to be the only missing piece to allow extension of the metrics
> system, and I wonder whether it was a conscious design decision to limit the
> visibility. Is it possible to broaden the visibility scope for these traits
> in the future versions?
>
> Thanks,
> Alexander
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>


Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
So to be constructive and in order to actually open up these APIs, it would
be useful for users to comment on the JIRA ticket on their use cases
(rather than "I want this to be public"), and then we can design an API
that would address those use cases. In some cases the solution is to just
make the existing internal API public. But turning some internal API public
without thinking about whether those APIs are sufficiently expressive and
maintainable is not a great way to design APIs in general.

On Friday, October 7, 2016, Pete Robbins  wrote:

> I brought this up last year and there was a Jira raised:
> https://issues.apache.org/jira/browse/SPARK-14151
>
> For now I just have my Sink and Source in an o.a.s package name, which is
> not ideal but the only way round this.
>
> On Fri, 7 Oct 2016 at 08:30 Reynold Xin wrote:
>
>> They have always been private, haven't they?
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/
>> src/main/scala/org/apache/spark/metrics/source/Source.scala
>>
>>
>>
>> On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov oleyniko...@gmail.com wrote:
>>
>>> Hi.
>>>
>>> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
>>> `org.apache.spark.metrics.sink.Sink` are defined as private to ‘spark’
>>> package, so it becomes troublesome to create a new implementation in the
>>> user’s code (but still possible in a hacky way).
>>> This seems to be the only missing piece to allow extension of the
>>> metrics system, and I wonder whether it was a conscious design decision to
>>> limit the visibility. Is it possible to broaden the visibility scope for
>>> these traits in the future versions?
>>>
>>> Thanks,
>>> Alexander
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
>>>
>>>
>>


Re: Monitoring system extensibility

2016-10-07 Thread Pete Robbins
I brought this up last year and there was a Jira raised:
https://issues.apache.org/jira/browse/SPARK-14151

For now I just have my Sink and Source in an o.a.s package name, which is
not ideal but the only way round this.

On Fri, 7 Oct 2016 at 08:30 Reynold Xin  wrote:

> They have always been private, haven't they?
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala
>
>
>
> On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov wrote:
>
> Hi.
>
> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
> `org.apache.spark.metrics.sink.Sink` are defined as private to ‘spark’
> package, so it becomes troublesome to create a new implementation in the
> user’s code (but still possible in a hacky way).
> This seems to be the only missing piece to allow extension of the metrics
> system, and I wonder whether it was a conscious design decision to limit the
> visibility. Is it possible to broaden the visibility scope for these traits
> in the future versions?
>
> Thanks,
> Alexander
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>


Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部

Hi,

Do you guys sometimes need to get the log likelihood of EM algorithm in MLLIB?

I mean the value in this line 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228

Now copying the code here:


val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

// Create new distributions based on the partial assignments
// (often referred to as the "M" step in literature)
val sumWeights = sums.weights.sum

if (shouldDistributeGaussians) {
  val numPartitions = math.min(k, 1024)
  val tuples =
    Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
  val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
    updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
  }.collect().unzip
  Array.copy(ws.toArray, 0, weights, 0, ws.length)
  Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
} else {
  var i = 0
  while (i < k) {
    val (weight, gaussian) =
      updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
    weights(i) = weight
    gaussians(i) = gaussian
    i = i + 1
  }
}

llhp = llh                // current becomes previous
llh = sums.logLikelihood  // this is the freshly computed log-likelihood
iter += 1

compute.destroy(blocking = false)
In my application, I need to know the log likelihood to compare the effect of
different numbers of clusters.
And then I use the cluster number with the maximum log likelihood.

Is it a good idea to expose this value?
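
For context, the model-selection loop described above might look like the
following sketch, where model.logLikelihood stands in for the hypothetical
accessor this thread proposes to expose (it is not public in MLlib today):

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Fit one model per candidate k and keep the k whose final log-likelihood
// is highest, as described in the message above.
def bestK(data: RDD[Vector], candidates: Seq[Int]): Int = {
  candidates.map { k =>
    val model = new GaussianMixture().setK(k).run(data)
    (k, model.logLikelihood)  // hypothetical accessor, not yet in MLlib
  }.maxBy(_._2)._1
}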





Re: Looking for a Spark-Python expert

2016-10-07 Thread Sean Owen
dev@ is for the project's own development discussions, so not the right
place. user@ is better, but job postings are discouraged in general on ASF
lists. I think people get away with the occasional legitimate, targeted
message prefixed with [JOBS], but I hesitate to open the flood gates,
because we also have no real way of banning the inevitable spam.

On Fri, Oct 7, 2016 at 8:45 AM Boris Lenzinger 
wrote:

>
> Hi all,
>
> I don't know where to post this announcement, so I really apologize for
> polluting the ML with such a mail.
>
> I'm looking for an expert in Spark 2.0 and its Python API. I have a
> customer that is looking for an expertise mission (for one month, but I
> guess it could spread over two months given the goals to reach).
>
> Here is the context: there is a team (3 people) that is studying
> different solutions for an image processing framework, and Spark has been
> identified as a candidate. So they want to build a proof of concept around
> this with a known use case.
>
> Where does the mission take place? Sophia-Antipolis in France (French
> Riviera). Remote? Not sure, but it could be a good solution. I will check
> and potentially update the post.
>
> Dates: the mission should start, in a perfect world, in mid-October, but
> tell me your availability and I will try to negotiate.
>
> Price: first, let's get in touch and send me your resume (if you are one
> of the authors of the framework, I guess that will be fine as a resume :-)
> but I'm still interested in your general background, so please send one).
>
> I know that the deadlines are quite short, so even if you cannot start
> exactly on those dates, do not hesitate to apply.
>
> I hope that some of you will be interested in this.
>
> Again sorry for posting on the dev list.
>
> Have a nice day,
>
> boris
>
>
>
>


Re: Looking for a Spark-Python expert

2016-10-07 Thread Reynold Xin
Boris,

Thanks for the email, but this is not a list for soliciting job
applications. Please do not post any recruiting messages -- otherwise we
will ban your account.


On Fri, Oct 7, 2016 at 12:44 AM, Boris Lenzinger 
wrote:

>
> Hi all,
>
> I don't know where to post this announcement, so I really apologize for
> polluting the ML with such a mail.
>
> I'm looking for an expert in Spark 2.0 and its Python API. I have a
> customer that is looking for an expertise mission (for one month, but I
> guess it could spread over two months given the goals to reach).
>
> Here is the context: there is a team (3 people) that is studying
> different solutions for an image processing framework, and Spark has been
> identified as a candidate. So they want to build a proof of concept around
> this with a known use case.
>
> Where does the mission take place? Sophia-Antipolis in France (French
> Riviera). Remote? Not sure, but it could be a good solution. I will check
> and potentially update the post.
>
> Dates: the mission should start, in a perfect world, in mid-October, but
> tell me your availability and I will try to negotiate.
>
> Price: first, let's get in touch and send me your resume (if you are one
> of the authors of the framework, I guess that will be fine as a resume :-)
> but I'm still interested in your general background, so please send one).
>
> I know that the deadlines are quite short, so even if you cannot start
> exactly on those dates, do not hesitate to apply.
>
> I hope that some of you will be interested in this.
>
> Again sorry for posting on the dev list.
>
> Have a nice day,
>
> boris
>
>
>
>


Fwd: Looking for a Spark-Python expert

2016-10-07 Thread Boris Lenzinger
Hi all,

I don't know where to post this announcement, so I really apologize for
polluting the ML with such a mail.

I'm looking for an expert in Spark 2.0 and its Python API. I have a
customer that is looking for an expertise mission (for one month, but I
guess it could spread over two months given the goals to reach).

Here is the context: there is a team (3 people) that is studying
different solutions for an image processing framework, and Spark has been
identified as a candidate. So they want to build a proof of concept around
this with a known use case.

Where does the mission take place? Sophia-Antipolis in France (French
Riviera). Remote? Not sure, but it could be a good solution. I will check
and potentially update the post.

Dates: the mission should start, in a perfect world, in mid-October, but
tell me your availability and I will try to negotiate.

Price: first, let's get in touch and send me your resume (if you are one
of the authors of the framework, I guess that will be fine as a resume :-)
but I'm still interested in your general background, so please send one).

I know that the deadlines are quite short, so even if you cannot start
exactly on those dates, do not hesitate to apply.

I hope that some of you will be interested in this.

Again sorry for posting on the dev list.

Have a nice day,

boris


Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
They have always been private, haven't they?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala



On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov 
wrote:

> Hi.
>
> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
> `org.apache.spark.metrics.sink.Sink` are defined as private to ‘spark’
> package, so it becomes troublesome to create a new implementation in the
> user’s code (but still possible in a hacky way).
> This seems to be the only missing piece to allow extension of the metrics
> system and I wonder whether is was conscious design decision to limit the
> visibility. Is it possible to broaden the visibility scope for these traits
> in the future versions?
>
> Thanks,
> Alexander
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-07 Thread Sean Owen
I believe Reynold mentioned he already did that. For anyone following:
https://issues.apache.org/jira/browse/INFRA-12717

On Fri, Oct 7, 2016 at 1:35 AM Luciano Resende  wrote:

> I have created an Infra JIRA to track the issue with the Maven artifacts
> for Spark 2.0.1
>
> On Wed, Oct 5, 2016 at 10:18 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> Yeah I see the apache maven repos have the 2.0.1 artifacts at
>
> https://repository.apache.org/content/repositories/releases/org/apache/spark/spark-core_2.11/
> -- Not sure why they haven't synced to maven central yet
>
> Shivaram
>
> On Wed, Oct 5, 2016 at 8:37 PM, Luciano Resende 
> wrote:
> > It usually doesn't take that long to sync; I still don't see any 2.0.1
> > related artifacts on maven central
> >
> >
> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22%20AND%20v%3A%222.0.1%22
>
>