Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
Let's move the discussion to JIRA. Thanks! On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) wrote: > https://issues.apache.org/jira/browse/SPARK-17825 > > Actually I had created a JIRA. Could you let me know your progress to avoid > duplicated work. > > Thanks! > > From: didi

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
https://issues.apache.org/jira/browse/SPARK-17825 Actually I had created a JIRA. Could you let me know your progress to avoid duplicated work. Thanks! From: didi > Date: Saturday, October 8, 2016, 12:21 AM To: Yanbo Liang

Issue with Spark Streaming with checkpointing in Spark 2.0

2016-10-07 Thread Arijit
In a Spark Streaming sample code I am trying to implicitly convert an RDD to a DS and save it to permanent storage. Below is the snippet of the code I am trying to run. The job runs fine the first time when started with the checkpoint directory empty. However, if I kill and restart the job with the same

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
Alright, looks like there is quite a bit of support. We should wait to hear from more people too. To push this forward, Cody and I will be working together in the next couple of weeks to come up with a concrete, detailed proposal on what this entails, and then we can discuss this the specific

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
Ah yes, on a given JIRA issue the number of watchers is often a better indicator of community interest than votes. But yeah, it could be any metric or formula we want, as long as it yielded a "reasonable" bar to cross for unsolicited contributions to get committer review--or at the very least a

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
I really like the idea of using jira votes (and/or watchers?) as a filter! On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas wrote: > I agree with Cody and others that we need some automation — or at least an > adjusted process — to help us manage organic contributions

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
I agree with Cody and others that we need some automation — or at least an adjusted process — to help us manage organic contributions better. The objections about automated closing being potentially abrasive are understood, but I wouldn’t accept that as a defeat for automation. Instead, it seems

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Yeah, in case it wasn't clear, I was talking about SIPs for major user-facing or cross-cutting changes, not minor feature adds. On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos < stavros.kontopou...@lightbend.com> wrote: > +1 to the SIP label as long as it does not slow down things and it

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> > Without a hell of a lot more work, Assign would be the only strategy > usable. How would the current "subscribe" break?

Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Cody Koeninger
Matei asked: > I agree about empowering people interested here to contribute, but I'm > wondering, do you think there are technical things that people don't want to > work on, or is it a matter of what there's been time to do? It's a matter of mismanagement and miscommunication. The

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
Without a hell of a lot more work, Assign would be the only strategy usable. On Fri, Oct 7, 2016 at 3:25 PM, Michael Armbrust wrote: >> The implementation is totally and completely different however, in ways >> that leak to the end user. > > > Can you elaborate?

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> 0.10 consumers won't work on an earlier broker. > Earlier consumers will (should?) work on a 0.10 broker. > This lines up with my testing. Is there a page I'm missing that describes this? Like does a 0.9 client work with 0.8 broker? Is it always old clients can talk to new brokers but not

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> > The implementation is totally and completely different however, in ways > that leak to the end user. Can you elaborate? Especially in the context of the interface provided by structured streaming.

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Cody Koeninger
0.10 consumers won't work on an earlier broker. Earlier consumers will (should?) work on a 0.10 broker. The main things earlier consumers lack from a user perspective are support for SSL and pre-fetching messages. The implementation is totally and completely different however, in ways that leak

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Reynold Xin
Does Kafka 0.10 work on a Kafka 0.8/0.9 cluster? On Fri, Oct 7, 2016 at 1:14 PM, Jeremy Smith wrote: > +1 > > We're on CDH, and it will probably be a while before they support Kafka > 0.10. At the same time, we don't use their Spark and we're looking forward > to

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Jeremy Smith
+1 We're on CDH, and it will probably be a while before they support Kafka 0.10. At the same time, we don't use their Spark and we're looking forward to upgrading to 2.0.x and using structured streaming. I was just going to write our own Kafka Source implementation which uses the existing

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
+1 to adding an SIP label and linking it from the website. I think it needs - template that focuses it towards soliciting user goals / non goals - clear resolution as to which strategy was chosen to pursue. I'd recommend a vote. Matei asked me to clarify what I meant by changing interfaces, I

Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
We recently merged support for Kafka 0.10.0 in Structured Streaming, but I've been hearing a few people tell me that they are stuck on an older version of Kafka and cannot upgrade. I'm considering revisiting SPARK-17344, but it would be good to

Re: Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Thanks for the answer, Reynold. Yes, I can use the Dataset, but it will not solve the purpose I am supposed to use it for. I am trying to work on a solution where I need to save the case class along with the data in HDFS. Further, this data will move to different folders corresponding to different case

Re: Reading back hdfs files saved as case class

2016-10-07 Thread Reynold Xin
You can use the Dataset API -- it should solve this issue for case classes that are not very complex. On Fri, Oct 7, 2016 at 12:20 PM, Deepak Sharma wrote: > Hi, > I am saving RDD[Example] to HDFS from a Spark program, where Example is a > case class. > Now when I try

Reading back hdfs files saved as case class

2016-10-07 Thread Deepak Sharma
Hi, I am saving RDD[Example] to HDFS from a Spark program, where Example is a case class. Now when I try to read it back, it returns RDD[String] with the content as below: *Example(1,name,value)* The workaround can be to write it as a string to HDFS, read it back as a string, and perform further
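
The string-based workaround mentioned in this message can be illustrated with a minimal sketch. It is written in plain Python rather than Scala so that it runs without Spark, and it is not code from the thread: the `Example` record, its field names (`id`, `name`, `value`), their types, and the helper `parse_example` are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the Scala case class; field names and types are assumed.
Example = namedtuple("Example", ["id", "name", "value"])

def parse_example(line):
    """Parse one 'Example(1,name,value)' text line back into an Example record."""
    prefix, suffix = "Example(", ")"
    text = line.strip()
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("not an Example line: %r" % line)
    body = text[len(prefix):-len(suffix)]
    # Split on the first two commas only, so the last field may itself contain commas.
    raw_id, name, value = body.split(",", 2)
    return Example(int(raw_id), name, value)

record = parse_example("Example(1,name,value)")
```

Parsing the `toString` form like this is fragile (a comma inside the `name` field would be misread), which is one reason a structured format or the Dataset API suggested in the reply may be preferable.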

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
I like the lightweight proposal to add a SIP label. During Spark 2.0 development, Tom (Graves) and I suggested using wiki to track the list of major changes, but that never really materialized due to the overhead. Adding a SIP label on major JIRAs and then link to them prominently on the Spark

Re: Spark Improvement Proposals

2016-10-07 Thread Hyukjin Kwon
I am glad that it was not only what I was thinking. I also do agree with Holden, Sean and Cody. All I wanted to say were all said. 2016-10-08 1:16 GMT+09:00 Holden Karau : > First off, thanks Cody for taking the time to put together these proposals > - I think it has

Anyone interested in Spark & Cloud got time to look at the SPARK-7481 PR?

2016-10-07 Thread Steve Loughran
Some people may have noticed I've been working on adding packaging, docs & testing for getting Spark to work with S3, Azure and OpenStack into a Spark distribution, https://github.com/apache/spark/pull/12004 It's been a WiP, but now I've got tests for all three cloud infrastructures, tests

Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
For the improvement proposals, I think one major point was to make them really visible to users who are not contributors, so we should do more than sending stuff to dev@. One very lightweight idea is to have a new type of JIRA called a SIP and have a link to a filter that shows all such JIRAs

Re: Spark Improvement Proposals

2016-10-07 Thread Reynold Xin
I called Cody last night and talked about some of the topics in his email. It became clear to me Cody genuinely cares about the project. Some of the frustrations come from the success of the project itself becoming very "hot", and it is difficult to get clarity from people who don't dedicate all

Re: Spark Improvement Proposals

2016-10-07 Thread Nicholas Chammas
There are several important discussions happening simultaneously. Should we perhaps split them up into separate threads? Otherwise it’s really difficult to follow. It seems like the discussion about having a more formal “Spark Improvement Proposal” process should take priority here. Other

Re: Spark Improvement Proposals

2016-10-07 Thread Matei Zaharia
I think people misunderstood my comment about trolls a bit -- I'm not saying to just dismiss what people say, but to focus on what improves the project instead of being upset that people criticize stuff. This stuff happens all the time to any project in a "hot" area, as Sean said. I don't think

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
Thanks for replying. When could you send out the PR? From: Yanbo Liang > Date: Friday, October 7, 2016, 11:35 PM To: didi > Cc: "dev@spark.apache.org"

Re: Spark Improvement Proposals

2016-10-07 Thread Holden Karau
First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion. I think dismissing people's complaints with Spark as largely trolls does us a disservice, it’s important for us to recognize our own shortcomings - otherwise we are

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question and I had a similar requirement in my work. I'm copying the implementation from mllib to ml currently, and then exposing the maximum log likelihood. I will send this PR soon. Thanks. Yanbo On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote: > > Hi,

Re: Spark Improvement Proposals

2016-10-07 Thread Cody Koeninger
Sean, that was very eloquently put, and I 100% agree. If I ever meet you in person, I'll buy you multiple rounds of beverages of your choice ;) This is probably reiterating some of what you said in a less clear manner, but I'll throw more of my 2 cents in. - Design. Yes, design by committee

Re: Spark Improvement Proposals

2016-10-07 Thread Sean Owen
Suggested actions way at the bottom. On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia wrote: since March. But it's true that other things such as the Kafka source for it didn't have as much design on JIRA. Nonetheless, this component is still early on and there's still a

Re: Monitoring system extensibility

2016-10-07 Thread Pete Robbins
Which has happened. The last comment being in August with someone saying it was important to them. The PR has been around since March and despite a request to be reviewed has not got any committer's attention. Without that, it is going nowhere. The historic Jiras requesting other sinks such as

Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
So to be constructive and in order to actually open up these APIs, it would be useful for users to comment on the JIRA ticket on their use cases (rather than "I want this to be public"), and then we can design an API that would address those use cases. In some cases the solution is to just make

Re: Monitoring system extensibility

2016-10-07 Thread Pete Robbins
I brought this up last year and there was a Jira raised: https://issues.apache.org/jira/browse/SPARK-14151 For now I just have my Sink and Source in an o.a.s package name, which is not ideal but is the only way round this. On Fri, 7 Oct 2016 at 08:30 Reynold Xin wrote: > They

Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread 安全部
Hi, Do you guys sometimes need to get the log likelihood of EM algorithm in MLLIB? I mean the value in this line https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228 Now copying the code here: val sums =
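
For readers unfamiliar with the quantity being asked about, here is a minimal sketch of the log likelihood of a Gaussian mixture, i.e. the sum over data points of the log of the weighted component densities. It is not the MLlib implementation: it is plain Python, restricted to univariate components for brevity, and the function names `gaussian_pdf` and `gmm_log_likelihood` and the sample data are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def gmm_log_likelihood(points, weights, means, variances):
    """Return sum over points of log(sum_k w_k * N(x | mean_k, variance_k))."""
    total = 0.0
    for x in points:
        mixture_density = sum(
            w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)
        )
        total += math.log(mixture_density)
    return total

# Illustrative data only: two well-separated clusters and a two-component model.
points = [0.0, 0.1, 4.9, 5.2]
ll = gmm_log_likelihood(points, weights=[0.5, 0.5], means=[0.0, 5.0], variances=[1.0, 1.0])
```

EM does not decrease this value from one iteration to the next, so it is commonly used as a convergence check and as a rough way to compare fitted models, which is why exposing it to callers can be useful.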

Re: Looking for a Spark-Python expert

2016-10-07 Thread Sean Owen
dev@ is for the project's own development discussions, so not the right place. user@ is better, but job postings are discouraged in general on ASF lists. I think people get away with the occasional legitimate, targeted message prefixed with [JOBS], but I hesitate to open the flood gates, because

Re: Looking for a Spark-Python expert

2016-10-07 Thread Reynold Xin
Boris, Thanks for the email, but this is not a list for soliciting job applications. Please do not post any recruiting messages -- otherwise we will ban your account. On Fri, Oct 7, 2016 at 12:44 AM, Boris Lenzinger wrote: > > Hi all, > > I don't know where to post

Fwd: Looking for a Spark-Python expert

2016-10-07 Thread Boris Lenzinger
Hi all, I don't know where to post this announcement, so I really apologize for polluting the mailing list with such a mail. I'm looking for an expert in Spark 2.0 and its Python API. I have a customer that is looking for an expertise mission (for one month, but I guess it can spread over 2 months seeing the goals to

Re: Monitoring system extensibility

2016-10-07 Thread Reynold Xin
They have always been private, haven't they? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov wrote: > Hi. > > As of v2.0.1, the traits

Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-07 Thread Sean Owen
I believe Reynold mentioned he already did that. For anyone following: https://issues.apache.org/jira/browse/INFRA-12717 On Fri, Oct 7, 2016 at 1:35 AM Luciano Resende wrote: > I have created an Infra JIRA to track the issue with the maven artifacts > for Spark 2.0.1 > > On