Re: Spark build time

2015-04-22 Thread Nicholas Chammas
I suggest searching the archives for this list as there were several previous discussions about this problem. JIRA also has several issues related to this. Some pointers: - SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431: Parallelize Scala/Java test execution -

Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Chunnan Yao
Hi all, I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation was wrong; it turns out it's an issue in MLlib. Fortunately, I've figured it out. I suggest adding a hint to the MLlib user documentation (as far as I know,
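
Not from the original mail: a minimal Scala sketch of the ordering requirement, assuming an existing SparkContext `sc` and made-up values; indices handed to Vectors.sparse are expected to be in increasing order, and unordered indices are not rejected, which is the issue described above.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Indices passed to Vectors.sparse are assumed to be strictly increasing;
    // they are not validated, and unordered indices may produce incorrect
    // results downstream, e.g. in RowMatrix.computeSVD.
    val rows = sc.parallelize(Seq(
      Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)),  // ordered indices: OK
      Vectors.sparse(3, Array(1, 2), Array(2.0, 4.0))
    ))
    val svd = new RowMatrix(rows).computeSVD(2, computeU = true)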

Re: python/run-tests fails at spark master branch

2015-04-22 Thread Saisai Shao
Hi Hrishikesh, It seems the behavior of kafka-assembly is a little different when using Maven versus sbt. The assembly jar name and location are different when using `mvn package`. This is actually a bug; I'm fixing it now. Thanks Jerry 2015-04-22 13:37 GMT+08:00 Hrishikesh Subramonian

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (you can specify the number of partitions in updateStateByKey)? On Wed, Apr 22,
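
A minimal sketch of passing an explicit partition count to updateStateByKey, as suggested above; the socket source, checkpoint path, state type, and the value 64 are all made up, the point is only the numPartitions argument.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("state-sketch"), Seconds(10))
    ssc.checkpoint("/tmp/state-checkpoint")  // updateStateByKey requires checkpointing

    // Hypothetical input: count occurrences of each word seen on a socket.
    val events = ssc.socketTextStream("localhost", 9999).map(word => (word, 1L))

    val updateCount: (Seq[Long], Option[Long]) => Option[Long] =
      (values, state) => Some(state.getOrElse(0L) + values.sum)

    // The second argument sets how many partitions the state RDDs use,
    // spreading the state (and the memory it needs) across more tasks.
    val counts = events.updateStateByKey(updateCount, numPartitions = 64)
    counts.print()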

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Reynold Xin
It is actually different. The coalesce expression picks the first value that is not null: https://msdn.microsoft.com/en-us/library/ms190349.aspx It would be great to update the documentation for it (both Scala and Java) to explain that it is different from the coalesce function on a DataFrame/RDD. Do
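
A brief sketch contrasting the two meanings; `df` is a hypothetical DataFrame, and org.apache.spark.sql.functions.coalesce is assumed to be available in the Spark version at hand (per the later messages in this thread, in 1.3 it may only exist as a Catalyst expression).

    import org.apache.spark.sql.functions.{coalesce, lit}

    // Expression-level coalesce: for each row, take the first non-null value.
    val filled = df.select(coalesce(df("a"), lit(0.0)).as("a_filled"))

    // The identically named RDD/DataFrame method only reduces the number of
    // partitions; it has nothing to do with null handling.
    val fewerPartitions = df.rdd.coalesce(4)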

Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let

Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
One overarching issue is that it's pretty unclear what 'Assigned to X' in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive

GradientBoostedTrees leaks a persisted RDD

2015-04-22 Thread jimfcarroll
Hi all, It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181 In 1.3.1 it's here:
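
A generic, hedged sketch of the fix pattern being described (not the actual MLlib patch): persist an intermediate RDD for the duration of a computation and unpersist it once it is no longer needed.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Generic sketch: cache an intermediate RDD for the duration of a
    // computation, then release the cached blocks explicitly afterwards.
    def withCached[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
      val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
      try body(cached)
      finally cached.unpersist(blocking = false)
    }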

Re: GradientBoostedTrees leaks a persisted RDD

2015-04-22 Thread Joseph Bradley
Hi Jim, You're right; that should be unpersisted. Could you please create a JIRA and submit a patch? Thanks! Joseph On Wed, Apr 22, 2015 at 6:00 PM, jimfcarroll jimfcarr...@gmail.com wrote: Hi all, It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it.

Re: Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Joseph Bradley
Hi Chunnan, There is currently Scala documentation for the constructor parameters: https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515 There is one benefit to not checking for validity (ordering)
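
Given that the constructor skips validity checks for performance, a caller-side sketch (the helper name is made up) that sorts the (index, value) pairs before building the vector:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical helper: sort parallel index/value arrays before calling
    // Vectors.sparse, which assumes increasing indices and does not check.
    def sortedSparse(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
      val sorted = indices.zip(values).sortBy(_._1)
      Vectors.sparse(size, sorted.map(_._1), sorted.map(_._2))
    }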

Re: Should we let everyone set Assignee?

2015-04-22 Thread Nicholas Chammas
To repeat what Patrick said (literally): If an issue is “assigned” to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli
Last one for the day. Everyone, as I said clearly, I was not alluding to anything fishy in practice, I was describing how things go wrong in such an environment. Sandy's email lays down some of these problems. Assigning a JIRA in other projects is not a reservation. It is a clear intention

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli
I watch these lists, so I have a fair understanding of how things work around here. I don't give direct input in the day to day activities though, like Greg Stein on the other thread, so I can understand if it looks like it came from up above. Apache Members come around and give opinions time

Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
Sandy - I definitely agree with that. We should have a convention for signaling that someone intends to work on an issue - for instance by commenting on the JIRA - and we should document this in the contribution guide. The nice thing about having that convention is that multiple people can say they are going to work

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other projects is their potential to coordinate and prevent duplicate work. It's really frustrating to put a lot of work into a patch and then find out that someone else has been doing the same. It's helpful for project etiquette to

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
I can get behind that point of view too. That's what I've told people who expect Assignee is a necessary part of workflow. The existence of a PR link is a signal someone's working on it. In that case we need not do anything. On Wed, Apr 22, 2015 at 8:32 PM, Patrick Wendell pwend...@gmail.com

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli
Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And, may I say, in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege of the committer. Not alluding to anything fishy in

Re: Should we let everyone set Assignee?

2015-04-22 Thread Reynold Xin
Whoa, hold on a minute. Spark has been among the projects that are the most welcoming to new contributors. And thanks to this, the sheer number of activities in Spark is much larger than in other projects, and our workflow has to accommodate this fact. In practice, people just create pull requests on

Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
Hi Vinod, Thanks for your thoughts - however, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome. - Patrick On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli

Re: Should we let everyone set Assignee?

2015-04-22 Thread Mark Hamstra
Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar. On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Vinod, Thanks for your thoughts - however, I do not agree with your sentiment and implications. Spark

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
I think you misread the thread, since that's the opposite of what Patrick suggested. He's suggesting that *nobody ever waits* to be assigned a JIRA to work on it; that anyone may work on a JIRA without waiting for it to be assigned. The point is: assigning JIRAs discourages others from doing work

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli
If what you say is true, what is the reason for this committer-only-assigns-JIRA-tickets policy? If anyone can send a pull request, anyone should be able to assign tickets to himself/herself too. +Vinod On Apr 22, 2015, at 1:18 PM, Reynold Xin r...@databricks.com

Re: Should we let everyone set Assignee?

2015-04-22 Thread Ganelin, Ilya
As a contributor, I've never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I've expressed interest in more challenging work and the response I received was generally "go ahead and give it a shot, just understand that this is sensitive

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Akhil Das
There were some PRs about graphical representation with D3.js; you can find them on GitHub. Here are a few of them: https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3 Thanks Best Regards On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear

Re: Addition of new Metrics for killed executors.

2015-04-22 Thread twinkle sachdeva
Hi, Looks interesting. It would be quite interesting to know the reason for not showing these stats in the UI. The description by Patrick W in https://spark-project.atlassian.net/browse/SPARK-999 does not mention any exception w.r.t. failed tasks/executors. Can

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Punyashloka Biswal
Thanks for the pointers! It looks like others are pretty active on this so I'll comment on those PRs and try to coordinate before starting any new work. Punya On Wed, Apr 22, 2015 at 2:49 AM Akhil Das ak...@sigmoidanalytics.com wrote: There were some PRs about graphical representation with

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
Anyone? On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra sourav.chan...@livestream.com wrote: Hi Olivier, the update function is as below: val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => { val previousCount = state.getOrElse((0L, 0L))._2

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Olivier Girardot
Where should this *coalesce* come from? Is it related to the partition-manipulation coalesce method? Thanks! On Mon, Apr 20, 2015 at 10:48 PM, Reynold Xin r...@databricks.com wrote: Ah, I see. You can do something like df.select(coalesce(df(a), lit(0.0))) On Mon, Apr 20, 2015 at 1:44 PM,

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Olivier Girardot
I think I found the Coalesce you were talking about, but this is a Catalyst class that I think is not available from pyspark. Regards, Olivier. On Wed, Apr 22, 2015 at 11:56 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Where should this *coalesce* come from? Is it related to

RE: Is spark-ec2 for production use?

2015-04-22 Thread nate
Calling it a production-ready replacement is more than a stretch; the UX just isn't there yet for the average end user wanting push-button setup. Until a bit ago the focus was heavily on infrastructure folks and people building their own distros. The project is turning towards end users, so anyone from ops to

Re: Spark build time

2015-04-22 Thread Olivier Girardot
I agree, that's what I did :) I was just wondering whether this was considered a problem or something to work on - I personally think so, because the feedback loop should be as quick as possible - and therefore whether there was someone I could help. On Tue, Apr 21, 2015 at 10:20 PM, Reynold Xin r...@databricks.com

Pipeline in pyspark

2015-04-22 Thread Suraj Shetiya
Hi, I came across documentation for creating a pipeline in the MLlib library of PySpark. I wanted to know if something similar exists for PySpark input transformations. I have a use case where my input files are in different formats and I would like to convert them to RDDs and store them in memory and
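
The question is about PySpark; as a rough Scala sketch of the kind of chained input transformation described (the record type, paths, and parsing logic are all made up, and `sc` is an existing SparkContext), one could compose per-format loaders and cache the combined RDD:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Rough sketch: parse two hypothetical input formats into a shared
    // record type, union them, and keep the result in memory for later stages.
    case class Record(id: String, value: Double)

    def loadCsv(sc: SparkContext, path: String): RDD[Record] =
      sc.textFile(path).map { line =>
        val cols = line.split(",")
        Record(cols(0), cols(1).toDouble)
      }

    def loadTsv(sc: SparkContext, path: String): RDD[Record] =
      sc.textFile(path).map { line =>
        val cols = line.split("\t")
        Record(cols(0), cols(1).toDouble)
      }

    val records = loadCsv(sc, "/data/a.csv").union(loadTsv(sc, "/data/b.tsv")).cache()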