Re: Spark build time
I suggest searching the archives for this list, as there were several previous discussions about this problem. JIRA also has several related issues. Some pointers:

- SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431: Parallelize Scala/Java test execution
- http://apache-spark-developers-list.1001551.n3.nabble.com/Unit-tests-in-lt-5-minutes-td7757.html
- SPARK-4746 https://issues.apache.org/jira/browse/SPARK-4746: integration tests should be separated from faster unit tests
- http://apache-spark-developers-list.1001551.n3.nabble.com/Building-Spark-with-Pants-td10397.html

The summary is that everyone agrees the long build times are a problem and wants the build and tests to run faster. There are several things that can be done, but they all require a lot of work.

Nick

On Wed, Apr 22, 2015 at 5:25 AM Olivier Girardot ssab...@gmail.com wrote:

I agree, it's what I did :) I was just wondering whether it was considered a problem or something to work on. I personally think so, because the feedback loop should be as quick as possible, and so I wondered whether there was someone I could help.

On Tue, Apr 21, 2015 at 22:20, Reynold Xin r...@databricks.com wrote:

It runs tons of integration tests. I think most developers just let Jenkins run the full suite of them.

On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com wrote:

Hi everyone, I was just wondering about the Spark full build time (including tests); 1h48 seems to me quite... spacious. What's taking most of the time? Is the build mainly integration tests? Is there any roadmap or are there JIRAs dedicated to this that we can chip in on? Regards, Olivier.
Indices of SparseVector must be ordered while computing SVD
Hi all,

I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation was wrong, but it turns out to be an issue in MLlib. Fortunately, I've figured it out.

I suggest adding a hint to the MLlib user documentation (as far as I know, there is no such hint yet) that the indices of a local sparse vector must be in ascending order. Because I was unaware of this point, I spent a lot of time looking for the reason why computeSVD of RowMatrix did not run correctly on sparse data. I don't know how sparse vectors with unordered indices affect other functions, but I believe it is necessary either to let users know or to fix it. Actually, it's very easy to fix: just add a sortBy call in the internal construction of SparseVector.

Here is an example that reproduces the effect of an unordered sparse vector on computeSVD.

    // in spark-shell, Spark 1.3.1
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}

    val sparseData_ordered = Seq(
      Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
      Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
      Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
      Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
    )
    val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))

    val sparseData_not_ordered = Seq(
      Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
      Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
      Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
      Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
    )
    val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))

    // sparseMat_ordered and sparseMat_not_ordered are essentially the same matrix,
    // yet the computeSVD results for the two matrices are different.
    // Users should be notified about this situation.
    println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
    println("===")
    println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))

==

The results are:

ordered:
[-0.10972870132786407,-0.18850811494220537]
[-0.44712472003608356,-0.24828866611663725]
[-0.784520738744303,-0.3080692172910691]
[-0.4154110101064339,0.8988385762953358]

not ordered:
[-0.10830447119599484,-0.1559341848984378]
[-0.4522713511277327,-0.23449829541447448]
[-0.7962382310594706,-0.3130624059305111]
[-0.43131320303494614,0.8453864703362308]

Looking into this issue, I can see that its cause is in RowMatrix.scala (line 629). The implementation of sparse dspr there requires ordered indices, because it scans the indices consecutively to skip empty columns.

- Feel the sparking Spark!

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Indices-of-SparseVector-must-be-ordered-while-computing-SVD-tp11731.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
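As a caller-side workaround for the ordering requirement described in this message, the index and value arrays can be sorted together before the vector is built. The following is only a minimal sketch in plain Scala: `SortedSparse` and `sortByIndex` are hypothetical names introduced for illustration and are not part of MLlib.

```scala
object SortedSparse {
  // Hypothetical helper (not an MLlib API): reorder the (index, value) pairs
  // so that the indices ascend, keeping each value attached to its index.
  def sortByIndex(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length, "indices and values must have the same length")
    val sorted = indices.zip(values).sortBy(_._1)
    (sorted.map(_._1), sorted.map(_._2))
  }
}
```

In spark-shell, the two returned arrays could then be passed to Vectors.sparse(3, idx, vals) in place of the unordered ones. The sort copies the data, so it is something to apply once at construction time rather than inside a hot loop.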
Re: python/run-tests fails at spark master branch
Hi Hrishikesh,

It seems the behavior of the Kafka assembly differs between Maven and sbt. The assembly jar name and location are different when using `mvn package`. This is actually a bug; I'm fixing it now.

Thanks
Jerry

2015-04-22 13:37 GMT+08:00 Hrishikesh Subramonian hrishikesh.subramon...@flytxt.com:

Hi, The *python/run-tests* script executes successfully after I run the *'build/sbt assembly'* command, but the tests fail if I run it after the *'mvn -Dskiptests clean package'* command. Why does it work with *sbt assembly* and not with *mvn package*?

-- Hrishikesh

On Wednesday 22 April 2015 07:38 AM, Saisai Shao wrote:

Hi Hrishikesh, We have now added a Kafka unit test for Python that relies on the Kafka assembly jar, so you need to run `sbt assembly` or `mvn package` first to get an assembly jar.

2015-04-22 1:15 GMT+08:00 Marcelo Vanzin van...@cloudera.com:

On Tue, Apr 21, 2015 at 1:30 AM, Hrishikesh Subramonian hrishikesh.subramon...@flytxt.com wrote:

Run streaming tests ...
Failed to find Spark Streaming Kafka assembly jar in /home/xyz/spark/external/kafka-assembly
You need to build Spark with 'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or 'build/mvn package' before running this program

Is anybody facing the same problem?

Have you built the assemblies before running the tests? (mvn package -DskipTests, or sbt assembly)

-- Marcelo
Re: Spark Streaming updateStateByKey throws OutOfMemory Error
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give it more memory. And have you tried increasing the number of partitions (specify the number of partitions in updateStateByKey)?

On Wed, Apr 22, 2015 at 2:34 AM, Sourav Chandra sourav.chan...@livestream.com wrote:

Anyone?

On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra sourav.chan...@livestream.com wrote:

Hi Olivier, the update function is as below:

    val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
      val previousCount = state.getOrElse((0L, 0L))._2
      var startValue: IConcurrentUsers = ConcurrentViewers(0)
      var currentCount = 0L
      val lastIndexOfConcurrentUsers =
        values.lastIndexWhere(_.isInstanceOf[ConcurrentViewers])
      val subList = values.slice(0, lastIndexOfConcurrentUsers)
      val currentCountFromSubList = subList.foldLeft(startValue)(_ op _).count + previousCount
      val lastConcurrentViewersCount = values(lastIndexOfConcurrentUsers).count
      if (math.abs(lastConcurrentViewersCount - currentCountFromSubList) >= 1) {
        logger.error(
          s"Count using state updation $currentCountFromSubList, " +
          s"ConcurrentUsers count $lastConcurrentViewersCount" +
          s" resetting to $lastConcurrentViewersCount"
        )
        currentCount = lastConcurrentViewersCount
      }
      val remainingValuesList = values.diff(subList)
      startValue = ConcurrentViewers(currentCount)
      currentCount = remainingValuesList.foldLeft(startValue)(_ op _).count
      if (currentCount < 0) {
        logger.error(
          s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
        )
        currentCount = 0
      }
      // to stop pushing subsequent 0 after receiving first 0
      if (currentCount == 0 && previousCount == 0) None
      else Some(previousCount, currentCount)
    }

    trait IConcurrentUsers {
      val count: Long
      def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
    }

    object IConcurrentUsers {
      def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers = (a, b) match {
        case (_, _: ConcurrentViewers) =>
          ConcurrentViewers(b.count)
        case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
          ConcurrentViewers(a.count + b.count)
        case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
          ConcurrentViewers(a.count - b.count)
      }
    }

    case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
    case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
    case class ConcurrentViewers(count: Long) extends IConcurrentUsers

Also, the error stack trace copied from the executor logs is:

    java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
    at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
    at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
    at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
    at
Re: Dataframe.fillna from 1.3.0
It is actually different. The coalesce expression picks the first value that is not null: https://msdn.microsoft.com/en-us/library/ms190349.aspx

It would be great to update the documentation for it (both Scala and Java) to explain that it is different from the coalesce function on a DataFrame/RDD. Do you want to submit a pull request?

On Wed, Apr 22, 2015 at 3:05 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

I think I found the Coalesce you were talking about, but this is a Catalyst class that I think is not available from PySpark. Regards, Olivier.

On Wed, Apr 22, 2015 at 11:56, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Where should this *coalesce* come from? Is it related to the partition manipulation coalesce method? Thanks!

On Mon, Apr 20, 2015 at 22:48, Reynold Xin r...@databricks.com wrote:

Ah ic. You can do something like df.select(coalesce(df("a"), lit(0.0)))

On Mon, Apr 20, 2015 at 1:44 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

From PySpark it seems to me that fillna relies on Java/Scala code, which is why I was wondering. Thank you for answering :)

On Mon, Apr 20, 2015 at 22:22, Reynold Xin r...@databricks.com wrote:

You can just create a fillna function based on the 1.3.1 implementation of fillna, no?

On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

A UDF might be a good idea, no?

On Mon, Apr 20, 2015 at 11:17, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Hi everyone, let's assume I'm stuck on 1.3.0. How can I benefit from the *fillna* API in PySpark? Is there any efficient alternative to mapping the records myself? Regards, Olivier.
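To make the "first value that is not null" semantics discussed in this thread concrete, here is a small plain-Scala analogy. It is only a sketch: Option stands in for a nullable column value, and `CoalesceSketch` and `firstNonNull` are hypothetical names for illustration, not the Spark or Catalyst API.

```scala
object CoalesceSketch {
  // Hypothetical helper: return the first defined candidate, if any.
  // This mirrors what a SQL-style coalesce expression does with NULLs.
  def firstNonNull[A](candidates: Option[A]*): Option[A] =
    candidates.flatten.headOption

  // A column with one missing value; None plays the role of NULL.
  val column: Seq[Option[Double]] = Seq(Some(1.5), None, Some(3.0))

  // Filling the gap with a default, which is the effect fillna has on one column.
  val filled: Seq[Double] = column.map(v => firstNonNull(v, Some(0.0)).get)
}
```

The `.get` is safe here only because the last candidate is always defined. This value-level coalesce is unrelated to the partition-level coalesce method, which is the distinction the reply above asks to have documented.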
Should we let everyone set Assignee?
Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an error, that's not how the current JIRA permission is implemented. I ask because I'm about to ping INFRA to update our scheme.
Re: Should we let everyone set Assignee?
One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something.

If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature just seems off to me overall. IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they intend to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch seems to me like the healthiest workflow, since it is non-exclusive.

To me the main benefit of assigning something ahead of time is that, if a committer really wants to see someone specific work on a patch, it acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.

When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect that if they assign JIRAs to themselves, others won't submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else was officially assigned.
- Patrick
GradientBoostTrees leaks a persisted RDD
Hi all, It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181 In 1.3.1 it's here: https://github.com/apache/spark/blob/v1.3.1/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L138 Let me know if you want a fix for this. Jim
Re: GradientBoostTrees leaks a persisted RDD
Hi Jim, You're right; that should be unpersisted. Could you please create a JIRA and submit a patch? Thanks! Joseph
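The general shape of a fix for a leak like the one reported in this thread is to pair every persist with an unpersist on all exit paths. The sketch below is not the actual GradientBoostedTrees patch: `Cached` is a hypothetical stand-in for an RDD-like handle so that the example runs without Spark, and the choice to unpersist only what this code persisted itself is an assumption about what a caller would want.

```scala
object PersistSketch {
  // Hypothetical stand-in for an RDD-like handle; not a Spark class.
  final class Cached(val name: String) {
    var persisted = false
    def persist(): this.type = { persisted = true; this }
    def unpersist(): this.type = { persisted = false; this }
  }

  // Persist the data for the duration of `body`, then release it again,
  // even if `body` throws. Data the caller persisted beforehand is left alone.
  def withPersisted[A](data: Cached)(body: Cached => A): A = {
    val persistedHere = !data.persisted
    if (persistedHere) data.persist()
    try body(data)
    finally if (persistedHere) data.unpersist()
  }
}
```

With a real RDD the same try/finally shape would apply, using the RDD's own persist and unpersist calls.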
Re: Indices of SparseVector must be ordered while computing SVD
Hi Chunnan,

There is currently Scala documentation for the constructor parameters: https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515

There is one benefit to not checking for validity (ordering) within the constructor: If you need to translate between SparseVector and some other library's type (e.g., Breeze), you can do so with a few reference copies, rather than iterating through or copying the actual data. It might be good to provide this check within Vectors.sparse(), but we'd need to check through MLlib for uses of Vectors.sparse which expect it to be a cheap operation. What do you think?

It is documented in the programming guide too: https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md But perhaps that should be more prominent.

If you think it would be helpful, then please do make a JIRA about adding a check to Vectors.sparse().

Joseph
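For a sense of what a validity check of the kind discussed in this reply might cost, here is a minimal sketch of an ordering test over an index array. `IndexCheck` and `isStrictlyIncreasing` are hypothetical names, and treating duplicate indices as invalid is an assumption; this is not the check MLlib implements.

```scala
object IndexCheck {
  // Hypothetical check: true when the indices are strictly increasing,
  // i.e. sorted in ascending order with no duplicates.
  def isStrictlyIncreasing(indices: Array[Int]): Boolean = {
    var i = 1
    while (i < indices.length) {
      if (indices(i - 1) >= indices(i)) return false
      i += 1
    }
    true
  }
}
```

The check is a single pass that reads the array without copying it, so it is linear in the number of stored entries. Whether that is cheap enough for every existing call site is exactly the open question raised above.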
Re: Should we let everyone set Assignee?
To repeat what Patrick said (literally):

If an issue is “assigned” to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

No-one in the Spark community dictates who gets to do work. When an issue is assigned to someone in JIRA, it’s either because a) they did the work and the issue is now resolved, or b) they are signaling to others that they are working on it. In the case of b), nothing stops other people from working on the issue and it’s quite normal for other people to complete issues that were technically assigned to someone else. There is no land grabbing or stalling. Anyone who has contributed to Spark for any amount of time knows this.

Vinod, I want to take this opportunity to call out the approach to communication you took here. As a random contributor to Spark and active participant on this list, my reaction when I read your email was this:

- You do not know how the Spark community actually works.
- You read a thread that contains some trigger phrases.
- You wrote a lengthy response as a knee-jerk reaction.

I’m not trying to mock, but I want to be direct and honest about how you came off in this thread to me and probably many others.

Why not ask questions first -- many questions? Why not make doubly sure that you understand the situation correctly before responding? In many ways this is much like filing a bug report. “I’m seeing this. It seems wrong to me. Is this expected?” I think we all know from experience that this kind of bug report is polite and will likely lead to a productive discussion. On the other hand: “You’re returning a -1 here? This is obviously wrong! And, boy, lemme tell you how wrong you are!!!” No-one likes to deal with bug reports like this. More importantly, they get in the way of fixing the actual problem, if there is one.
This is not about the Apache Way or not. It’s about basic etiquette and effective communication. I understand that there are legitimate potential concerns here, and it’s important that, as an Apache project, Spark work according to Apache principles. But when some person who has never participated on this list pops up out of nowhere with a lengthy lecture on the Apache Way and whatnot, I have to say that that is not an effective way to communicate. Pretty much the same thing happened with Greg Stein on an earlier thread some months ago about designating maintainers for components.

The concerns are legitimate, I’m sure, and we want to keep Spark in line with the Apache Way. And certainly, there have been many times when a project veered off course and needed to be corrected. But when we want to make things right, I hope we can do it in a way that respectfully and tactfully engages the community. These “lectures delivered from above” -- which is how they come off -- are not helpful.

Nick

On Wed, Apr 22, 2015 at 4:31 PM Ganelin, Ilya ilya.gane...@capitalone.com wrote:

As a contributor, I’ve never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I’ve expressed interest in more challenging work and the response I received was generally “go ahead and give it a shot, just understand that this is sensitive code so we may end up modifying the PR substantially.” Honestly, that seems fine, and in general, I think it’s completely fair to go with the PR model - e.g. if a JIRA has an open PR then it’s an active effort, otherwise it’s fair game unless otherwise stated. At the end of the day, it’s about moving the project forward and the only way to do that is to have actual code in the pipes - speculation and intent don’t really help, and there’s nothing preventing an interested party from submitting a PR against an issue.

Thank you, Ilya Ganelin

On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

Agreed.
The Spark project and community that Vinod describes do not resemble the ones with which I am familiar.

On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Vinod, Thanks for your thoughts. However, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not so Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience do not work,
Re: Should we let everyone set Assignee?
Last one for the day.

Everyone, as I said clearly, I was not alluding to anything fishy in practice; I was describing how things go wrong in such an environment. Sandy's email lays down some of these problems.

Assigning a JIRA in other projects is not a reservation. It is a clear intention of working on design or code. You don't need a new convention of signaling. In almost all other projects, it is assigning tickets - that's how it is used.

+Vinod

On Apr 22, 2015, at 2:37 PM, Patrick Wendell pwend...@gmail.com wrote:

Sandy - I definitely agree with that. We should have a convention of signaling that someone intends to work on something - for instance by commenting on the JIRA - and we should document this in the contribution guide. The nice thing about having that convention is that multiple people can say they are going to work on something, whereas only one person can be given the assignee slot on a JIRA.
Re: Should we let everyone set Assignee?
I watch these lists, so I have a fair understanding of how things work around here. I don't give direct input in the day-to-day activities though, like Greg Stein on the other thread, so I can understand if it looks like it came from up above. Apache Members come around and give opinions from time to time; you don't need to take it as somebody up above forcing things down.

Thanks
+Vinod

On Apr 22, 2015, at 2:33 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

I want to take this opportunity to call out the approach to communication you took here. As a random contributor to Spark and active participant on this list, my reaction when I read your email was this:

* You do not know how the Spark community actually works.
* You read a thread that contains some trigger phrases.
* You wrote a lengthy response as a knee-jerk reaction.

I’m not trying to mock, but I want to be direct and honest about how you came off in this thread to me and probably many others. Why not ask questions first--many questions? Why not make doubly sure that you understand the situation correctly before responding? In many ways this is much like filing a bug report. “I’m seeing this. It seems wrong to me. Is this expected?” I think we all know from experience that this kind of bug report is polite and will likely lead to a productive discussion. On the other hand: “You’re returning a -1 here? This is obviously wrong! And, boy, lemme tell you how wrong you are!!!” No one likes to deal with bug reports like this. More importantly, they get in the way of fixing the actual problem, if there is one.

This is not about the Apache Way or not. It’s about basic etiquette and effective communication. I understand that there are legitimate potential concerns here, and it’s important that, as an Apache project, Spark work according to Apache principles.
But when some person who has never participated on this list pops up out of nowhere with a lengthy lecture on the Apache Way and whatnot, I have to say that that is not an effective way to communicate. Pretty much the same thing happened with Greg Stein on an earlier thread some months ago about designating maintainers for components.

The concerns are legitimate, I’m sure, and we want to keep Spark in line with the Apache Way. And certainly, there have been many times when a project veered off course and needed to be corrected. But when we want to make things right, I hope we can do it in a way that respectfully and tactfully engages the community. These “lectures delivered from above” -- which is how they come off -- are not helpful.

Nick
Re: Should we let everyone set Assignee?
Sandy - I definitely agree with that. We should have a convention for signaling that someone intends to work on an issue - for instance, by commenting on the JIRA - and we should document this in the contribution guide. The nice thing about having that convention is that multiple people can say they are going to work on something, whereas only one person can be given the assignee slot on a JIRA.

On Wed, Apr 22, 2015 at 2:33 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

To repeat what Patrick said (literally): If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

No one in the Spark community dictates who gets to do work. When an issue is assigned to someone in JIRA, it's either because a) they did the work and the issue is now resolved, or b) they are signaling to others that they are working on it. In the case of b), nothing stops other people from working on the issue and it's quite normal for other people to complete issues that were technically assigned to someone else. There is no land grabbing or stalling. Anyone who has contributed to Spark for any amount of time knows this.

Vinod, I want to take this opportunity to call out the approach to communication you took here. As a random contributor to Spark and active participant on this list, my reaction when I read your email was this:

* You do not know how the Spark community actually works.
* You read a thread that contains some trigger phrases.
* You wrote a lengthy response as a knee-jerk reaction.

I'm not trying to mock, but I want to be direct and honest about how you came off in this thread to me and probably many others. Why not ask questions first--many questions? Why not make doubly sure that you understand the situation correctly before responding? In many ways this is much like filing a bug report.
"I'm seeing this. It seems wrong to me. Is this expected?" I think we all know from experience that this kind of bug report is polite and will likely lead to a productive discussion. On the other hand: "You're returning a -1 here? This is obviously wrong! And, boy, lemme tell you how wrong you are!!!" No one likes to deal with bug reports like this. More importantly, they get in the way of fixing the actual problem, if there is one.

This is not about the Apache Way or not. It's about basic etiquette and effective communication. I understand that there are legitimate potential concerns here, and it's important that, as an Apache project, Spark work according to Apache principles. But when some person who has never participated on this list pops up out of nowhere with a lengthy lecture on the Apache Way and whatnot, I have to say that that is not an effective way to communicate. Pretty much the same thing happened with Greg Stein on an earlier thread some months ago about designating maintainers for components.

The concerns are legitimate, I'm sure, and we want to keep Spark in line with the Apache Way. And certainly, there have been many times when a project veered off course and needed to be corrected. But when we want to make things right, I hope we can do it in a way that respectfully and tactfully engages the community. These "lectures delivered from above" -- which is how they come off -- are not helpful.

Nick

On Wed, Apr 22, 2015 at 4:31 PM Ganelin, Ilya ilya.gane...@capitalone.com wrote:

As a contributor, I've never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I've expressed interest in more challenging work and the response I received was generally "go ahead and give it a shot, just understand that this is sensitive code so we may end up modifying the PR substantially." Honestly, that seems fine, and in general, I think it's completely fair to go with the PR model - e.g.
If a JIRA has an open PR then it's an active effort, otherwise it's fair game unless otherwise stated. At the end of the day, it's about moving the project forward and the only way to do that is to have actual code in the pipes - speculation and intent don't really help, and there's nothing preventing an interested party from submitting a PR against an issue.

Thank you,
Ilya Ganelin

On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar.

On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Vinod, Thanks for your thoughts. However, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar
Re: Should we let everyone set Assignee?
I think one of the benefits of assignee fields that I've seen in other projects is their potential to coordinate and prevent duplicate work. It's really frustrating to put a lot of work into a patch and then find out that someone has been doing the same. It's helpful for the project etiquette to include a way to signal to others that you are working or intend to work on a patch. Obviously there are limits to how long someone should be able to hold on to a JIRA without making progress on it, but a signal is still useful. Historically, in other projects, the assignee field serves as this signal. If we don't want to use the assignee field for this, I think it's important to have some alternative, even if it's just encouraging contributors to comment "I'm planning to work on this" on JIRA.

-Sandy

On Wed, Apr 22, 2015 at 1:30 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote:

As a contributor, I've never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I've expressed interest in more challenging work and the response I received was generally "go ahead and give it a shot, just understand that this is sensitive code so we may end up modifying the PR substantially." Honestly, that seems fine, and in general, I think it's completely fair to go with the PR model - e.g., if a JIRA has an open PR then it's an active effort, otherwise it's fair game unless otherwise stated. At the end of the day, it's about moving the project forward and the only way to do that is to have actual code in the pipes - speculation and intent don't really help, and there's nothing preventing an interested party from submitting a PR against an issue.

Thank you,
Ilya Ganelin

On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar.
On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Vinod, Thanks for your thoughts. However, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience does not work; it is highly subjective and leads to group control.

In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do:

- A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
-- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer.
- Common issues
-- Land grabbing: Other contributors can nudge him/her in case of inactivity and take it over. Again, amicably instead of a committer making subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself and is actively debating, but with no real code/docs contribution or any real intention of making progress. Here workable, reviewable code usually wins.

Assigning patches is not a privilege.
Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches
Re: Should we let everyone set Assignee?
I can get behind that point of view too. That's what I've told people who expect that Assignee is a necessary part of the workflow. The existence of a PR link is a signal someone's working on it. In that case we need not do anything.

On Wed, Apr 22, 2015 at 8:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they are intending to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch, IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if you have a committer that really wants to see someone specific work on a patch, it just acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.

When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect if they assign JIRAs to themselves that others won't submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else was officially assigned.

- Patrick

On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:

Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an error, that's not how the current JIRA permission is implemented. I ask because I'm about to ping INFRA to update our scheme.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Should we let everyone set Assignee?
Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience does not work; it is highly subjective and leads to group control.

In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do:

- A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
-- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer.
- Common issues
-- Land grabbing: Other contributors can nudge him/her in case of inactivity and take it over. Again, amicably instead of a committer making subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself and is actively debating, but with no real code/docs contribution or any real intention of making progress. Here workable, reviewable code usually wins.

Assigning patches is not a privilege. Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e.
who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they are intending to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch, IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if you have a committer that really wants to see someone specific work on a patch, it just acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.

When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect if they assign JIRAs to themselves that others won't submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else was officially assigned.

- Patrick

On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:

Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA.
I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an error, that's not how the current JIRA permission is implemented. I ask because I'm about to ping INFRA to update our scheme.
Re: Should we let everyone set Assignee?
Whoa, hold on a minute. Spark has been among the projects that are the most welcoming to new contributors. And thanks to this, the sheer amount of activity in Spark is much larger than in other projects, and our workflow has to accommodate this fact. In practice, people just create pull requests on GitHub, which is a newer, friendlier, better model given the constraints. We even have tools that automatically tag a ticket with a link to the pull requests.

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience does not work; it is highly subjective and leads to group control.

In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do:

- A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
-- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer.
- Common issues
-- Land grabbing: Other contributors can nudge him/her in case of inactivity and take it over. Again, amicably instead of a committer making subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself and is actively debating, but with no real code/docs contribution or any real intention of making progress. Here workable, reviewable code usually wins.

Assigning patches is not a privilege. Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they are intending to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch, IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if you have a committer that really wants to see someone specific work on a patch, it just acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.
When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect if they assign JIRAs to themselves that others won't submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else was officially assigned.

- Patrick

On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:

Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an
Re: Should we let everyone set Assignee?
Hi Vinod, Thanks for your thoughts. However, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience does not work; it is highly subjective and leads to group control.

In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do:

- A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
-- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer.
- Common issues
-- Land grabbing: Other contributors can nudge him/her in case of inactivity and take it over. Again, amicably instead of a committer making subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself and is actively debating, but with no real code/docs contribution or any real intention of making progress. Here workable, reviewable code usually wins.

Assigning patches is not a privilege.
Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they are intending to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch, IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if you have a committer that really wants to see someone specific work on a patch, it just acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.

When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect if they assign JIRAs to themselves that others won't submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else was officially assigned.

- Patrick

On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:

Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an error, that's not how the current JIRA permission is implemented. I ask because I'm about to ping INFRA to update our scheme.
Re: Should we let everyone set Assignee?
Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar.

On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Vinod, Thanks for your thoughts. However, I do not agree with your sentiment and implications. Spark is broadly quite an inclusive project and we spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote:

Actually what this community got away with is pretty much an anti-pattern compared to every other Apache project I have seen. And may I say in a not-so-Apache way. Waiting for a committer to assign a patch to someone leaves it as a privilege to a committer. Not alluding to anything fishy in practice, but this also leaves a lot of open ground for self-interest. Committers defining notions of good fit / level of experience does not work; it is highly subjective and leads to group control.

In terms of semantics, here is what most other projects (dare I say every Apache project?) that I have seen do:

- A new contributor comes in who is not yet added to the JIRA project. He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
-- Assigning a ticket to oneself is a signal to the rest of the community that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be resolved amicably through open communication. On JIRA, or on mailing lists. Not by the whim of a committer.
- Common issues
-- Land grabbing: Other contributors can nudge him/her in case of inactivity and take it over. Again, amicably instead of a committer making subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself and is actively debating, but with no real code/docs contribution or any real intention of making progress.
Here workable, reviewable code usually wins.

Assigning patches is not a privilege. Contributors at Apache are a bunch of volunteers; the PMC should let volunteers contribute as they see fit. We do not assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

One overarching issue is that it's pretty unclear what "Assigned to X" in JIRA means from a process perspective. Personally I actually feel it's better for this to be more historical - i.e. who ended up submitting a patch for this feature that was merged - rather than creating an exclusive reservation for a particular user to work on something. If an issue is assigned to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems overall off to me.

IMO, it's fine if multiple people want to submit competing patches for something, provided everyone comments on JIRA saying they are intending to submit a patch, and everyone understands there is duplicate effort. So commenting with an intention to submit a patch, IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if you have a committer that really wants to see someone specific work on a patch, it just acts as a strong signal that there is someone endorsed to work on that patch. That doesn't mean no one else can submit a patch, but it is IMO more of a warning that there may be existing work which is likely to be high quality, to avoid duplicated effort.

When it was really easy for people to assign features to themselves, I saw a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the author's level of experience.
- People expect if they assign JIRA's to themselves that others won't submit patches, and become upset if they do. - People are discouraged from working on a patch because someone else was officially assigned. - Patrick On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote: Anecdotally, there are a number of people asking to set the Assignee field. This is currently restricted to Committers in JIRA. I know the logic was to prevent people from Assigning a JIRA and then leaving it; it also matters a bit for questions of credit. Still I wonder if it's best to just let people go ahead and set it, as the lesser evil. People can already do a lot like resolve JIRAs and set shepherd and critical priority and all that. I think the intent was to let Developers set this, but maybe due to an error, that's
Re: Should we let everyone set Assignee?
I think you misread the thread, since that's the opposite of what Patrick suggested. He's suggesting that *nobody ever waits* to be assigned a JIRA to work on it; that anyone may work on a JIRA without waiting for it to be assigned.

The point is: assigning JIRAs discourages others from doing work and we don't want to do that. So the pattern so far has been to not use it (except retroactively, to credit the major contributor to the resolution).

The cost of this policy is: oops, maybe you work on something that's already being worked on. That isn't a problem in practice. We already have a way to signal that you're working on a patch: you open a PR. It automatically links to JIRA. Or you can just comment.

I suppose you could also use Assignee as a strong signal that you're working on it, and some people want to do that, and so I was floating the idea of just letting people use it as they like. But I also back the idea of not having a notion of an owner working on a JIRA.
Re: Should we let everyone set Assignee?
If what you say is true, what is the reason for this committer-only-assigns-JIRA-tickets policy? If anyone can send a pull request, anyone should be able to assign tickets to himself/herself too.

+Vinod

On Apr 22, 2015, at 1:18 PM, Reynold Xin r...@databricks.com wrote:

Woh hold on a minute. Spark has been among the projects that are the most welcoming to new contributors. And thanks to this, the sheer amount of activity in Spark is much larger than in other projects, and our workflow has to accommodate this fact.

In practice, people just create pull requests on GitHub, which is a newer, friendlier, better model given the constraints. We even have tools that automatically tag a ticket with a link to the pull requests.
Re: Should we let everyone set Assignee?
As a contributor, I've never felt shut out from the Spark community, nor have I seen any examples of territorial behavior. A few times I've expressed interest in more challenging work and the response I received was generally "go ahead and give it a shot, just understand that this is sensitive code so we may end up modifying the PR substantially." Honestly, that seems fine, and in general, I think it's completely fair to go with the PR model, e.g. if a JIRA has an open PR then it's an active effort, otherwise it's fair game unless otherwise stated.

At the end of the day, it's about moving the project forward and the only way to do that is to have actual code in the pipes. Speculation and intent don't really help, and there's nothing preventing an interested party from submitting a PR against an issue.

Thank you,
Ilya Ganelin
Re: Graphical display of metrics on application UI page
There were some PRs about graphical representation with D3.js; you can find them on GitHub. Here are a few of them: https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=d3

Thanks
Best Regards

On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com wrote:

Dear Spark devs,

Would people find it useful to have a graphical display of metrics (such as duration, GC time, etc.) on the application UI page? Has anybody worked on this before?

Punya
Re: Addition of new Metrics for killed executors.
Hi,

Looks interesting. It would be useful to know the reason these stats are not shown in the UI. Patrick W's description in https://spark-project.atlassian.net/browse/SPARK-999 does not mention any exception for failed tasks/executors. Can somebody please comment on whether this is a bug or intended behaviour, for example because of performance or some other bottleneck?

--Twinkle

On Mon, Apr 20, 2015 at 2:47 PM, Archit Thakur archit279tha...@gmail.com wrote:

Hi Twinkle,

We have a use case where we want to debug how and why an executor got killed. It could be because of a stack overflow, GC, or any other unexpected scenario. In the driver UI there is no information about killed executors, so I was curious how people usually debug those things apart from scanning logs.

The metrics we are planning to add are similar to what we have for non-killed executors [data per stage specifically]: numFailedTasks, executorRunTime, inputBytes, memoryBytesSpilled, etc. Apart from that we also intend to add all the information present in the executors tab for running executors.

Thanks,
Archit Thakur.

On Mon, Apr 20, 2015 at 1:31 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote:

Hi Archit,

What is your use case and what kind of metrics are you planning to add?

Thanks,
Twinkle

On Fri, Apr 17, 2015 at 4:07 PM, Archit Thakur archit279tha...@gmail.com wrote:

Hi,

We are planning to add new metrics in Spark for the executors that got killed during execution. I was just curious why this info is not already present. Is there some reason for not adding it? Any ideas are welcome.

Thanks and Regards,
Archit Thakur.
Re: Graphical display of metrics on application UI page
Thanks for the pointers! It looks like others are pretty active on this so I'll comment on those PRs and try to coordinate before starting any new work.

Punya
Re: Spark Streaming updateStateByKey throws OutOfMemory Error
Anyone?

On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra sourav.chan...@livestream.com wrote:

Hi Olivier,

the update function is as below:

    val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
      val previousCount = state.getOrElse((0L, 0L))._2
      var startValue: IConcurrentUsers = ConcurrentViewers(0)
      var currentCount = 0L
      val lastIndexOfConcurrentUsers =
        values.lastIndexWhere(_.isInstanceOf[ConcurrentViewers])
      val subList = values.slice(0, lastIndexOfConcurrentUsers)
      val currentCountFromSubList = subList.foldLeft(startValue)(_ op _).count + previousCount
      val lastConcurrentViewersCount = values(lastIndexOfConcurrentUsers).count

      if (math.abs(lastConcurrentViewersCount - currentCountFromSubList) >= 1) {
        logger.error(
          s"Count using state updation $currentCountFromSubList, " +
          s"ConcurrentUsers count $lastConcurrentViewersCount" +
          s" resetting to $lastConcurrentViewersCount"
        )
        currentCount = lastConcurrentViewersCount
      }
      val remainingValuesList = values.diff(subList)
      startValue = ConcurrentViewers(currentCount)
      currentCount = remainingValuesList.foldLeft(startValue)(_ op _).count

      if (currentCount < 0) {
        logger.error(
          s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
        )
        currentCount = 0
      }
      // to stop pushing subsequent 0 after receiving first 0
      if (currentCount == 0 && previousCount == 0) None
      else Some(previousCount, currentCount)
    }

    trait IConcurrentUsers {
      val count: Long
      def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
    }

    object IConcurrentUsers {
      def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers = (a, b) match {
        case (_, _: ConcurrentViewers) =>
          ConcurrentViewers(b.count)
        case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
          ConcurrentViewers(a.count + b.count)
        case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
          ConcurrentViewers(a.count - b.count)
      }
    }

    case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
    case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
    case class ConcurrentViewers(count: Long) extends IConcurrentUsers

also the error stack trace copied from executor logs is:

    java.lang.OutOfMemoryError: Java heap space
      at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
      at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
      at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
      at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
      at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
      at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
      at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
      at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
      at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
      at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
      at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
      at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
      at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
      at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
      at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
      at
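For readers following the update function above, here is a minimal plain-Python sketch of the same kind of per-key state update. It is a simplified model, not the poster's code and not a diagnosis of the OutOfMemoryError: the event encoding (`"set"`, `"inc"`, `"dec"`) and the names `apply_event` and `update` are hypothetical, and the fold starts from the previous count rather than reproducing the consistency check in the original. It keeps the two behaviours the post describes: negative counts are reset to 0, and the function returns no state once the count has stayed at 0, which is how an update function tells the framework to drop a key.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding of the three event types from the post:
# ("set", n) ~ ConcurrentViewers(n), ("inc", n) ~ IncrementConcurrentViewers(n),
# ("dec", n) ~ DecrementConcurrentViewers(n).
Event = Tuple[str, int]
# State mirrors the (previousCount, currentCount) pair used in the post.
State = Tuple[int, int]


def apply_event(count: int, event: Event) -> int:
    """Apply one event to a running count."""
    kind, amount = event
    if kind == "set":
        return amount
    if kind == "inc":
        return count + amount
    if kind == "dec":
        return count - amount
    raise ValueError("unknown event kind: %r" % (kind,))


def update(values: List[Event], state: Optional[State]) -> Optional[State]:
    """Fold the new events for one key into its state.

    Returning None means "keep no state for this key".
    """
    previous = state[1] if state is not None else 0
    current = previous
    for event in values:
        current = apply_event(current, event)
    # Reset negative counts to 0, as the original does.
    current = max(current, 0)
    # Stop carrying state once the count has stayed at 0.
    if current == 0 and previous == 0:
        return None
    return (previous, current)
```

A key whose count stays at 0 for two consecutive batches ends up with no state in this model, so idle keys do not accumulate.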
Re: Dataframe.fillna from 1.3.0
Where should this *coalesce* come from? Is it related to the partition manipulation coalesce method?

Thanks!

On Mon, Apr 20, 2015 at 22:48, Reynold Xin r...@databricks.com wrote:

Ah, I see. You can do something like df.select(coalesce(df("a"), lit(0.0)))

On Mon, Apr 20, 2015 at 1:44 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

From PySpark it seems to me that fillna relies on Java/Scala code, which is why I was wondering. Thank you for answering :)

On Mon, Apr 20, 2015 at 22:22, Reynold Xin r...@databricks.com wrote:

You can just create a fillna function based on the 1.3.1 implementation of fillna, no?

On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

A UDF might be a good idea, no?

On Mon, Apr 20, 2015 at 11:17, Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Hi everyone,

let's assume I'm stuck in 1.3.0: how can I benefit from the *fillna* API in PySpark? Is there any efficient alternative to mapping the records myself?

Regards,
Olivier.
Re: Dataframe.fillna from 1.3.0
I think I found the Coalesce you were talking about, but it is a Catalyst class that I think is not available from PySpark.

Regards,
Olivier.
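To make the distinction in this thread concrete: the coalesce in Reynold's snippet is the SQL-style expression that returns its first non-null argument, not the partition-count method of the same name. The sketch below shows that semantics in plain Python on a list of dict rows. It does not use the Spark API; the helper names `coalesce` and `fillna` here are illustrative stand-ins, and a row-by-row version like this is the "mapping the records myself" fallback rather than an efficient replacement.

```python
from typing import Any, Dict, List


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def fillna(rows: List[Dict[str, Any]], column: str, default: Any) -> List[Dict[str, Any]]:
    """Return new rows where a missing or None `column` is replaced by `default`."""
    return [dict(row, **{column: coalesce(row.get(column), default)}) for row in rows]
```

Note that a legitimate value such as 0.0 is kept as-is, because only None counts as missing in this sketch.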
RE: Is spark-ec2 for production use?
Calling it a replacement for "production-ish" use is beyond a stretch; the UX just isn't there yet for the average end user wanting push-button deployment. Up until a bit ago the focus was heavily on infrastructure folks and people building their own distros. The project is turning towards end users, so that anyone from ops to dev/data-hacker will be able to extract value and get moving easily.

If you are brave enough to give it a go and start playing around with it in its current state, you can start by looking at the puppet modules readme: https://github.com/apache/bigtop/tree/master/bigtop-deploy/puppet

It is currently limited (i.e. no yarn or mesos variants, orchestration not added yet); things will be stepping up a great deal heading out of the 1.0 release. If you do try it and run into stuff, hop on the mailing list; docs are another area where updating is needed.

Thanks for the pointer on the JSON feed link, definitely handy for some smoke tests.

-Original Message-
From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Tuesday, April 21, 2015 2:33 PM
To: n...@reactor8.com; Spark dev list
Subject: Re: Is spark-ec2 for production use?

Nate, could you point us to an example of how one would use Big Top as a more production-ish replacement for spark-ec2? I took a look at the project page http://bigtop.apache.org/index.html, but couldn't find any usage examples. Perhaps we can link to them from the spark-ec2 docs.

Regarding tests to validate that Spark was set up correctly, I am using the JSON feed from the Spark master web UI http://stackoverflow.com/a/29659630/877069 for starters. Y'all might find it useful for the same purpose.

Nick

On Tue, Apr 21, 2015 at 5:21 PM n...@reactor8.com wrote:

Several of the Bigtop folks got together last week at ApacheCon; this was a popular topic for next enhancements with Spark-related components after getting 1.0 out the door. Some leading topics were:

- deployment of spark specific clusters
- spark standalone, hdfs
- spark over yarn, hdfs
- spark on mesos (talked to mesos folk about working to include in bigtop post 1.0)
- the above plus variants of other bigtop components (i.e. kafka, zeppelin, demo data generators)

One thing the group would like some help on is tests for Spark environments, so that things can be validated post build/deploy and the CI process enhanced; then if you choose to deploy via Bigtop in test/prod/etc. you know things have gone through a certain amount of rigor beforehand.

Nate

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Tuesday, April 21, 2015 12:46 PM
To: Nicholas Chammas
Cc: Spark dev list
Subject: Re: Is spark-ec2 for production use?

It could be a good idea to document this a bit. The original goals were to give people an easy way to get started with Spark and also to provide a consistent environment for our own experiments and benchmarking of Spark at the AMPLab. Over time I've noticed a huge amount of scope increase in terms of what people want to do, and I do know that many companies run production infrastructure based on launching the EC2 scripts.

My feeling is that the general problem of deploying Spark with other applications and frameworks is fairly well covered by projects which specifically focus on packaging and automation (e.g. Whirr, BigTop, etc). So I'd like to see a narrower focus on just getting a vanilla Spark cluster up and running, and make it clear that customization and extension of that functionality is really not in scope.

This doesn't mean discouraging people from using it for production use cases, but more that they shouldn't expect us to merge and maintain things that seek to do broader integration with other technologies, automation, etc.

- Patrick

On Tue, Apr 21, 2015 at 12:05 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Is spark-ec2 intended for spinning up production Spark clusters? I think the answer is no. However, the docs for spark-ec2 https://spark.apache.org/docs/latest/ec2-scripts.html very much leave that possibility open, and indeed I see many people asking questions or opening issues that stem from some production use case they are trying to fit spark-ec2 to.

Here's the latest example https://issues.apache.org/jira/browse/SPARK-6900?focusedCommentId=14504236&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14504236 of someone using spark-ec2 to power their (presumably) production service.

Shouldn't we actively discourage people from using spark-ec2 in this way? I understand there's no stopping people from doing what they want with it, and certainly the questions and issues we receive about spark-ec2 are still valid, even if they stem from discouraged use cases.

From what I understand, spark-ec2 is intended for quick experimentation, one-off jobs, prototypes, and so forth. If that's the case, it's best to stress this in
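As an illustration of the smoke-test idea raised in this thread (validating a deployed cluster from the master's JSON feed), here is a minimal Python sketch. It works on an already-fetched JSON document, so fetching is left out. The field names `status`, `workers` and `state`, and the value `"ALIVE"`, are assumptions about the feed's layout and should be checked against the actual output of your Spark version; `check_master_status` is a hypothetical helper, not part of Spark or Bigtop.

```python
import json
from typing import List


def check_master_status(payload: str, expected_workers: int) -> List[str]:
    """Return a list of problems found in a master status document.

    An empty list means the cluster passed this (very basic) smoke test.
    Field names are assumptions; adjust them to the real feed.
    """
    status = json.loads(payload)
    problems = []

    # The master itself should report that it is up.
    if status.get("status") != "ALIVE":
        problems.append("master status is %r, expected 'ALIVE'" % status.get("status"))

    # Count only workers that report themselves as alive.
    alive = [w for w in status.get("workers", []) if w.get("state") == "ALIVE"]
    if len(alive) < expected_workers:
        problems.append(
            "only %d of %d expected workers are alive" % (len(alive), expected_workers)
        )
    return problems
```

A post-deploy check could fail the build whenever the returned list is non-empty, which keeps the test independent of how the cluster was launched.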
Re: Spark build time
I agree, it's what I did :) I was just wondering whether it was considered a problem or something to work on. I personally think so, because the feedback loop should be as quick as possible, and therefore whether there was someone I could help.

On Tue, Apr 21, 2015 at 22:20, Reynold Xin r...@databricks.com wrote:

It runs tons of integration tests. I think most developers just let Jenkins run the full suite of them.

On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com wrote:

Hi everyone,

I was just wondering about the Spark full build time (including tests); 1h48 seems to me quite... spacious. What's taking most of the time? Is the build mainly integration tests? Is there any roadmap, or are there JIRAs dedicated to this, that we can chip in on?

Regards,
Olivier.
Pipeline in pyspark
Hi,

I came across documentation for creating a pipeline in the MLlib library of PySpark. I wanted to know if something similar exists for PySpark input transformations. I have a use case where my input files are in different formats, and I would like to convert them to RDDs, keep them in memory, and perform certain custom tasks in a pipeline without storing the data back to disk at any step. I came across luigi (http://luigi.readthedocs.org/en/latest/), but I found that it stores the contents on disk and reloads them for the next phase of the pipeline.

--
Thanks and regards,
Suraj
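One way to picture the use case described above is a chain of stages that pass records to each other in memory. The sketch below is plain Python with generators, not the PySpark or MLlib API: the parsers, the `pipeline` helper and the stage names are all hypothetical. It only illustrates the shape of the idea, where inputs in different formats are normalised into one record stream and each stage consumes the previous stage's output without writing anything to disk.

```python
import csv
import io
import json
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def parse_csv(text: str) -> Iterator[Record]:
    """Yield one record per CSV row, using the header line for field names."""
    for row in csv.DictReader(io.StringIO(text)):
        yield dict(row)


def parse_json_lines(text: str) -> Iterator[Record]:
    """Yield one record per non-empty line of JSON."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical format registry: one parser per input format.
PARSERS = {"csv": parse_csv, "jsonl": parse_json_lines}


def load(sources: List[Tuple[str, str]]) -> Iterator[Record]:
    """Turn (format, text) pairs into a single stream of records."""
    for fmt, text in sources:
        for record in PARSERS[fmt](text):
            yield record


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right; nothing runs until the result is consumed."""
    def run(records: Iterable[Record]) -> Iterable[Record]:
        return reduce(lambda acc, stage: stage(acc), stages, records)
    return run


def normalize(records: Iterable[Record]) -> Iterator[Record]:
    """Give every record the same field types regardless of source format."""
    for r in records:
        yield {"name": r["name"], "value": int(r["value"])}


def keep_large(records: Iterable[Record]) -> Iterator[Record]:
    """Example custom task: keep records whose value is at least 2."""
    for r in records:
        if r["value"] >= 2:
            yield r
```

Because every stage is a generator, records flow through one at a time and no intermediate result is materialised unless a stage chooses to do so.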