Re: Spark build time

2015-04-22 Thread Nicholas Chammas
I suggest searching the archives for this list as there were several
previous discussions about this problem. JIRA also has several issues
related to this.

Some pointers:

   - SPARK-3431 (https://issues.apache.org/jira/browse/SPARK-3431):
     Parallelize Scala/Java test execution
   - http://apache-spark-developers-list.1001551.n3.nabble.com/Unit-tests-in-lt-5-minutes-td7757.html
   - SPARK-4746 (https://issues.apache.org/jira/browse/SPARK-4746):
     integration tests should be separated from faster unit tests
   - http://apache-spark-developers-list.1001551.n3.nabble.com/Building-Spark-with-Pants-td10397.html

The summary is: everyone agrees the long times are a problem and wants the
build and tests to run faster. There are several things that can be done,
but they all require a lot of work.

Nick

On Wed, Apr 22, 2015 at 5:25 AM Olivier Girardot ssab...@gmail.com wrote:

 I agree, it's what I did :)
 I was just wondering if it was considered a problem or something to work
 on (I personally think so, because the feedback loop should be as quick as
 possible), and therefore whether there was some way I could help.

 On Tue, Apr 21, 2015 at 10:20 PM, Reynold Xin r...@databricks.com wrote:

  It runs tons of integration tests. I think most developers just let
  Jenkins run the full suite of them.
 
  On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com
  wrote:
 
  Hi everyone,
  I was just wondering about the Spark full build time (including tests):
  1h48 seems quite long to me. What's taking most of the time? Is the
  build mainly integration tests? Are there any roadmap items or JIRAs
  dedicated to this that we could chip in on?
 
  Regards,
 
  Olivier.
 
 
 



Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Chunnan Yao
Hi all, 
I am using Spark 1.3.1 to write a spectral clustering algorithm. This really
confused me today: at first I thought my implementation was wrong, but it
turns out it's an issue in MLlib. Fortunately, I've figured it out.

I suggest adding a hint to the MLlib user documentation (as far as I know,
there is no such hint yet) that the indices of a local SparseVector must be
in ascending order. Because I was unaware of this point, I spent a lot of
time looking for reasons why computeSVD of RowMatrix did not run correctly
on sparse data. I don't know how a SparseVector with unordered indices
affects other functions, but I believe it is necessary to either let users
know or fix it. Actually, it's very easy to fix: just sort by index during
the internal construction of SparseVector.
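
For illustration, a minimal sketch of that normalization, written as a helper
on top of the public API (the helper name is mine, not MLlib's):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// sort the (index, value) pairs by index before building the vector
def sortedSparse(size: Int, indices: Array[Int], values: Array[Double]): Vector = {
  val pairs = indices.zip(values).sortBy(_._1)
  Vectors.sparse(size, pairs.map(_._1), pairs.map(_._2))
}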

Here is an example to reproduce the effect of unordered SparseVector
indices on computeSVD.
 
//in spark-shell, Spark 1.3.1
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}

val sparseData_ordered = Seq(
  Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
  Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
)
val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))

val sparseData_not_ordered = Seq(
  Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
  Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
  Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
  Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
)
val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))

// Apparently sparseMat_ordered and sparseMat_not_ordered are essentially
// the same matrix. However, the computeSVD results of these two matrices
// are different. Users should be notified about this situation.
println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
println("===")
println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
== 
The results are: 
ordered: 
[-0.10972870132786407,-0.18850811494220537] 
[-0.44712472003608356,-0.24828866611663725] 
[-0.784520738744303,-0.3080692172910691] 
[-0.4154110101064339,0.8988385762953358] 

not ordered: 
[-0.10830447119599484,-0.1559341848984378] 
[-0.4522713511277327,-0.23449829541447448] 
[-0.7962382310594706,-0.3130624059305111] 
[-0.43131320303494614,0.8453864703362308] 

Looking into this issue, I can see its cause is in RowMatrix.scala
(line 629): the implementation of sparse dspr there requires ordered
indices, because it scans the indices consecutively in order to skip
empty columns.



-
Feel the sparking Spark!




Re: python/run-tests fails at spark master branch

2015-04-22 Thread Saisai Shao
Hi Hrishikesh,

It seems the behavior of kafka-assembly is a little different between Maven
and sbt: the assembly jar name and location differ when using `mvn package`.
This is actually a bug; I'm fixing it now.

Thanks
Jerry


2015-04-22 13:37 GMT+08:00 Hrishikesh Subramonian 
hrishikesh.subramon...@flytxt.com:

  Hi,

 python/run-tests executes successfully after I run the 'build/sbt
 assembly' command, but the tests fail if I run it after 'mvn
 -DskipTests clean package'. Why does it work after sbt assembly
 and not after mvn package?

 --
 Hrishikesh

 On Wednesday 22 April 2015 07:38 AM, Saisai Shao wrote:

 Hi Hrishikesh,

  We recently added a Kafka unit test for Python which relies on the Kafka
 assembly jar, so you need to run `sbt assembly` or `mvn package` first to
 get an assembly jar.



 2015-04-22 1:15 GMT+08:00 Marcelo Vanzin van...@cloudera.com:

 On Tue, Apr 21, 2015 at 1:30 AM, Hrishikesh Subramonian
 hrishikesh.subramon...@flytxt.com wrote:

  Run streaming tests ...
  Failed to find Spark Streaming Kafka assembly jar in
  /home/xyz/spark/external/kafka-assembly
  You need to build Spark with  'build/sbt assembly/assembly
  streaming-kafka-assembly/assembly' or 'build/mvn package' before running
  this program
 
 
  Is anybody facing the same problem?

 Have you built the assemblies before running the tests? (mvn package
 -DskipTests, or sbt assembly)


 --
 Marcelo







Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (you can specify the number of partitions in updateStateByKey)?
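
For illustration, a minimal sketch of both suggestions; the memory value, the
partition count, and the `events` DStream are assumptions for the example, not
taken from your code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._  // pair-DStream conversions (not needed on 1.3+, harmless otherwise)

// raise executor memory above the low 1G default when building the context
val conf = new SparkConf().set("spark.executor.memory", "4g")

// assuming events: DStream[(String, IConcurrentUsers)] and the updateFunc
// quoted below, this overload spreads the state RDD over more partitions
val state = events.updateStateByKey(updateFunc, numPartitions = 64)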

On Wed, Apr 22, 2015 at 2:34 AM, Sourav Chandra 
sourav.chan...@livestream.com wrote:

 Anyone?

 On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra 
 sourav.chan...@livestream.com wrote:

  Hi Olivier,
 
   The update function is as below:
  
   val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {
     val previousCount = state.getOrElse((0L, 0L))._2
     var startValue: IConcurrentUsers = ConcurrentViewers(0)
     var currentCount = 0L
     val lastIndexOfConcurrentUsers =
       values.lastIndexWhere(_.isInstanceOf[ConcurrentViewers])
     val subList = values.slice(0, lastIndexOfConcurrentUsers)
     val currentCountFromSubList = subList.foldLeft(startValue)(_ op _).count + previousCount
     val lastConcurrentViewersCount = values(lastIndexOfConcurrentUsers).count
  
     if (math.abs(lastConcurrentViewersCount - currentCountFromSubList) >= 1) {
       logger.error(
         s"Count using state update $currentCountFromSubList, " +
         s"ConcurrentUsers count $lastConcurrentViewersCount," +
         s" resetting to $lastConcurrentViewersCount"
       )
       currentCount = lastConcurrentViewersCount
     }
     val remainingValuesList = values.diff(subList)
     startValue = ConcurrentViewers(currentCount)
     currentCount = remainingValuesList.foldLeft(startValue)(_ op _).count
  
     if (currentCount < 0) {
       logger.error(
         s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
       )
       currentCount = 0
     }
     // to stop pushing subsequent 0s after receiving the first 0
     if (currentCount == 0 && previousCount == 0) None
     else Some((previousCount, currentCount))
   }
  
   trait IConcurrentUsers {
     val count: Long
     def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
   }
  
   object IConcurrentUsers {
     def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers = (a, b) match {
       case (_, _: ConcurrentViewers) =>
         ConcurrentViewers(b.count)
       case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
         ConcurrentViewers(a.count + b.count)
       case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
         ConcurrentViewers(a.count - b.count)
     }
   }
  
   case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
   case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
   case class ConcurrentViewers(count: Long) extends IConcurrentUsers
 
 
   Also, the error stack trace copied from the executor logs is:
  
   java.lang.OutOfMemoryError: Java heap space
       at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
       at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
       at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
       at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
       at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
       at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
       at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:601)
       at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
       at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
       at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
       at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
       at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
       at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
       at ...

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Reynold Xin
It is actually different.

The coalesce expression picks the first value that is not null:
https://msdn.microsoft.com/en-us/library/ms190349.aspx

It would be great to update its documentation (both Scala and Java) to
explain that it is different from the coalesce function on a DataFrame/RDD. Do
you want to submit a pull request?
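
For illustration, a minimal Scala sketch of the two different things named
coalesce, assuming a DataFrame df with a nullable numeric column "a":

import org.apache.spark.sql.functions.{coalesce, lit}

// expression coalesce: per row, picks the first non-null value among its arguments
val filled = df.select(coalesce(df("a"), lit(0.0)))

// partition coalesce: reduces the number of partitions; nothing to do with nulls
val narrowed = df.rdd.coalesce(2)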



On Wed, Apr 22, 2015 at 3:05 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 I think I found the Coalesce you were talking about, but this is a
 Catalyst class that I think is not available from PySpark.

 Regards,

 Olivier.

 On Wed, Apr 22, 2015 at 11:56 AM, Olivier Girardot
 o.girar...@lateral-thoughts.com wrote:

 Where should this *coalesce* come from? Is it related to the
 partition-manipulation coalesce method?
 Thanks!

 On Mon, Apr 20, 2015 at 10:48 PM, Reynold Xin r...@databricks.com wrote:

 Ah, I see. You can do something like:


 df.select(coalesce(df("a"), lit(0.0)))

 On Mon, Apr 20, 2015 at 1:44 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 From PySpark, it seems to me that fillna relies on Java/Scala
 code; that's why I was wondering.
 Thank you for answering :)

 On Mon, Apr 20, 2015 at 10:22 PM, Reynold Xin r...@databricks.com
 wrote:

 You can just create a fillna function based on the 1.3.1 implementation
 of fillna, no?


 On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 A UDF might be a good idea, no?

  On Mon, Apr 20, 2015 at 11:17 AM, Olivier Girardot
  o.girar...@lateral-thoughts.com wrote:

   Hi everyone,
   let's assume I'm stuck on 1.3.0: how can I benefit from the *fillna* API
   in PySpark? Is there any efficient alternative to mapping the records
   myself?
 
  Regards,
 
  Olivier.
 






Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
Anecdotally, there are a number of people asking to set the Assignee
field. This is currently restricted to Committers in JIRA. I know the
logic was to prevent people from Assigning a JIRA and then leaving it;
it also matters a bit for questions of credit.

Still, I wonder if it's best to just let people go ahead and set it, as
the lesser evil. People can already do a lot, like resolve JIRAs and
set "Shepherd" and "Critical" priority and all that.

I think the intent was to let the "Developers" role set this, but maybe due to
an error, that's not how the current JIRA permission is implemented.

I ask because I'm about to ping INFRA to update our scheme.




Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
One overarching issue is that it's pretty unclear what "Assigned to
X" in JIRA means from a process perspective. Personally, I actually
feel it's better for this to be more historical - i.e. who ended up
submitting a patch for this feature that was merged - rather than
creating an exclusive reservation for a particular user to work on
something.

If an issue is assigned to person X, but some other person Y submits
a great patch for it, I think we have some obligation to Spark users
and to the community to merge the better patch. So the idea of
reserving the right to add a feature just seems overall off to me.
IMO, it's fine if multiple people want to submit competing patches for
something, provided everyone comments on JIRA saying they are
intending to submit a patch, and everyone understands there is
duplicate effort. So commenting with an intention to submit a patch,
IMO, seems like the healthiest workflow since it is non-exclusive.

To me the main benefit of assigning something ahead of time is if
you have a committer that really wants to see someone specific work on
a patch, it just acts as a strong signal that there is someone
endorsed to work on that patch. That doesn't mean no one else can
submit a patch, but it is IMO more of a warning that there may be
existing work which is likely to be high quality, to avoid duplicated
effort.

When it was really easy for people to assign features to themselves, I saw
a lot of anti-patterns in the community that seemed unhealthy, specifically:

- It was really unclear what it means semantically if someone is
assigned to a JIRA.
- People assign JIRAs to themselves that aren't a good fit, given the
author's level of experience.
- People expect that if they assign JIRAs to themselves, others won't
submit patches, and become upset if they do.
- People are discouraged from working on a patch because someone else
was officially assigned.

- Patrick

On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of people asking to set the Assignee
 field. This is currently restricted to Committers in JIRA. I know the
 logic was to prevent people from Assigning a JIRA and then leaving it;
 it also matters a bit for questions of credit.

 Still I wonder if it's best to just let people go ahead and set it, as
 the lesser evil. People can already do a lot like resolve JIRAs and
 set shepherd and critical priority and all that.

 I think the intent was to let Developers set this, but maybe due to
 an error, that's not how the current JIRA permission is implemented.

 I ask because I'm about to ping INFRA to update our scheme.






GradientBoostTrees leaks a persisted RDD

2015-04-22 Thread jimfcarroll
Hi all,

It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never
unpersist it. In the master branch it's here:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181

In 1.3.1 it's here:

https://github.com/apache/spark/blob/v1.3.1/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L138
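
For illustration, the usual guard looks something like the sketch below; the
names are made up, and the real fix belongs inside the boost() code linked
above:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// persist the input only if the caller has not already done so, and make
// sure it is unpersisted on every exit path
def withPersisted[T, R](input: RDD[T])(body: RDD[T] => R): R = {
  val persistedHere = input.getStorageLevel == StorageLevel.NONE
  if (persistedHere) input.persist(StorageLevel.MEMORY_AND_DISK)
  try body(input)
  finally if (persistedHere) input.unpersist(blocking = false)
}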

Let me know if you want a fix for this.

Jim








Re: GradientBoostTrees leaks a persisted RDD

2015-04-22 Thread Joseph Bradley
Hi Jim,

You're right; that should be unpersisted.  Could you please create a JIRA
and submit a patch?

Thanks!
Joseph

On Wed, Apr 22, 2015 at 6:00 PM, jimfcarroll jimfcarr...@gmail.com wrote:

 Hi all,

 It appears GradientBoostedTrees.scala can call 'persist' on an RDD and
 never
 unpersist it. In the master branch it's here:


 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L181

 In 1.3.1 it's here:


 https://github.com/apache/spark/blob/v1.3.1/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L138

 Let me know if you want a fix for this.

 Jim









Re: Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Joseph Bradley
Hi Chunnan,

There is currently Scala documentation for the constructor parameters:
https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515

There is one benefit to not checking for validity (ordering) within the
constructor: If you need to translate between SparseVector and some other
library's type (e.g., Breeze), you can do so with a few reference copies,
rather than iterating through or copying the actual data.  It might be good
to provide this check within Vectors.sparse(), but we'd need to check
through MLlib for uses of Vectors.sparse which expect it to be a cheap
operation.  What do you think?
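
For reference, a sketch of the kind of validation Vectors.sparse() could do
(hypothetical code, not current MLlib; it is linear in the number of
non-zeros, which is exactly the cost trade-off mentioned above):

// reject unordered (or duplicate) indices up front
def requireSorted(indices: Array[Int]): Unit = {
  var i = 1
  while (i < indices.length) {
    require(indices(i - 1) < indices(i),
      s"indices must be strictly increasing, but got ${indices.mkString("[", ",", "]")}")
    i += 1
  }
}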

It is documented in the programming guide too:
https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md
But perhaps that should be more prominent.

If you think it would be helpful, then please do make a JIRA about adding a
check to Vectors.sparse().

Joseph

On Wed, Apr 22, 2015 at 8:29 AM, Chunnan Yao yaochun...@gmail.com wrote:

 Hi all,
 I am using Spark 1.3.1 to write a spectral clustering algorithm. This really
 confused me today: at first I thought my implementation was wrong, but it
 turns out it's an issue in MLlib. Fortunately, I've figured it out.

 I suggest adding a hint to the MLlib user documentation (as far as I know,
 there is no such hint yet) that the indices of a local SparseVector must be
 in ascending order. Because I was unaware of this point, I spent a lot of
 time looking for reasons why computeSVD of RowMatrix did not run correctly
 on sparse data. I don't know how a SparseVector with unordered indices
 affects other functions, but I believe it is necessary to either let users
 know or fix it. Actually, it's very easy to fix: just sort by index during
 the internal construction of SparseVector.

 Here is an example to reproduce the effect of unordered SparseVector
 indices on computeSVD.

 //in spark-shell, Spark 1.3.1
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}

 val sparseData_ordered = Seq(
   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
   Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
   Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
 )
 val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))

 val sparseData_not_ordered = Seq(
   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
   Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
   Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
 )
 val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))

 // Apparently sparseMat_ordered and sparseMat_not_ordered are essentially
 // the same matrix. However, the computeSVD results of these two matrices
 // are different. Users should be notified about this situation.
 println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
 println("===")
 println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
 ==
 The results are:
 ordered:
 [-0.10972870132786407,-0.18850811494220537]
 [-0.44712472003608356,-0.24828866611663725]
 [-0.784520738744303,-0.3080692172910691]
 [-0.4154110101064339,0.8988385762953358]

 not ordered:
 [-0.10830447119599484,-0.1559341848984378]
 [-0.4522713511277327,-0.23449829541447448]
 [-0.7962382310594706,-0.3130624059305111]
 [-0.43131320303494614,0.8453864703362308]

 Looking into this issue, I can see its cause is in RowMatrix.scala
 (line 629): the implementation of sparse dspr there requires ordered
 indices, because it scans the indices consecutively in order to skip
 empty columns.



 -
 Feel the sparking Spark!





Re: Should we let everyone set Assignee?

2015-04-22 Thread Nicholas Chammas
To repeat what Patrick said (literally):

If an issue is “assigned” to person X, but some other person Y submits
a great patch for it, I think we have some obligation to Spark users
and to the community to merge the better patch. So the idea of
reserving the right to add a feature, it just seems overall off to me.

No-one in the Spark community dictates who gets to do work. When an issue
is assigned to someone in JIRA, it’s either because a) they did the work
and the issue is now resolved, or b) they are signaling to others that they
are working on it.

In the case of b), nothing stops other people from working on the issue and
it’s quite normal for other people to complete issues that were technically
assigned to someone else. There is no land grabbing or stalling. Anyone who
has contributed to Spark for any amount of time knows this.

Vinod,

I want to take this opportunity to call out the approach to communication
you took here.

As a random contributor to Spark and active participant on this list, my
reaction when I read your email was this:

   - You do not know how the Spark community actually works.
   - You read a thread that contains some trigger phrases.
   - You wrote a lengthy response as a knee-jerk reaction.

I’m not trying to mock, but I want to be direct and honest about how you
came off in this thread to me and probably many others.

Why not ask questions first—many questions? Why not make doubly sure that
you understand the situation correctly before responding?

In many ways this is much like filing a bug report. “I’m seeing this. It
seems wrong to me. Is this expected?” I think we all know from experience
that this kind of bug report is polite and will likely lead to a productive
discussion. On the other hand: “You’re returning a -1 here? This is
obviously wrong! And, boy, lemme tell you how wrong you are!!!” No-one
likes to deal with bug reports like this. More importantly, they get in the
way of fixing the actual problem, if there is one.

This is not about the Apache Way or not. It’s about basic etiquette and
effective communication.

I understand that there are legitimate potential concerns here, and it’s
important that, as an Apache project, Spark work according to Apache
principles. But when some person who has never participated on this list
pops up out of nowhere with a lengthy lecture on the Apache Way and
whatnot, I have to say that that is not an effective way to communicate.
Pretty much the same thing happened with Greg Stein on an earlier thread
some months ago about designating maintainers for components.

The concerns are legitimate, I’m sure, and we want to keep Spark in line
with the Apache Way. And certainly, there have been many times when a
project veered off course and needed to be corrected.

But when we want to make things right, I hope we can do it in a way that
respectfully and tactfully engages the community. These “lectures delivered
from above” — which is how they come off — are not helpful.

Nick

On Wed, Apr 22, 2015 at 4:31 PM Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 As a contributor, I've never felt shut out from the Spark community, nor
 have I seen any examples of territorial behavior. A few times I've
 expressed interest in more challenging work and the response I received
 was generally "go ahead and give it a shot, just understand that this is
 sensitive code so we may end up modifying the PR substantially." Honestly,
 that seems fine, and in general, I think it's completely fair to go with
 the PR model - e.g. if a JIRA has an open PR then it's an active effort,
 otherwise it's fair game unless otherwise stated. At the end of the day,
 it's about moving the project forward and the only way to do that is to
 have actual code in the pipes - speculation and intent don't really help,
 and there's nothing preventing an interested party from submitting a PR
 against an issue.

 Thank you,
 Ilya Ganelin






 On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

 Agreed.  The Spark project and community that Vinod describes do not
 resemble the ones with which I am familiar.
 
 On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hi Vinod,
 
  Thanks for your thoughts - However, I do not agree with your sentiment
  and implications. Spark is broadly quite an inclusive project and we
  spend a lot of effort culturally to help make newcomers feel welcome.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
  vino...@hortonworks.com wrote:
   Actually what this community got away with is pretty much an
  anti-pattern compared to every other Apache project I have seen. And
 may I
  say in a not so Apache way.
  
   Waiting for a committer to assign a patch to someone leaves it as a
  privilege to a committer. Not alluding to anything fishy in practice,
 but
  this also leaves a lot of open ground for self-interest. Committers
  defining notions of good fit / level of experience do not work, 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli

Last one for the day.

Everyone, as I said clearly, I was not alluding to anything fishy in
practice; I was describing how things can go wrong in such an environment.
Sandy's email lays out some of these problems.

Assigning a JIRA in other projects is not a reservation; it is a clear
statement of intent to work on the design or code.

You don't need a new convention for signaling. In almost all other projects,
assigning tickets is that signal - that's how it is used.

+Vinod

On Apr 22, 2015, at 2:37 PM, Patrick Wendell pwend...@gmail.com wrote:

 Sandy - I definitely agree with that. We should have a convention for
 signaling that someone intends to work on an issue - for instance by
 commenting on the JIRA - and we should document this in the contribution
 guide. The nice thing about having that convention is that multiple
 people can say they are going to work on something, whereas only one
 person can be given the assignee slot on a JIRA.
 
 
 On Wed, Apr 22, 2015 at 2:33 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 To repeat what Patrick said (literally):
 
 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature, it just seems overall off to me.
 
 No-one in the Spark community dictates who gets to do work. When an issue is
 assigned to someone in JIRA, it's either because a) they did the work and
 the issue is now resolved, or b) they are signaling to others that they are
 working on it.
 
 In the case of b), nothing stops other people from working on the issue and
 it's quite normal for other people to complete issues that were technically
 assigned to someone else. There is no land grabbing or stalling. Anyone who
 has contributed to Spark for any amount of time knows this.
 
 Vinod,
 
 I want to take this opportunity to call out the approach to communication
 you took here.
 
 As a random contributor to Spark and active participant on this list, my
 reaction when I read your email was this:
 
 You do not know how the Spark community actually works.
 You read a thread that contains some trigger phrases.
 You wrote a lengthy response as a knee-jerk reaction.
 
 I'm not trying to mock, but I want to be direct and honest about how you
 came off in this thread to me and probably many others.
 
 Why not ask questions first--many questions? Why not make doubly sure that
 you understand the situation correctly before responding?
 
 In many ways this is much like filing a bug report. I'm seeing this. It
 seems wrong to me. Is this expected? I think we all know from experience
 that this kind of bug report is polite and will likely lead to a productive
 discussion. On the other hand: You're returning a -1 here? This is
 obviously wrong! And, boy, lemme tell you how wrong you are!!! No-one likes
 to deal with bug reports like this. More importantly, they get in the way of
 fixing the actual problem, if there is one.
 
 This is not about the Apache Way or not. It's about basic etiquette and
 effective communication.
 
 I understand that there are legitimate potential concerns here, and it's
 important that, as an Apache project, Spark work according to Apache
 principles. But when some person who has never participated on this list
 pops up out of nowhere with a lengthy lecture on the Apache Way and whatnot,
 I have to say that that is not an effective way to communicate. Pretty much
 the same thing happened with Greg Stein on an earlier thread some months ago
 about designating maintainers for components.
 
 The concerns are legitimate, I'm sure, and we want to keep Spark in line
 with the Apache Way. And certainly, there have been many times when a
 project veered off course and needed to be corrected.
 
 But when we want to make things right, I hope we can do it in a way that
 respectfully and tactfully engages the community. These lectures delivered
 from above -- which is how they come off -- are not helpful.
 
 Nick
 
 
 On Wed, Apr 22, 2015 at 4:31 PM Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:
 
 As a contributor, I've never felt shut out from the Spark community, nor
 have I seen any examples of territorial behavior. A few times I've
 expressed interest in more challenging work and the response I received
 was generally "go ahead and give it a shot, just understand that this is
 sensitive code so we may end up modifying the PR substantially." Honestly,
 that seems fine, and in general, I think it's completely fair to go with
 the PR model - e.g. if a JIRA has an open PR then it's an active effort,
 otherwise it's fair game unless otherwise stated. At the end of the day,
 it's about moving the project forward and the only way to do that is to
 have actual code in the pipes - speculation and intent don't really help,
 and there's nothing preventing an interested party from submitting a PR
 against an issue.
 
 Thank you,
 Ilya Ganelin
 
 
 
 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli

I watch these lists, so I have a fair understanding of how things work around
here. I don't give direct input in the day-to-day activities though, like Greg
Stein on the other thread, so I can understand if it looks like it came from up
above. Apache Members come around and give opinions from time to time; you
don't need to take it as somebody up above forcing things down.

Thanks
+Vinod

On Apr 22, 2015, at 2:33 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:

I want to take this opportunity to call out the approach to communication you 
took here.

As a random contributor to Spark and active participant on this list, my 
reaction when I read your email was this:

  *   You do not know how the Spark community actually works.
  *   You read a thread that contains some trigger phrases.
  *   You wrote a lengthy response as a knee-jerk reaction.

I’m not trying to mock, but I want to be direct and honest about how you came 
off in this thread to me and probably many others.

Why not ask questions first—many questions? Why not make doubly sure that you 
understand the situation correctly before responding?

In many ways this is much like filing a bug report. “I’m seeing this. It seems 
wrong to me. Is this expected?” I think we all know from experience that this 
kind of bug report is polite and will likely lead to a productive discussion. 
On the other hand: “You’re returning a -1 here? This is obviously wrong! And, 
boy, lemme tell you how wrong you are!!!” No-one likes to deal with bug reports 
like this. More importantly, they get in the way of fixing the actual problem, 
if there is one.

This is not about the Apache Way or not. It’s about basic etiquette and 
effective communication.

I understand that there are legitimate potential concerns here, and it’s 
important that, as an Apache project, Spark work according to Apache 
principles. But when some person who has never participated on this list pops 
up out of nowhere with a lengthy lecture on the Apache Way and whatnot, I have 
to say that that is not an effective way to communicate. Pretty much the same 
thing happened with Greg Stein on an earlier thread some months ago about 
designating maintainers for components.

The concerns are legitimate, I’m sure, and we want to keep Spark in line with 
the Apache Way. And certainly, there have been many times when a project veered 
off course and needed to be corrected.

But when we want to make things right, I hope we can do it in a way that 
respectfully and tactfully engages the community. These “lectures delivered 
from above” — which is how they come off — are not helpful.

Nick


Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
Sandy - I definitely agree with that. We should have a convention for
signaling that someone intends to work on an issue - for instance by
commenting on the JIRA - and we should document this in the contribution
guide. The nice thing about having that convention is that multiple
people can say they are going to work on something, whereas only one
person can be given the assignee slot on a JIRA.


On Wed, Apr 22, 2015 at 2:33 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 To repeat what Patrick said (literally):

 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature, it just seems overall off to me.

 No-one in the Spark community dictates who gets to do work. When an issue is
 assigned to someone in JIRA, it's either because a) they did the work and
 the issue is now resolved, or b) they are signaling to others that they are
 working on it.

 In the case of b), nothing stops other people from working on the issue and
 it's quite normal for other people to complete issues that were technically
 assigned to someone else. There is no land grabbing or stalling. Anyone who
 has contributed to Spark for any amount of time knows this.

 Vinod,

 I want to take this opportunity to call out the approach to communication
 you took here.

 As a random contributor to Spark and active participant on this list, my
 reaction when I read your email was this:

 You do not know how the Spark community actually works.
 You read a thread that contains some trigger phrases.
 You wrote a lengthy response as a knee-jerk reaction.

 I'm not trying to mock, but I want to be direct and honest about how you
 came off in this thread to me and probably many others.

 Why not ask questions first--many questions? Why not make doubly sure that
 you understand the situation correctly before responding?

 In many ways this is much like filing a bug report. I'm seeing this. It
 seems wrong to me. Is this expected? I think we all know from experience
 that this kind of bug report is polite and will likely lead to a productive
 discussion. On the other hand: You're returning a -1 here? This is
 obviously wrong! And, boy, lemme tell you how wrong you are!!! No-one likes
 to deal with bug reports like this. More importantly, they get in the way of
 fixing the actual problem, if there is one.

 This is not about the Apache Way or not. It's about basic etiquette and
 effective communication.

 I understand that there are legitimate potential concerns here, and it's
 important that, as an Apache project, Spark work according to Apache
 principles. But when some person who has never participated on this list
 pops up out of nowhere with a lengthy lecture on the Apache Way and whatnot,
 I have to say that that is not an effective way to communicate. Pretty much
 the same thing happened with Greg Stein on an earlier thread some months ago
 about designating maintainers for components.

 The concerns are legitimate, I'm sure, and we want to keep Spark in line
 with the Apache Way. And certainly, there have been many times when a
  project veered off course and needed to be corrected.

 But when we want to make things right, I hope we can do it in a way that
 respectfully and tactfully engages the community. These lectures delivered
 from above -- which is how they come off -- are not helpful.

 Nick


 On Wed, Apr 22, 2015 at 4:31 PM Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  As a contributor, I've never felt shut out from the Spark community, nor
  have I seen any examples of territorial behavior. A few times I've
  expressed interest in more challenging work and the response I received
  was generally "go ahead and give it a shot, just understand that this is
  sensitive code so we may end up modifying the PR substantially." Honestly,
  that seems fine, and in general, I think it's completely fair to go with
  the PR model - e.g. if a JIRA has an open PR then it's an active effort,
  otherwise it's fair game unless otherwise stated. At the end of the day,
  it's about moving the project forward and the only way to do that is to
  have actual code in the pipes - speculation and intent don't really help,
  and there's nothing preventing an interested party from submitting a PR
  against an issue.

 Thank you,
 Ilya Ganelin






 On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

 Agreed.  The Spark project and community that Vinod describes do not
 resemble the ones with which I am familiar.
 
 On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hi Vinod,
 
   Thanks for your thoughts - However, I do not agree with your sentiment
  and implications. Spark is broadly quite an inclusive project and we
  spend a lot of effort culturally to help make newcomers feel welcome.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sandy Ryza
I think one of the benefits of assignee fields that I've seen in other
projects is their potential to coordinate and prevent duplicate work.  It's
really frustrating to put a lot of work into a patch and then find out that
someone else has been doing the same.  It's helpful for project etiquette to
include a way to signal to others that you are working or intend to work on
a patch.  Obviously there are limits to how long someone should be able to
hold on to a JIRA without making progress on it, but a signal is still
useful.  Historically, in other projects, the assignee field serves as this
signal.  If we don't want to use the assignee field for this, I think it's
important to have some alternative, even if it's just encouraging
contributors to comment "I'm planning to work on this" on JIRA.

-Sandy



On Wed, Apr 22, 2015 at 1:30 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 As a contributor, I've never felt shut out from the Spark community, nor
 have I seen any examples of territorial behavior. A few times I've
 expressed interest in more challenging work and the response I received
 was generally "go ahead and give it a shot, just understand that this is
 sensitive code so we may end up modifying the PR substantially." Honestly,
 that seems fine, and in general, I think it's completely fair to go with
 the PR model - e.g. if a JIRA has an open PR then it's an active effort,
 otherwise it's fair game unless otherwise stated. At the end of the day,
 it's about moving the project forward and the only way to do that is to
 have actual code in the pipes - speculation and intent don't really help,
 and there's nothing preventing an interested party from submitting a PR
 against an issue.

 Thank you,
 Ilya Ganelin






 On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

 Agreed.  The Spark project and community that Vinod describes do not
 resemble the ones with which I am familiar.
 
 On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Hi Vinod,
 
   Thanks for your thoughts - However, I do not agree with your sentiment
  and implications. Spark is broadly quite an inclusive project and we
  spend a lot of effort culturally to help make newcomers feel welcome.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
  vino...@hortonworks.com wrote:
   Actually what this community got away with is pretty much an
  anti-pattern compared to every other Apache project I have seen. And
 may I
  say in a not so Apache way.
  
   Waiting for a committer to assign a patch to someone leaves it as a
  privilege to a committer. Not alluding to anything fishy in practice,
 but
  this also leaves a lot of open ground for self-interest. Committers
  defining notions of good fit / level of experience do not work, highly
  subjective and lead to group control.
  
   In terms of semantics, here is what most other projects (dare I say
  every Apache project?) that I have seen do
- A new contributor comes in who is not yet added to the JIRA
 project.
  He/she requests one of the project's JIRA admins to add him/her.
- After that, he or she is free to assign tickets to themselves.
- What this means
   -- Assigning a ticket to oneself is a signal to the rest of the
  community that he/she is actively working on the said patch.
    -- If multiple contributors want to work on the same patch, it needs
   to be resolved amicably through open communication. On JIRA, or on mailing
  lists. Not by the whim of a committer.
- Common issues
   -- Land grabbing: Other contributors can nudge him/her in case of
  inactivity and take them over. Again, amicably instead of a committer
  making subjective decisions.
   -- Progress stalling: One contributor assigns the ticket to
  himself/herself is actively debating but with no real code/docs
  contribution or with any real intention of making progress. Here
 workable,
  reviewable code for review usually wins.
  
   Assigning patches is not a privilege. Contributors at Apache are a
 bunch
  of volunteers, the PMC should let volunteers contribute as they see
 fit. We
  do not assign work at Apache.
  
   +Vinod
  
   On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
    One overarching issue is that it's pretty unclear what "Assigned to
    X" in JIRA means from a process perspective. Personally I actually
   feel it's better for this to be more historical - i.e. who ended up
   submitting a patch for this feature that was merged - rather than
   creating an exclusive reservation for a particular user to work on
   something.
  
   If an issue is assigned to person X, but some other person Y
 submits
   a great patch for it, I think we have some obligation to Spark users
   and to the community to merge the better patch. So the idea of
   reserving the right to add a feature, it just seems overall off to
 me.
   IMO, its fine if multiple people want to submit competing patches 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
I can get behind that point of view too. That's what I've told people
who expect Assignee to be a necessary part of the workflow. The existence
of a PR link is a signal that someone's working on it.

In that case we need not do anything.

On Wed, Apr 22, 2015 at 8:32 PM, Patrick Wendell pwend...@gmail.com wrote:
 One overarching issue is that it's pretty unclear what "Assigned to
 X" in JIRA means from a process perspective. Personally I actually
 feel it's better for this to be more historical - i.e. who ended up
 submitting a patch for this feature that was merged - rather than
 creating an exclusive reservation for a particular user to work on
 something.

 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature, it just seems overall off to me.
 IMO, its fine if multiple people want to submit competing patches for
 something, provided everyone comments on JIRA saying they are
 intending to submit a patch, and everyone understands there is
 duplicate effort. So commenting with an intention to submit a patch,
 IMO seems like the healthiest workflow since it is non exclusive.

 To me the main benefit of assigning something ahead of time is if
 you have a committer that really wants to see someone specific work on
 a patch, it just acts as a strong signal that there is someone
 endorsed to work on that patch. That doesn't mean no one else can
 submit a patch, but it is IMO more of a warning that there may be
 existing work which is likely to be high quality, to avoid duplicated
 effort.

 When it was really easy to assign features to themselves, I saw a lot
 of anti-patterns in the community that seemed unhealthy, specifically:

 - It was really unclear what it means semantically if someone is
 assigned to a JIRA.
 - People assign JIRA's to themselves that aren't a good fit, given the
  author's level of experience.
 - People expect if they assign JIRA's to themselves that others won't
 submit patches, and become upset if they do.
 - People are discouraged from working on a patch because someone else
 was officially assigned.

 - Patrick

 On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of people asking to set the Assignee
 field. This is currently restricted to Committers in JIRA. I know the
 logic was to prevent people from Assigning a JIRA and then leaving it;
 it also matters a bit for questions of credit.

 Still I wonder if it's best to just let people go ahead and set it, as
 the lesser evil. People can already do a lot like resolve JIRAs and
 set shepherd and critical priority and all that.

 I think the intent was to let Developers set this, but maybe due to
 an error, that's not how the current JIRA permission is implemented.

 I ask because I'm about to ping INFRA to update our scheme.






Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli
Actually, what this community got away with is pretty much an anti-pattern
compared to every other Apache project I have seen. And, may I say, in a
not-so-Apache way.

Waiting for a committer to assign a patch to someone leaves it as a privilege
of the committers. Not alluding to anything fishy in practice, but this also
leaves a lot of open ground for self-interest. Committers defining notions of
good fit / level of experience does not work; it is highly subjective and
leads to group control.

In terms of semantics, here is what most other projects that I have seen
(dare I say every Apache project?) do:
 - A new contributor comes in who is not yet added to the JIRA project. He/she 
requests one of the project's JIRA admins to add him/her.
 - After that, he or she is free to assign tickets to themselves.
 - What this means
-- Assigning a ticket to oneself is a signal to the rest of the community 
that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be
resolved amicably through open communication. On JIRA, or on mailing lists. Not
by the whim of a committer.
 - Common issues
-- Land grabbing: Other contributors can nudge him/her in case of 
inactivity and take them over. Again, amicably instead of a committer making 
subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself
and is actively debating, but with no real code/docs contribution or any real
intention of making progress. Here, workable, reviewable code usually
wins.

Assigning patches is not a privilege. Contributors at Apache are a bunch of
volunteers; the PMC should let volunteers contribute as they see fit. We do not
assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 One overarching issue is that it's pretty unclear what "Assigned to
 X" in JIRA means from a process perspective. Personally I actually
 feel it's better for this to be more historical - i.e. who ended up
 submitting a patch for this feature that was merged - rather than
 creating an exclusive reservation for a particular user to work on
 something.
 
 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature, it just seems overall off to me.
 IMO, its fine if multiple people want to submit competing patches for
 something, provided everyone comments on JIRA saying they are
 intending to submit a patch, and everyone understands there is
 duplicate effort. So commenting with an intention to submit a patch,
 IMO seems like the healthiest workflow since it is non exclusive.
 
 To me the main benefit of assigning something ahead of time is if
 you have a committer that really wants to see someone specific work on
 a patch, it just acts as a strong signal that there is someone
 endorsed to work on that patch. That doesn't mean no one else can
 submit a patch, but it is IMO more of a warning that there may be
 existing work which is likely to be high quality, to avoid duplicated
 effort.
 
 When it was really easy to assign features to themselves, I saw a lot
 of anti-patterns in the community that seemed unhealthy, specifically:
 
 - It was really unclear what it means semantically if someone is
 assigned to a JIRA.
 - People assign JIRA's to themselves that aren't a good fit, given the
  author's level of experience.
 - People expect if they assign JIRA's to themselves that others won't
 submit patches, and become upset if they do.
 - People are discouraged from working on a patch because someone else
 was officially assigned.
 
 - Patrick
 
 On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of people asking to set the Assignee
 field. This is currently restricted to Committers in JIRA. I know the
 logic was to prevent people from Assigning a JIRA and then leaving it;
 it also matters a bit for questions of credit.
 
 Still I wonder if it's best to just let people go ahead and set it, as
 the lesser evil. People can already do a lot like resolve JIRAs and
 set shepherd and critical priority and all that.
 
 I think the intent was to let Developers set this, but maybe due to
 an error, that's not how the current JIRA permission is implemented.
 
 I ask because I'm about to ping INFRA to update our scheme.
 
 
 
 



Re: Should we let everyone set Assignee?

2015-04-22 Thread Reynold Xin
Whoa, hold on a minute.

Spark has been among the projects that are the most welcoming to new
contributors. And thanks to this, the sheer number of activities in Spark
is much larger than in other projects, and our workflow has to accommodate
this fact.

In practice, people just create pull requests on GitHub, which is a newer,
friendlier & better model given the constraints. We even have tools that
automatically tag a ticket with a link to its pull requests.


On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 Actually what this community got away with is pretty much an anti-pattern
 compared to every other Apache project I have seen. And may I say in a not
 so Apache way.

 Waiting for a committer to assign a patch to someone leaves it as a
 privilege to a committer. Not alluding to anything fishy in practice, but
 this also leaves a lot of open ground for self-interest. Committers
 defining notions of good fit / level of experience do not work, highly
 subjective and lead to group control.

 In terms of semantics, here is what most other projects (dare I say every
 Apache project?) that I have seen do
  - A new contributor comes in who is not yet added to the JIRA project.
 He/she requests one of the project's JIRA admins to add him/her.
  - After that, he or she is free to assign tickets to themselves.
  - What this means
 -- Assigning a ticket to oneself is a signal to the rest of the
 community that he/she is actively working on the said patch.
 -- If multiple contributors want to work on the same patch, it needs
 to be resolved amicably through open communication. On JIRA, or on mailing
 lists. Not by the whim of a committer.
  - Common issues
 -- Land grabbing: Other contributors can nudge him/her in case of
 inactivity and take them over. Again, amicably instead of a committer
 making subjective decisions.
 -- Progress stalling: One contributor assigns the ticket to
 himself/herself is actively debating but with no real code/docs
 contribution or with any real intention of making progress. Here workable,
 reviewable code for review usually wins.

 Assigning patches is not a privilege. Contributors at Apache are a bunch
 of volunteers, the PMC should let volunteers contribute as they see fit. We
 do not assign work at Apache.

 +Vinod

 On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

  One overarching issue is that it's pretty unclear what "Assigned to
  X" in JIRA means from a process perspective. Personally I actually
  feel it's better for this to be more historical - i.e. who ended up
  submitting a patch for this feature that was merged - rather than
  creating an exclusive reservation for a particular user to work on
  something.
 
  If an issue is assigned to person X, but some other person Y submits
  a great patch for it, I think we have some obligation to Spark users
  and to the community to merge the better patch. So the idea of
  reserving the right to add a feature just seems overall off to me.
  IMO, it's fine if multiple people want to submit competing patches for
  something, provided everyone comments on JIRA saying they are
  intending to submit a patch, and everyone understands there is
  duplicate effort. So commenting with an intention to submit a patch,
  IMO seems like the healthiest workflow since it is non exclusive.
 
  To me the main benefit of assigning something ahead of time is if
  you have a committer that really wants to see someone specific work on
  a patch, it just acts as a strong signal that there is someone
  endorsed to work on that patch. That doesn't mean no one else can
  submit a patch, but it is IMO more of a warning that there may be
  existing work which is likely to be high quality, to avoid duplicated
  effort.
 
  When it was really easy for people to assign features to themselves, I
  saw a lot of anti-patterns in the community that seemed unhealthy,
  specifically:
 
  - It was really unclear what it means semantically if someone is
  assigned to a JIRA.
  - People assign JIRAs to themselves that aren't a good fit, given the
  author's level of experience.
  - People expect if they assign JIRAs to themselves that others won't
  submit patches, and become upset if they do.
  - People are discouraged from working on a patch because someone else
  was officially assigned.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
  Anecdotally, there are a number of people asking to set the Assignee
  field. This is currently restricted to Committers in JIRA. I know the
  logic was to prevent people from Assigning a JIRA and then leaving it;
  it also matters a bit for questions of credit.
 
  Still I wonder if it's best to just let people go ahead and set it, as
  the lesser evil. People can already do a lot like resolve JIRAs and
  set shepherd and critical priority and all that.
 
  I think the intent was to let Developers set this, but maybe due to
  an 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Patrick Wendell
Hi Vinod,

Thanks for your thoughts - However, I do not agree with your sentiment
and implications. Spark is broadly quite an inclusive project and we
spend a lot of effort culturally to help make newcomers feel welcome.

- Patrick

On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:
 Actually what this community got away with is pretty much an anti-pattern 
 compared to every other Apache project I have seen. And may I say in a not so 
 Apache way.

 Waiting for a committer to assign a patch to someone leaves it as a privilege 
 to a committer. Not alluding to anything fishy in practice, but this also 
 leaves a lot of open ground for self-interest. Committers defining notions of 
 good fit / level of experience does not work; it is highly subjective and 
 leads to group control.

 In terms of semantics, here is what most other projects (dare I say every 
 Apache project?) that I have seen do
  - A new contributor comes in who is not yet added to the JIRA project. 
 He/she requests one of the project's JIRA admins to add him/her.
  - After that, he or she is free to assign tickets to themselves.
  - What this means
 -- Assigning a ticket to oneself is a signal to the rest of the community 
 that he/she is actively working on the said patch.
 -- If multiple contributors want to work on the same patch, it needs to 
 be resolved amicably through open communication. On JIRA, or on mailing lists. 
 Not by the whim of a committer.
  - Common issues
 -- Land grabbing: Other contributors can nudge him/her in case of 
 inactivity and take them over. Again, amicably instead of a committer making 
 subjective decisions.
 -- Progress stalling: One contributor assigns the ticket to 
 himself/herself and keeps actively debating, but with no real code/docs 
 contribution or any real intention of making progress. Here, workable, 
 reviewable code usually wins.

 Assigning patches is not a privilege. Contributors at Apache are a bunch of 
 volunteers, the PMC should let volunteers contribute as they see fit. We do 
 not assign work at Apache.

 +Vinod

 On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 One overarching issue is that it's pretty unclear what Assigned to
 X in JIRA means from a process perspective. Personally I actually
 feel it's better for this to be more historical - i.e. who ended up
 submitting a patch for this feature that was merged - rather than
 creating an exclusive reservation for a particular user to work on
 something.

 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature just seems overall off to me.
 IMO, it's fine if multiple people want to submit competing patches for
 something, provided everyone comments on JIRA saying they are
 intending to submit a patch, and everyone understands there is
 duplicate effort. So commenting with an intention to submit a patch,
 IMO seems like the healthiest workflow since it is non exclusive.

 To me the main benefit of assigning something ahead of time is if
 you have a committer that really wants to see someone specific work on
 a patch, it just acts as a strong signal that there is someone
 endorsed to work on that patch. That doesn't mean no one else can
 submit a patch, but it is IMO more of a warning that there may be
 existing work which is likely to be high quality, to avoid duplicated
 effort.

 When it was really easy for people to assign features to themselves, I saw
 a lot of anti-patterns in the community that seemed unhealthy, specifically:

 - It was really unclear what it means semantically if someone is
 assigned to a JIRA.
 - People assign JIRAs to themselves that aren't a good fit, given the
 author's level of experience.
 - People expect if they assign JIRAs to themselves that others won't
 submit patches, and become upset if they do.
 - People are discouraged from working on a patch because someone else
 was officially assigned.

 - Patrick

 On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of people asking to set the Assignee
 field. This is currently restricted to Committers in JIRA. I know the
 logic was to prevent people from Assigning a JIRA and then leaving it;
 it also matters a bit for questions of credit.

 Still I wonder if it's best to just let people go ahead and set it, as
 the lesser evil. People can already do a lot like resolve JIRAs and
 set shepherd and critical priority and all that.

 I think the intent was to let Developers set this, but maybe due to
 an error, that's not how the current JIRA permission is implemented.

 I ask because I'm about to ping INFRA to update our scheme.


Re: Should we let everyone set Assignee?

2015-04-22 Thread Mark Hamstra
Agreed.  The Spark project and community that Vinod describes do not
resemble the ones with which I am familiar.

On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi Vinod,

 Thanks for your thoughts - However, I do not agree with your sentiment
 and implications. Spark is broadly quite an inclusive project and we
 spend a lot of effort culturally to help make newcomers feel welcome.

 - Patrick

 On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
 vino...@hortonworks.com wrote:
  Actually what this community got away with is pretty much an
 anti-pattern compared to every other Apache project I have seen. And may I
 say in a not so Apache way.
 
  Waiting for a committer to assign a patch to someone leaves it as a
 privilege to a committer. Not alluding to anything fishy in practice, but
 this also leaves a lot of open ground for self-interest. Committers
 defining notions of good fit / level of experience does not work; it is
 highly subjective and leads to group control.
 
  In terms of semantics, here is what most other projects (dare I say
 every Apache project?) that I have seen do
   - A new contributor comes in who is not yet added to the JIRA project.
 He/she requests one of the project's JIRA admins to add him/her.
   - After that, he or she is free to assign tickets to themselves.
   - What this means
  -- Assigning a ticket to oneself is a signal to the rest of the
 community that he/she is actively working on the said patch.
  -- If multiple contributors want to work on the same patch, it needs
 to be resolved amicably through open communication. On JIRA, or on mailing
 lists. Not by the whim of a committer.
   - Common issues
  -- Land grabbing: Other contributors can nudge him/her in case of
 inactivity and take them over. Again, amicably instead of a committer
 making subjective decisions.
  -- Progress stalling: One contributor assigns the ticket to
 himself/herself and keeps actively debating, but with no real code/docs
 contribution or any real intention of making progress. Here, workable,
 reviewable code usually wins.
 
  Assigning patches is not a privilege. Contributors at Apache are a bunch
 of volunteers, the PMC should let volunteers contribute as they see fit. We
 do not assign work at Apache.
 
  +Vinod
 
  On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  One overarching issue is that it's pretty unclear what Assigned to
  X in JIRA means from a process perspective. Personally I actually
  feel it's better for this to be more historical - i.e. who ended up
  submitting a patch for this feature that was merged - rather than
  creating an exclusive reservation for a particular user to work on
  something.
 
  If an issue is assigned to person X, but some other person Y submits
  a great patch for it, I think we have some obligation to Spark users
  and to the community to merge the better patch. So the idea of
  reserving the right to add a feature just seems overall off to me.
  IMO, it's fine if multiple people want to submit competing patches for
  something, provided everyone comments on JIRA saying they are
  intending to submit a patch, and everyone understands there is
  duplicate effort. So commenting with an intention to submit a patch,
  IMO seems like the healthiest workflow since it is non exclusive.
 
  To me the main benefit of assigning something ahead of time is if
  you have a committer that really wants to see someone specific work on
  a patch, it just acts as a strong signal that there is someone
  endorsed to work on that patch. That doesn't mean no one else can
  submit a patch, but it is IMO more of a warning that there may be
  existing work which is likely to be high quality, to avoid duplicated
  effort.
 
  When it was really easy for people to assign features to themselves, I
  saw a lot of anti-patterns in the community that seemed unhealthy,
  specifically:
 
  - It was really unclear what it means semantically if someone is
  assigned to a JIRA.
  - People assign JIRAs to themselves that aren't a good fit, given the
  author's level of experience.
  - People expect if they assign JIRAs to themselves that others won't
  submit patches, and become upset if they do.
  - People are discouraged from working on a patch because someone else
  was officially assigned.
 
  - Patrick
 
  On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
  Anecdotally, there are a number of people asking to set the Assignee
  field. This is currently restricted to Committers in JIRA. I know the
  logic was to prevent people from Assigning a JIRA and then leaving it;
  it also matters a bit for questions of credit.
 
  Still I wonder if it's best to just let people go ahead and set it, as
  the lesser evil. People can already do a lot like resolve JIRAs and
  set shepherd and critical priority and all that.
 
  I think the intent was to let Developers set this, but maybe due to
  an error, that's 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Sean Owen
I think you misread the thread, since that's the opposite of what
Patrick suggested. He's suggesting that *nobody ever waits* to be
assigned a JIRA to work on it; that anyone may work on a JIRA without
waiting for it to be assigned.

The point is: assigning JIRAs discourages others from doing work and
we don't want to do that. So the pattern so far has been to not use it
(except retroactively, to credit the major contributor to the
resolution).

The cost of this policy is -- oops, maybe you work on something that's
already being worked on. That isn't a problem in practice. We already
have a way to signal that you're working on a patch: you open a PR. It
automatically links to JIRA. Or you can just comment.

I suppose you could also use Assignee as a strong signal that you're
working on it, and some people want to do that, and so I was floating
the idea of just letting people use it as they like. But I also back
the idea of not having a notion of an owner for working on a JIRA.

On Wed, Apr 22, 2015 at 9:11 PM, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:
 Actually what this community got away with is pretty much an anti-pattern 
 compared to every other Apache project I have seen. And may I say in a not so 
 Apache way.

 Waiting for a committer to assign a patch to someone leaves it as a privilege 
 to a committer. Not alluding to anything fishy in practice, but this also 
 leaves a lot of open ground for self-interest. Committers defining notions of 
 good fit / level of experience does not work; it is highly subjective and 
 leads to group control.

 In terms of semantics, here is what most other projects (dare I say every 
 Apache project?) that I have seen do
  - A new contributor comes in who is not yet added to the JIRA project. 
 He/she requests one of the project's JIRA admins to add him/her.
  - After that, he or she is free to assign tickets to themselves.
  - What this means
 -- Assigning a ticket to oneself is a signal to the rest of the community 
 that he/she is actively working on the said patch.
 -- If multiple contributors want to work on the same patch, it needs to 
 be resolved amicably through open communication. On JIRA, or on mailing lists. 
 Not by the whim of a committer.
  - Common issues
 -- Land grabbing: Other contributors can nudge him/her in case of 
 inactivity and take them over. Again, amicably instead of a committer making 
 subjective decisions.
 -- Progress stalling: One contributor assigns the ticket to 
 himself/herself and keeps actively debating, but with no real code/docs 
 contribution or any real intention of making progress. Here, workable, 
 reviewable code usually wins.

 Assigning patches is not a privilege. Contributors at Apache are a bunch of 
 volunteers, the PMC should let volunteers contribute as they see fit. We do 
 not assign work at Apache.

 +Vinod

 On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 One overarching issue is that it's pretty unclear what Assigned to
 X in JIRA means from a process perspective. Personally I actually
 feel it's better for this to be more historical - i.e. who ended up
 submitting a patch for this feature that was merged - rather than
 creating an exclusive reservation for a particular user to work on
 something.

 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature just seems overall off to me.
 IMO, it's fine if multiple people want to submit competing patches for
 something, provided everyone comments on JIRA saying they are
 intending to submit a patch, and everyone understands there is
 duplicate effort. So commenting with an intention to submit a patch,
 IMO seems like the healthiest workflow since it is non exclusive.

 To me the main benefit of assigning something ahead of time is if
 you have a committer that really wants to see someone specific work on
 a patch, it just acts as a strong signal that there is someone
 endorsed to work on that patch. That doesn't mean no one else can
 submit a patch, but it is IMO more of a warning that there may be
 existing work which is likely to be high quality, to avoid duplicated
 effort.

 When it was really easy for people to assign features to themselves, I saw
 a lot of anti-patterns in the community that seemed unhealthy, specifically:

 - It was really unclear what it means semantically if someone is
 assigned to a JIRA.
 - People assign JIRAs to themselves that aren't a good fit, given the
 author's level of experience.
 - People expect if they assign JIRAs to themselves that others won't
 submit patches, and become upset if they do.
 - People are discouraged from working on a patch because someone else
 was officially assigned.

 - Patrick

 On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Vinod Kumar Vavilapalli

If what you say is true, what is the reason for this 
committer-only-assigns-JIRA-tickets policy? If anyone can send a pull request, 
anyone should be able to assign tickets to himself/herself too.

+Vinod

On Apr 22, 2015, at 1:18 PM, Reynold Xin r...@databricks.com wrote:

Whoa, hold on a minute.

Spark has been among the projects that are the most welcoming to new 
contributors. And thanks to this, the sheer number of activities in Spark is 
much larger than other projects, and our workflow has to accommodate this fact.

In practice, people just create pull requests on GitHub, which is a newer & 
friendlier & better model given the constraints. We even have tools that 
automatically tag a ticket with a link to the pull requests.


On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:
Actually what this community got away with is pretty much an anti-pattern 
compared to every other Apache project I have seen. And may I say in a not so 
Apache way.

Waiting for a committer to assign a patch to someone leaves it as a privilege 
to a committer. Not alluding to anything fishy in practice, but this also 
leaves a lot of open ground for self-interest. Committers defining notions of 
good fit / level of experience does not work; it is highly subjective and 
leads to group control.

In terms of semantics, here is what most other projects (dare I say every 
Apache project?) that I have seen do
 - A new contributor comes in who is not yet added to the JIRA project. He/she 
requests one of the project's JIRA admins to add him/her.
 - After that, he or she is free to assign tickets to themselves.
 - What this means
-- Assigning a ticket to oneself is a signal to the rest of the community 
that he/she is actively working on the said patch.
-- If multiple contributors want to work on the same patch, it needs to be 
resolved amicably through open communication. On JIRA, or on mailing lists. Not 
by the whim of a committer.
 - Common issues
-- Land grabbing: Other contributors can nudge him/her in case of 
inactivity and take them over. Again, amicably instead of a committer making 
subjective decisions.
-- Progress stalling: One contributor assigns the ticket to himself/herself 
and keeps actively debating, but with no real code/docs contribution or any 
real intention of making progress. Here, workable, reviewable code usually 
wins.

Assigning patches is not a privilege. Contributors at Apache are a bunch of 
volunteers, the PMC should let volunteers contribute as they see fit. We do not 
assign work at Apache.

+Vinod

On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 One overarching issue is that it's pretty unclear what Assigned to
 X in JIRA means from a process perspective. Personally I actually
 feel it's better for this to be more historical - i.e. who ended up
 submitting a patch for this feature that was merged - rather than
 creating an exclusive reservation for a particular user to work on
 something.

 If an issue is assigned to person X, but some other person Y submits
 a great patch for it, I think we have some obligation to Spark users
 and to the community to merge the better patch. So the idea of
 reserving the right to add a feature just seems overall off to me.
 IMO, it's fine if multiple people want to submit competing patches for
 something, provided everyone comments on JIRA saying they are
 intending to submit a patch, and everyone understands there is
 duplicate effort. So commenting with an intention to submit a patch,
 IMO seems like the healthiest workflow since it is non exclusive.

 To me the main benefit of assigning something ahead of time is if
 you have a committer that really wants to see someone specific work on
 a patch, it just acts as a strong signal that there is someone
 endorsed to work on that patch. That doesn't mean no one else can
 submit a patch, but it is IMO more of a warning that there may be
 existing work which is likely to be high quality, to avoid duplicated
 effort.

 When it was really easy for people to assign features to themselves, I saw
 a lot of anti-patterns in the community that seemed unhealthy, specifically:

 - It was really unclear what it means semantically if someone is
 assigned to a JIRA.
 - People assign JIRAs to themselves that aren't a good fit, given the
 author's level of experience.
 - People expect if they assign JIRAs to themselves that others won't
 submit patches, and become upset if they do.
 - People are discouraged from working on a patch because someone else
 was officially assigned.

 - Patrick

 On Wed, Apr 22, 2015 at 11:13 AM, Sean Owen so...@cloudera.com wrote:
 Anecdotally, there are a number of people asking to set the Assignee
 field. This is currently restricted to Committers in JIRA. I know the
 logic was to prevent people from Assigning a JIRA and 

Re: Should we let everyone set Assignee?

2015-04-22 Thread Ganelin, Ilya
As a contributor, I've never felt shut out from the Spark community, nor
have I seen any examples of territorial behavior. A few times I've
expressed interest in more challenging work and the response I received
was generally "go ahead and give it a shot, just understand that this is
sensitive code so we may end up modifying the PR substantially." Honestly,
that seems fine, and in general, I think it's completely fair to go with
the PR model - e.g. if a JIRA has an open PR then it's an active effort,
otherwise it's fair game unless otherwise stated. At the end of the day,
it's about moving the project forward, and the only way to do that is to
have actual code in the pipes - speculation and intent don't really help,
and there's nothing preventing an interested party from submitting a PR
against an issue. 

Thank you, 
Ilya Ganelin






On 4/22/15, 1:25 PM, Mark Hamstra m...@clearstorydata.com wrote:

Agreed.  The Spark project and community that Vinod describes do not
resemble the ones with which I am familiar.

On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Hi Vinod,

  Thanks for your thoughts - However, I do not agree with your sentiment
 and implications. Spark is broadly quite an inclusive project and we
 spend a lot of effort culturally to help make newcomers feel welcome.

 - Patrick

 On Wed, Apr 22, 2015 at 1:11 PM, Vinod Kumar Vavilapalli
 vino...@hortonworks.com wrote:
  Actually what this community got away with is pretty much an
 anti-pattern compared to every other Apache project I have seen. And
may I
 say in a not so Apache way.
 
  Waiting for a committer to assign a patch to someone leaves it as a
 privilege to a committer. Not alluding to anything fishy in practice,
but
 this also leaves a lot of open ground for self-interest. Committers
 defining notions of good fit / level of experience does not work; it is
 highly subjective and leads to group control.
 
  In terms of semantics, here is what most other projects (dare I say
 every Apache project?) that I have seen do
   - A new contributor comes in who is not yet added to the JIRA
project.
 He/she requests one of the project's JIRA admins to add him/her.
   - After that, he or she is free to assign tickets to themselves.
   - What this means
  -- Assigning a ticket to oneself is a signal to the rest of the
 community that he/she is actively working on the said patch.
  -- If multiple contributors want to work on the same patch, it needs
 to be resolved amicably through open communication. On JIRA, or on mailing
 lists. Not by the whim of a committer.
   - Common issues
  -- Land grabbing: Other contributors can nudge him/her in case of
 inactivity and take them over. Again, amicably instead of a committer
 making subjective decisions.
  -- Progress stalling: One contributor assigns the ticket to
 himself/herself and keeps actively debating, but with no real code/docs
 contribution or any real intention of making progress. Here, workable,
 reviewable code usually wins.
 
  Assigning patches is not a privilege. Contributors at Apache are a
bunch
 of volunteers, the PMC should let volunteers contribute as they see
fit. We
 do not assign work at Apache.
 
  +Vinod
 
  On Apr 22, 2015, at 12:32 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  One overarching issue is that it's pretty unclear what Assigned to
  X in JIRA means from a process perspective. Personally I actually
  feel it's better for this to be more historical - i.e. who ended up
  submitting a patch for this feature that was merged - rather than
  creating an exclusive reservation for a particular user to work on
  something.
 
  If an issue is assigned to person X, but some other person Y
submits
  a great patch for it, I think we have some obligation to Spark users
  and to the community to merge the better patch. So the idea of
  reserving the right to add a feature just seems overall off to me.
  IMO, it's fine if multiple people want to submit competing patches for
  something, provided everyone comments on JIRA saying they are
  intending to submit a patch, and everyone understands there is
  duplicate effort. So commenting with an intention to submit a patch,
  IMO seems like the healthiest workflow since it is non exclusive.
 
  To me the main benefit of assigning something ahead of time is if
  you have a committer that really wants to see someone specific work
on
  a patch, it just acts as a strong signal that there is someone
  endorsed to work on that patch. That doesn't mean no one else can
  submit a patch, but it is IMO more of a warning that there may be
  existing work which is likely to be high quality, to avoid duplicated
  effort.
 
  When it was really easy for people to assign features to themselves, I
  saw a lot of anti-patterns in the community that seemed unhealthy,
  specifically:
 
  - It was really unclear what it means semantically if someone is
  assigned to a JIRA.
  - People assign JIRAs to themselves that 

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Akhil Das
There were some PRs about graphical representations with D3.js; you can
find them on GitHub. Here are a few of them:
https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=d3

Thanks
Best Regards

On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 Dear Spark devs,

 Would people find it useful to have a graphical display of metrics (such as
 duration, GC time, etc) on the application UI page? Has anybody worked on
 this before?

 Punya



Re: Addition of new Metrics for killed executors.

2015-04-22 Thread twinkle sachdeva
Hi,

Looks interesting.

It would be interesting to know the reason for not showing these stats in
the UI.

The description by Patrick W in
https://spark-project.atlassian.net/browse/SPARK-999 does not mention
anything specific w.r.t. failed tasks/executors.

Can somebody please comment on whether this is a bug or intended behaviour,
w.r.t. performance or some other bottleneck?

--Twinkle




On Mon, Apr 20, 2015 at 2:47 PM, Archit Thakur archit279tha...@gmail.com
wrote:

 Hi Twinkle,

 We have a use case where we want to debug how and why an
 executor got killed.
 It could be because of a stack overflow, GC, or any other unexpected scenario.
 The driver UI has no information about killed
 executors, so I was just curious how people usually debug those things
 apart from scanning the logs. The metrics we are planning
 to add are similar to what we have for non-killed executors - [data per
 stage specifically] - numFailedTasks, executorRunTime, inputBytes,
 memoryBytesSpilled .. etc.

 Apart from that we also intend to add all the information present in the
 executor tab for running executors.

 Thanks,
 Archit Thakur.

 On Mon, Apr 20, 2015 at 1:31 PM, twinkle sachdeva 
 twinkle.sachd...@gmail.com wrote:

 Hi Archit,

 What is your use case and what kind of metrics are you planning to add?

 Thanks,
 Twinkle

 On Fri, Apr 17, 2015 at 4:07 PM, Archit Thakur archit279tha...@gmail.com
  wrote:

 Hi,

 We are planning to add new metrics in Spark for the executors that got
 killed during the execution. I was just curious why this info is not already
 present. Is there some reason for not adding it?
 Any ideas around are welcome.

 Thanks and Regards,
 Archit Thakur.






Re: Graphical display of metrics on application UI page

2015-04-22 Thread Punyashloka Biswal
Thanks for the pointers! It looks like others are pretty active on this so
I'll comment on those PRs and try to coordinate before starting any new
work.
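
For anyone prototyping such a view, the raw numbers are already exposed as
JSON by the driver's MetricsServlet, which is mounted by default at
/metrics/json/ on the application UI port. A minimal sketch - the host,
port, and the idea of polling that endpoint for graphing are assumptions on
my part, not an official charting API:

import json
import urllib2

# Poll the driver's MetricsServlet (assumed at the default /metrics/json/
# path on the application UI, port 4040) and print the gauges one might
# plot over time (durations, memory usage, etc.).
METRICS_URL = "http://localhost:4040/metrics/json/"

snapshot = json.load(urllib2.urlopen(METRICS_URL))
for name, gauge in sorted(snapshot.get("gauges", {}).items()):
    print("%s = %s" % (name, gauge.get("value")))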

Punya
On Wed, Apr 22, 2015 at 2:49 AM Akhil Das ak...@sigmoidanalytics.com
wrote:

 There were some PRs about graphical representations with D3.js; you can
 find them on GitHub. Here are a few of them:
 https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=d3

 Thanks
 Best Regards

 On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal 
 punya.bis...@gmail.com wrote:

 Dear Spark devs,

 Would people find it useful to have a graphical display of metrics (such
 as
 duration, GC time, etc) on the application UI page? Has anybody worked on
 this before?

 Punya





Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
Anyone?

On Wed, Apr 22, 2015 at 12:29 PM, Sourav Chandra 
sourav.chan...@livestream.com wrote:

 Hi Olivier,

 the update function is as below:

 val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long,
 Long)]) => {
   val previousCount = state.getOrElse((0L, 0L))._2
   var startValue: IConcurrentUsers = ConcurrentViewers(0)
   var currentCount = 0L
   val lastIndexOfConcurrentUsers =
     values.lastIndexWhere(_.isInstanceOf[ConcurrentViewers])
   val subList = values.slice(0, lastIndexOfConcurrentUsers)
   val currentCountFromSubList =
     subList.foldLeft(startValue)(_ op _).count + previousCount
   val lastConcurrentViewersCount = values(lastIndexOfConcurrentUsers).count

   if (math.abs(lastConcurrentViewersCount - currentCountFromSubList) >= 1) {
     logger.error(
       s"Count using state update $currentCountFromSubList, " +
       s"ConcurrentUsers count $lastConcurrentViewersCount" +
       s" resetting to $lastConcurrentViewersCount"
     )
     currentCount = lastConcurrentViewersCount
   }
   val remainingValuesList = values.diff(subList)
   startValue = ConcurrentViewers(currentCount)
   currentCount = remainingValuesList.foldLeft(startValue)(_ op _).count

   if (currentCount < 0) {
     logger.error(
       s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
     )
     currentCount = 0
   }
   // to stop pushing subsequent 0 after receiving first 0
   if (currentCount == 0 && previousCount == 0) None
   else Some(previousCount, currentCount)
 }

 trait IConcurrentUsers {
   val count: Long
   def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
 }

 object IConcurrentUsers {
   def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers =
     (a, b) match {
       case (_, _: ConcurrentViewers) =>
         ConcurrentViewers(b.count)
       case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
         ConcurrentViewers(a.count + b.count)
       case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
         ConcurrentViewers(a.count - b.count)
     }
 }

 case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
 case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
 case class ConcurrentViewers(count: Long) extends IConcurrentUsers


 also the error stack trace copied from executor logs is:

 java.lang.OutOfMemoryError: Java heap space
 at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
 at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
 at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
 at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
 at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
 at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
 at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
 at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
 at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Olivier Girardot
Where should this *coalesce* come from? Is it related to the partition
manipulation coalesce method?
Thanks!

On Mon, Apr 20, 2015 at 10:48 PM, Reynold Xin r...@databricks.com wrote:

 Ah ic. You can do something like


 df.select(coalesce(df("a"), lit(0.0)))

 On Mon, Apr 20, 2015 at 1:44 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 From PySpark it seems to me that the fillna is relying on Java/Scala
 code, that's why I was wondering.
 Thank you for answering :)

 On Mon, Apr 20, 2015 at 10:22 PM, Reynold Xin r...@databricks.com wrote:

 You can just create a fillna function based on the 1.3.1 implementation of
 fillna, no?


 On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 a UDF might be a good idea, no?

 On Mon, Apr 20, 2015 at 11:17 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

  Hi everyone,
  let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
 API
  in PySpark, is there any efficient alternative to mapping the records
  myself ?
 
  Regards,
 
  Olivier.
 






Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Olivier Girardot
I think I found the Coalesce you were talking about, but it is a Catalyst
class that I think is not available from PySpark.
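
One workaround that stays within the 1.3.0 Python API is a plain Python UDF;
a minimal sketch, assuming a DataFrame with a nullable double column (the
column name "a" and the fill value 0.0 are purely illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

sc = SparkContext(appName="fillna-workaround")
sqlContext = SQLContext(sc)

# Toy DataFrame with a null in column "a".
df = sqlContext.createDataFrame([(1.0,), (None,), (3.0,)], ["a"])

# DataFrame.fillna only landed in 1.3.1, so emulate it with a UDF that
# replaces nulls with 0.0 and leaves everything else untouched.
fill_zero = udf(lambda v: 0.0 if v is None else v, DoubleType())
df.select(fill_zero(df["a"]).alias("a")).show()

This pays the usual Python UDF serialization cost, so it is only a stopgap
until an upgrade to 1.3.1+, where fillna is built in.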

Regards,

Olivier.

On Wed, Apr 22, 2015 at 11:56 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Where should this *coalesce* come from? Is it related to the partition
 manipulation coalesce method?
 Thanks!

 On Mon, Apr 20, 2015 at 10:48 PM, Reynold Xin r...@databricks.com wrote:

 Ah ic. You can do something like


  df.select(coalesce(df("a"), lit(0.0)))

 On Mon, Apr 20, 2015 at 1:44 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 From PySpark it seems to me that the fillna is relying on Java/Scala
 code, that's why I was wondering.
 Thank you for answering :)

  On Mon, Apr 20, 2015 at 10:22 PM, Reynold Xin r...@databricks.com wrote:

  You can just create a fillna function based on the 1.3.1 implementation
 of fillna, no?


 On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

  a UDF might be a good idea, no?

  On Mon, Apr 20, 2015 at 11:17 AM, Olivier Girardot 
  o.girar...@lateral-thoughts.com wrote:

  Hi everyone,
  let's assume I'm stuck in 1.3.0, how can I benefit from the *fillna*
 API
  in PySpark, is there any efficient alternative to mapping the records
  myself ?
 
  Regards,
 
  Olivier.
 






RE: Is spark-ec2 for production use?

2015-04-22 Thread nate
Calling it a replacement for production-ish use is a stretch; the UX just isn't 
there yet for the average end user wanting something push-button.

Until a little while ago the focus was heavily on infrastructure folks and people 
building their own distros.  The project is turning towards end users, so anyone 
from ops to dev/data-hacker will be able to extract value and get moving easily.

If you are brave enough to give it a go and start playing around with it in its 
current state you can start here looking at puppet modules readme:

https://github.com/apache/bigtop/tree/master/bigtop-deploy/puppet

It is currently limited (i.e. no YARN or Mesos variants, orchestration not added 
yet); things will be stepping up a great deal heading out of the 1.0 release.  If 
you do and run into stuff, hop on the mailing list; docs are another area where 
updating is needed.

Thanks for the pointers on the JSON feed link, definitely handy for some smoke tests.


-Original Message-
From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com] 
Sent: Tuesday, April 21, 2015 2:33 PM
To: n...@reactor8.com; Spark dev list
Subject: Re: Is spark-ec2 for production use?

Nate, could you point us to an example of how one would use Bigtop as a more 
production-ish replacement for spark-ec2? I took a look at the project page 
http://bigtop.apache.org/index.html, but couldn't find any usage examples. 
Perhaps we can link to them from the spark-ec2 docs.

Regarding tests to validate that Spark was set up correctly, I am using the 
JSON feed from the Spark master web UI 
http://stackoverflow.com/a/29659630/877069 for starters. Y'all might find it 
useful for the same purpose.
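
A bare-bones version of that kind of smoke test might look like the sketch 
below; the master hostname is a placeholder, and it assumes the standalone 
master's JSON status feed on port 8080 (the same feed the linked answer uses):

import json
import urllib2

# Fetch the standalone master's status feed and check that workers
# actually registered after the cluster came up.
MASTER_JSON = "http://ec2-master-host:8080/json"  # placeholder host

status = json.load(urllib2.urlopen(MASTER_JSON))
alive = [w for w in status.get("workers", []) if w.get("state") == "ALIVE"]
assert alive, "no ALIVE workers registered with the master"
print("cluster OK: %d alive workers, %d cores" %
      (len(alive), status.get("cores", 0)))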

Nick

On Tue, Apr 21, 2015 at 5:21 PM n...@reactor8.com wrote:

 Several of the Bigtop folks got together last week at ApacheCon; this 
 was a popular topic for upcoming enhancements to Spark-related components 
 after getting 1.0 out the door.  Some leading topics were:

 -deployment of spark specific clusters
  -spark standalone, hdfs
  -spark over yarn, hdfs
  -spark on mesos (talked to mesos folk about working to include in 
 bigtop post 1.0)
  -the above plus variants of other bigtop components (ie: kafka, 
 zeppelin, demo data generators)

 One thing the group would like some help on is tests for Spark 
 environments, so things can be validated post build/deploy and the 
 CI process enhanced; if you choose to deploy via Bigtop in test/prod/etc. 
 you know things have gone through a certain amount of rigor beforehand.

 Nate

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Tuesday, April 21, 2015 12:46 PM
 To: Nicholas Chammas
 Cc: Spark dev list
 Subject: Re: Is spark-ec2 for production use?

 It could be a good idea to document this a bit. The original goals 
 were to give people an easy way to get started with Spark and also to 
 provide a consistent environment for our own experiments and 
 benchmarking of Spark at the AMPLab. Over time I've noticed a huge 
 amount of scope increase in terms of what people want to do and I do 
 know that many companies run production infrastructure based on launching the 
 EC2 scripts.

 My feeling is that the general problem of deploying Spark with other 
 applications and frameworks is fairly well covered by projects which 
 specifically focus on packaging and automation (e.g. Whirr, BigTop, etc).
 So
 I'd like to see a narrower focus on just getting a vanilla Spark 
 cluster up and running and make it clear that customization and 
 extension of that functionality is really not in scope.

 This doesn't mean discouraging people from using it for production use 
 cases, but more that they shouldn't expect us to merge and maintain 
 things that seek to do broader integration with other technologies, 
 automation, etc.

 - Patrick

 On Tue, Apr 21, 2015 at 12:05 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
  Is spark-ec2 intended for spinning up production Spark clusters?
 
  I think the answer is no.
 
  However, the docs for spark-ec2
  https://spark.apache.org/docs/latest/ec2-scripts.html very much 
  leave that possibility open, and indeed I see many people asking 
  questions or opening issues that stem from some production use case 
  they are trying to fit spark-ec2 to.
 
  Here's the latest example
  https://issues.apache.org/jira/browse/SPARK-6900?focusedCommentId=1
  45 
  04236page=com.atlassian.jira.plugin.system.issuetabpanels:comment-t
  ab
  panel#comment-14504236
  of
  someone using spark-ec2 to power their (presumably) production service.
 
  Shouldn't we actively discourage people from using spark-ec2 in this way?
 
  I understand there's no stopping people from doing what they want 
  with it, and certainly the questions and issues we receive about 
  spark-ec2 are still valid, even if they stem from discouraged use cases.
 
  From what I understand, spark-ec2 is intended for quick 
  experimentation, one-off jobs, prototypes, and so forth.
 
  If that's the case, it's best to stress this in 

Re: Spark build time

2015-04-22 Thread Olivier Girardot
I agree, it's what I did :)
I was just wondering if it was considered a problem or something to work
on - I personally think so, because the feedback loop should be as quick as
possible - and whether there was someone I could help.

On Tue, Apr 21, 2015 at 10:20 PM, Reynold Xin r...@databricks.com wrote:

 It runs tons of integration tests. I think most developers just let
 Jenkins run the full suite of them.

 On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com
 wrote:

 Hi everyone,
 I was just wondering about the Spark full build time (including tests);
 1h48 seems to me quite... spacious. What's taking most of the time? Is the
 build mainly integration tests? Are there any roadmaps or JIRAs dedicated
 to this that we can chip in on?

 Regards,

 Olivier.





Pipeline in pyspark

2015-04-22 Thread Suraj Shetiya
Hi,

I came across documentation for creating a pipeline in the MLlib library of
PySpark. I wanted to know if something similar exists for PySpark input
transformations. I have a use case where my input files are in different
formats; I would like to convert them to RDDs, keep them in memory, and
perform certain custom tasks in a pipeline without writing anything back to
disk at any step. I came across Luigi (http://luigi.readthedocs.org/en/latest/),
but I found that it stores the contents on disk and reloads them for the
next phase of the pipeline.

-- 
Thanks and regards,
Suraj
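
One lightweight pattern for this, sketched here for reference: treat each
phase as a plain function from RDD to RDD and compose the functions yourself,
relying on cache() to keep intermediate results in memory. The stage
functions and the input path below are made-up illustrations, not a
prescribed API:

from pyspark import SparkContext

sc = SparkContext(appName="rdd-pipeline-sketch")

# Each stage is a plain function from RDD to RDD, so stages compose in
# memory without any intermediate writes to disk.
def parse_csv(rdd):
    return rdd.map(lambda line: line.split(","))

def keep_valid(rdd):
    return rdd.filter(lambda fields: len(fields) == 3)

def to_pairs(rdd):
    return rdd.map(lambda fields: (fields[0], float(fields[2])))

pipeline = [parse_csv, keep_valid, to_pairs]

rdd = sc.textFile("input.csv")  # placeholder path
for stage in pipeline:
    rdd = stage(rdd)
rdd.cache()  # keep the final result in memory for downstream custom tasks
print(rdd.take(5))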