[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49540356
  
Merging this in master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1447


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-18 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49401225
  
This looks okay to me as is, @rxin what do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-18 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49407083
  
Upmerged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49407225
  
QA tests have started for PR 1447. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16821/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49415317
  
QA results for PR 1447:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16821/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042631
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

This I'm sure has been done a million times, but one more for good luck -- 
here is a `map { case (x, y) = x + y}` versus `foreach { xy = xy._1 + xy._2}`:
http://www.diffchecker.com/vtv6cptx

No new virtual function calls, but one new branch (trivially branch 
predictable) and one new throw (which will never be invoked), and several 
store/loads. (Perhaps I'm misreading the bytecode, but isn't `astore_2` 
followed by `aload_2` a no-op?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042857
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

I think we tested this in the past and saw a difference, perhaps because it 
can now throw. But I don't remember the details. Weird that it's also doing 
those stores.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042897
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

There are all kinds of stuff that's really hard to profile at runtime. For 
example, the JVM might decide not to JIT the code because it is longer. It 
might decide not to do it because of exceptions. Branch prediction might stop 
working because it has too many branches. Etc.

Again, I'm only in favor of this change because it is small, and it might 
have impact. If it is a big ugly change, I would be against it without 
profiling.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042905
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), 
partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s, w3s) =
-  (vs.asInstanceOf[Seq[V]],
-w1s.asInstanceOf[Seq[W1]],
-w2s.asInstanceOf[Seq[W2]],
-w3s.asInstanceOf[Seq[W3]])
-}
+cg.mapValues { seq  = seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], 
Seq[W3])] }
--- End diff --

And yeah as Reynold pointed out, this won't actually cast. A Seq is not a 
tuple4, you need to make a new one. In this case it might be okay to leave the 
code as is, because the non-pattern-matching way will be hairier. Basically 
you'd have to do `seq = (seq(0).asInstanceOf[Seq[V]], 
seq(1).asInstanceOf[Seq[W1]], etc)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15043849
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), 
partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s, w3s) =
-  (vs.asInstanceOf[Seq[V]],
-w1s.asInstanceOf[Seq[W1]],
-w2s.asInstanceOf[Seq[W2]],
-w3s.asInstanceOf[Seq[W3]])
-}
+cg.mapValues { seq  = seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], 
Seq[W3])] }
--- End diff --

Ah, right.  I'll leave these as they are for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15043860
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

Also on removing the calls to _1,  these values should be in cache, so 
accesses will be really fast.  Will hold off on these for now, but happy to 
make the change if y'all want.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15044028
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) - it if k == key) {
--- End diff --

Wait, actually, my understanding of this loop is that it's iterating over 
every record within the partition.  Am I missing something? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15044062
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) - it if k == key) {
--- End diff --

Actually yes my bad :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49272425
  
QA tests have started for PR 1447. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1447#issuecomment-49287721
  
QA results for PR 1447:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/1447

SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section...

...s of CoGroupedRDD and PairRDDFunctions

This also removes an unnecessary tuple creation in cogroup.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-2519-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1447


commit 906e6b362db11a4ac5a3f8096668c6f968dc48aa
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-07-16T21:00:17Z

SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections 
of CoGroupedRDD and PairRDDFunctions




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037091
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

can we save the key into a variable here so we don't call _1 twice?

I was looking at the bytecode generated by pattern matching versus 
non-pattern-matching, and the main difference is that we bypass some 
unnecessary branches:
{code}
final def apply(x0$1: Tuple2): Unit = {
  case synthetic val x1: Tuple2 = x0$1;
  case4(){
if (x1.ne(null))
  {
val k: Int = x1._1$mcI$sp();
val v: Int = x1._2$mcI$sp();
matchEnd3({
  scala.this.Predef.println(scala.Int.box(k.+(v)));
  scala.runtime.BoxedUnit.UNIT
})
  }
else
  case5()
  };
  case5(){
matchEnd3(throw new MatchError(x1))
  };
  matchEnd3(x: runtime.BoxedUnit){
()
  }
{code}


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037098
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
+map.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
   }
   Iterator(map)
 }
 
 def mergeMaps(m1: JHashMap[K, V], m2: JHashMap[K, V]): JHashMap[K, V] 
= {
-  m2.foreach { case (k, v) =
-val old = m1.get(k)
-m1.put(k, if (old == null) v else func(old, v))
+  m2.foreach { pair =
+val old = m1.get(pair._1)
--- End diff --

again, create a variable to save the key


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037121
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -413,11 +413,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
* partition the output RDD.
*/
   def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): 
RDD[(K, (V, Option[W]))] = {
-this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =
-  if (ws.isEmpty) {
-vs.map(v = (v, None))
+this.cogroup(other, partitioner).flatMapValues { pair =
+  if (pair._2.isEmpty) {
--- End diff --

save the key and value, i.e.
```scala
this.cogroup(other, partitioner).flatMapValues { pair =
  val vs = pair._1
  val ws = pair._2
  ... same old code ...
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037134
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), 
partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s, w3s) =
-  (vs.asInstanceOf[Seq[V]],
-w1s.asInstanceOf[Seq[W1]],
-w2s.asInstanceOf[Seq[W2]],
-w3s.asInstanceOf[Seq[W3]])
-}
+cg.mapValues { seq  = seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], 
Seq[W3])] }
--- End diff --

extra space here. Does the type casting work?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037164
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) - it if k == key) {
--- End diff --

we can probably revert this change since this only works on the partitions 
rather than per record


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15037689
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 val index = p.getPartition(key)
 def process(it: Iterator[(K, V)]): Seq[V] = {
   val buf = new ArrayBuffer[V]
-  for ((k, v) - it if k == key) {
--- End diff --

Ah, misread


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15038052
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

If we're using functional style anyway, do we really expect to see any sort 
of remotely noticeable gain from doing these kinds of optimizations?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15038340
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

It's really hard to tell. I think it is really hard to tell when doing 
microbenchmarks because branch prediction takes care of the extra branches. But 
I don't know how this impacts real runtime in a real workload without a lot 
more instrumentation. My opinion is since it is a small change, it's ok to just 
do it this way. If it is a big refactoring, that's a very different story and 
deserves more profiling. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042032
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 
 def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] 
= {
   val map = new JHashMap[K, V]
-  iter.foreach { case (k, v) =
-val old = map.get(k)
-map.put(k, if (old == null) v else func(old, v))
+  iter.foreach { pair =
+val old = map.get(pair._1)
--- End diff --

Removing the call to _1 isn't hugely helpful IMO; accessor methods get 
inlined very fast. However, pattern matching adds a whole bunch of potential 
throw sites and branches and things like that, and we've noticed problems in 
the past.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042052
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), 
partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s, w3s) =
-  (vs.asInstanceOf[Seq[V]],
-w1s.asInstanceOf[Seq[W1]],
-w2s.asInstanceOf[Seq[W2]],
-w3s.asInstanceOf[Seq[W3]])
-}
+cg.mapValues { seq  = seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], 
Seq[W3])] }
--- End diff --

This should actually return Iterable, not Seq


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042054
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -589,9 +584,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
-cg.mapValues { case Seq(vs, ws) =
-  (vs.asInstanceOf[Seq[V]], ws.asInstanceOf[Seq[W]])
-}
+cg.mapValues { pair = pair.asInstanceOf[(Seq[V], Seq[W])] }
--- End diff --

Same here, should be Iterable (although the types work okay now, it's 
confusing)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1447#discussion_r15042061
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -604,11 +597,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   throw new SparkException(Default partitioner cannot partition array 
keys.)
 }
 val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
-cg.mapValues { case Seq(vs, w1s, w2s) =
-  (vs.asInstanceOf[Seq[V]],
-   w1s.asInstanceOf[Seq[W1]],
-   w2s.asInstanceOf[Seq[W2]])
-}
+cg.mapValues { seq  = seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2])] }
--- End diff --

Same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---