Repository: spark
Updated Branches:
  refs/heads/branch-2.0 5a71a0501 -> 7d9bd951b


[SPARK-16469] enhanced simulate multiply

## What changes were proposed in this pull request?

We have a use case of multiplying very big sparse matrices. we have about 
1000x1000 distributed block matrices multiplication and the simulate multiply 
goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it 
slightly with classical hashmap and now run in about 30 seconds O(n^2).

## How was this patch tested?

We have added a performance test and verified the reduced time.

Author: oraviv <ora...@paypal.com>

Closes #14068 from uzadude/master.

(cherry picked from commit ea06e4ef34c860219a9aeec81816ef53ada96253)
Signed-off-by: Sean Owen <so...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d9bd951
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d9bd951
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d9bd951

Branch: refs/heads/branch-2.0
Commit: 7d9bd951b0b5767ef2c95eb7467f35c9409e7d8c
Parents: 5a71a05
Author: oraviv <ora...@paypal.com>
Authored: Wed Jul 13 14:47:08 2016 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Jul 13 14:47:47 2016 +0100

----------------------------------------------------------------------
 .../spark/mllib/linalg/distributed/BlockMatrix.scala   | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/7d9bd951/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
index 639295c..9782350 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala
@@ -426,16 +426,21 @@ class BlockMatrix @Since("1.3.0") (
       partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
     val leftMatrix = blockInfo.keys.collect() // blockInfo should already be 
cached
     val rightMatrix = other.blocks.keys.collect()
+
+    val rightCounterpartsHelper = 
rightMatrix.groupBy(_._1).mapValues(_.map(_._2))
     val leftDestinations = leftMatrix.map { case (rowIndex, colIndex) =>
-      val rightCounterparts = rightMatrix.filter(_._1 == colIndex)
-      val partitions = rightCounterparts.map(b => 
partitioner.getPartition((rowIndex, b._2)))
+      val rightCounterparts = rightCounterpartsHelper.getOrElse(colIndex, 
Array())
+      val partitions = rightCounterparts.map(b => 
partitioner.getPartition((rowIndex, b)))
       ((rowIndex, colIndex), partitions.toSet)
     }.toMap
+
+    val leftCounterpartsHelper = 
leftMatrix.groupBy(_._2).mapValues(_.map(_._1))
     val rightDestinations = rightMatrix.map { case (rowIndex, colIndex) =>
-      val leftCounterparts = leftMatrix.filter(_._2 == rowIndex)
-      val partitions = leftCounterparts.map(b => 
partitioner.getPartition((b._1, colIndex)))
+      val leftCounterparts = leftCounterpartsHelper.getOrElse(rowIndex, 
Array())
+      val partitions = leftCounterparts.map(b => partitioner.getPartition((b, 
colIndex)))
       ((rowIndex, colIndex), partitions.toSet)
     }.toMap
+
     (leftDestinations, rightDestinations)
   }
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to