spark git commit: [SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance

yliang Fri, 12 Aug 2016 10:07:06 -0700

Repository: spark
Updated Branches:
  refs/heads/master 79e2caa13 -> bbae20ade



[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve 
performance

## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than
```aggregate``` to improve performance and scalability. In my test on a dataset
with 200 features and 1M instances, performance improved by 20%.
In addition, the broadcast variable ```compute``` should be destroyed at the
end of each iteration.
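
As a rough illustration of why tree aggregation can help, here is a minimal plain-Scala sketch that models the merge pattern without any Spark dependency. It is not Spark's implementation: the object name, `flatAggregate`, `treeAggregate`, and the `fanIn` parameter are all hypothetical, and the per-partition partial results are assumed to be plain `Long` values rather than `ExpectationSum` objects.

```scala
// Simplified model of flat vs. tree aggregation (assumption: not Spark's code).
// `partials` stands in for the per-partition results produced by seqOp.
object TreeAggregateSketch {

  // Flat aggregation: the driver merges every partial result itself.
  // Returns (final result, number of partials merged at the driver).
  def flatAggregate(partials: Seq[Long]): (Long, Int) =
    (partials.sum, partials.size)

  // Tree aggregation: partials are merged in groups of `fanIn`, level by
  // level, until at most `fanIn` remain. In a real cluster the intermediate
  // merges would run on executors; only the last level reaches the driver.
  def treeAggregate(partials: Seq[Long], fanIn: Int): (Long, Int) = {
    require(fanIn >= 2, "fanIn must be at least 2")
    var level: Seq[Long] = partials
    while (level.size > fanIn) {
      level = level.grouped(fanIn).map(_.sum).toVector
    }
    (level.sum, level.size)
  }

  def main(args: Array[String]): Unit = {
    val partials = 1L to 64L // one hypothetical partial result per partition
    println(flatAggregate(partials))    // prints (2080,64)
    println(treeAggregate(partials, 8)) // prints (2080,8)
  }
}
```

Both strategies produce the same result, but in this model the driver merges 8 partial results instead of 64. With large partial results, such as the per-cluster sums for a 200-feature dataset, fewer merges at the driver is the likely source of the improvement.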

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #14621 from yanboliang/spark-17033.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bbae20ad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bbae20ad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bbae20ad

Branch: refs/heads/master
Commit: bbae20ade14e50541e4403ca7b45bf6c11695d15
Parents: 79e2caa
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Fri Aug 12 10:06:17 2016 -0700
Committer: Yanbo Liang <yblia...@gmail.com>
Committed: Fri Aug 12 10:06:17 2016 -0700

----------------------------------------------------------------------
 .../scala/org/apache/spark/mllib/clustering/GaussianMixture.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/bbae20ad/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index a214b1a..43193ad 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -198,7 +198,7 @@ class GaussianMixture private (
       val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
 
       // aggregate the cluster contribution for all sample points
-      val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+      val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
 
       // Create new distributions based on the partial assignments
       // (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
       llhp = llh // current becomes previous
       llh = sums.logLikelihood // this is the freshly computed log-likelihood
       iter += 1
+      compute.destroy(blocking = false)
     }
 
     new GaussianMixtureModel(weights, gaussians)


