Repository: spark
Updated Branches:
  refs/heads/master 79e2caa13 -> bbae20ade

[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance

## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I observed a 20% performance improvement.
This patch also destroys the broadcast variable ```compute``` at the end of each iteration.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #14621 from yanboliang/spark-17033.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bbae20ad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bbae20ad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bbae20ad

Branch: refs/heads/master
Commit: bbae20ade14e50541e4403ca7b45bf6c11695d15
Parents: 79e2caa
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Fri Aug 12 10:06:17 2016 -0700
Committer: Yanbo Liang <yblia...@gmail.com>
Committed: Fri Aug 12 10:06:17 2016 -0700

----------------------------------------------------------------------
 .../scala/org/apache/spark/mllib/clustering/GaussianMixture.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/bbae20ad/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index a214b1a..43193ad 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -198,7 +198,7 @@ class GaussianMixture private (
       val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
 
       // aggregate the cluster contribution for all sample points
-      val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+      val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
 
       // Create new distributions based on the partial assignments
       // (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
       llhp = llh // current becomes previous
       llh = sums.logLikelihood // this is the freshly computed log-likelihood
       iter += 1
+      compute.destroy(blocking = false)
     }
 
     new GaussianMixtureModel(weights, gaussians)
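
For readers unfamiliar with the difference between the two calls, here is a minimal plain-Scala sketch (no Spark dependency) of why tree-style aggregation can reduce pressure on the final merge step. It is an illustration only, not Spark's implementation: the names `TreeAggregateSketch`, `partials`, `flatCombine`, `treeCombine`, and `fanIn` are hypothetical, and Spark derives its fan-in from the `depth` parameter of `treeAggregate` rather than taking it directly. With `aggregate`, every per-partition result is merged at the driver. With `treeAggregate`, partial results are first merged in intermediate levels, so the final step sees far fewer inputs.

```scala
object TreeAggregateSketch {
  // Per-partition fold: the "seqOp" step, run independently on each partition.
  def partials(partitions: Seq[Seq[Double]]): Seq[Double] =
    partitions.map(_.foldLeft(0.0)(_ + _))

  // Flat aggregation: every partial result is merged in one place
  // (in Spark's `aggregate`, that place is the driver).
  // Returns (result, number of inputs merged at the final step).
  def flatCombine(ps: Seq[Double]): (Double, Int) =
    (ps.foldLeft(0.0)(_ + _), ps.size)

  // Tree aggregation (hypothetical simplification): merge partials in groups
  // of `fanIn` per level until at most `fanIn` remain, then merge those.
  // Returns (result, number of inputs merged at the final step).
  @annotation.tailrec
  def treeCombine(ps: Seq[Double], fanIn: Int): (Double, Int) = {
    require(fanIn >= 2, "fanIn must be at least 2")
    if (ps.size <= fanIn) (ps.foldLeft(0.0)(_ + _), ps.size)
    else treeCombine(ps.grouped(fanIn).map(_.sum).toSeq, fanIn)
  }

  def main(args: Array[String]): Unit = {
    // 64 simulated partitions, each holding 10 copies of its index.
    val partitions = (1 to 64).map(i => Seq.fill(10)(i.toDouble))
    val ps = partials(partitions)
    val (flat, flatInputs) = flatCombine(ps)
    val (tree, treeInputs) = treeCombine(ps, fanIn = 4)
    println(s"flat: sum=$flat, inputs at final step=$flatInputs")
    println(s"tree: sum=$tree, inputs at final step=$treeInputs")
  }
}
```

Both strategies produce the same sum; only the number of partial results arriving at the last merge differs. That matters when each partial is large, as it is for `ExpectationSum` with many features.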