[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305058#comment-14305058 ]
Manoj Kumar edited comment on SPARK-5021 at 2/4/15 1:18 PM:
------------------------------------------------------------

Hmm. I figured it out; it is because I have something like this:

    val trainData =
      if (isSparse) data.map(sample => sample.asInstanceOf[SparseVector]).cache()
      else data.map(u => u.toBreeze.toDenseVector).cache()

Since trainData can now have two possible element types, this statement returns an error:

    val sums = {
      if (isSparse) {
        val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
        trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
      } else {
        val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
        trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
      }
    }

    [error] found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
    [error] required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
    [error]     trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)

What is the best way to overcome this?

> GaussianMixtureEM should be faster for SparseVector input
> ---------------------------------------------------------
>
>                 Key: SPARK-5021
>                 URL: https://issues.apache.org/jira/browse/SPARK-5021
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors. It would
> be nice if it were faster for SparseVectors (running in time linear in the
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done
> in high dimensions.
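The compiler complains because `trainData` is typed with an existential element type (`_0`), so the concrete `SparseVector` seqOp no longer matches. One Spark-agnostic way around this is to keep the element type as a type parameter and pick the concrete function *before* erasing the type. The sketch below is a minimal, Spark-free illustration of that pattern; `Vec`, `Sum`, and `aggregateSums` are hypothetical stand-ins for the RDD/ExpectationSum types, not Spark APIs:

```scala
// Hypothetical stand-ins for the MLlib vector hierarchy and ExpectationSum.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(indices: Array[Int], values: Array[Double]) extends Vec

final case class Sum(total: Double)

object Demo {
  // Generic over the concrete element type V, so the seqOp's input type
  // matches the collection's element type exactly -- no existential _0.
  def aggregateSums[V <: Vec](data: Seq[V])(zero: Sum)(add: (Sum, V) => Sum): Sum =
    data.foldLeft(zero)(add)

  def main(args: Array[String]): Unit = {
    val sparseData: Seq[SparseVec] =
      Seq(SparseVec(Array(0), Array(1.0)), SparseVec(Array(1), Array(2.0)))

    // The sparse-specific seqOp is chosen while the type is still SparseVec.
    val addSparse: (Sum, SparseVec) => Sum =
      (s, v) => Sum(s.total + v.values.sum)

    val result = aggregateSums(sparseData)(Sum(0.0))(addSparse)
    println(result.total) // 3.0
  }
}
```

The same shape works in the Spark version: move the branch into a generic helper (or pattern-match on `Vector` inside a single seqOp) so the aggregate call never sees a widened existential type.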