[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

Manoj Kumar (JIRA) Wed, 04 Feb 2015 05:17:02 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305058#comment-14305058
 ]


Manoj Kumar commented on SPARK-5021:
------------------------------------

Hmm, I figured it out: it is because I have something like this.

    val trainData = {
        if (isSparse)
            data.map(sample => sample.asInstanceOf[SparseVector]).cache()
        else
            data.map(u => u.toBreeze.toDenseVector).cache()
    }

Now, since trainData can have one of two possible element types, the 
following statement fails to compile.

      val sums = {
        if (isSparse) {
          val compute = sc.broadcast(ExpectationSum.addSparse(weights, gaussians)_)
          trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
        }
        else {
          val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
          trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
        }
      }

[error]  found   : (org.apache.spark.mllib.clustering.ExpectationSum, org.apache.spark.mllib.linalg.SparseVector) => org.apache.spark.mllib.clustering.ExpectationSum
[error]  required: (org.apache.spark.mllib.clustering.ExpectationSum, _0) => org.apache.spark.mllib.clustering.ExpectationSum
[error]           trainData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)


What is the best way to overcome this?
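
One possible direction, sketched here with plain Scala collections rather 
than RDDs: give both branches the same static element type and dispatch on 
the concrete vector kind inside the combine function, so that a single 
function can be passed to the aggregation. Every name below (Vec, Dense, 
Sparse, Sum, Demo) is a hypothetical stand-in and not the MLlib API, and 
foldLeft stands in for RDD.aggregate. This is only a sketch under those 
assumptions, not necessarily how it should be done in GaussianMixtureEM.

```scala
// Hypothetical stand-ins for the dense and sparse vector types.
sealed trait Vec
final case class Dense(values: Array[Double]) extends Vec
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Hypothetical accumulator playing the role of ExpectationSum.
final case class Sum(total: Double)

object Sum {
  val zero: Sum = Sum(0.0)

  // Dense path: visits every entry.
  def addDense(acc: Sum, v: Dense): Sum = Sum(acc.total + v.values.sum)

  // Sparse path: visits only the stored (non-zero) entries.
  def addSparse(acc: Sum, v: Sparse): Sum = Sum(acc.total + v.values.sum)

  // Single entry point: the element type is always Vec, so the
  // aggregation sees one function type whichever branch is taken.
  def add(acc: Sum, v: Vec): Sum = v match {
    case d: Dense  => addDense(acc, d)
    case s: Sparse => addSparse(acc, s)
  }
}

object Demo {
  // The collection keeps the common element type Vec, so no
  // existential type appears at the call site.
  def total(data: Seq[Vec]): Sum = data.foldLeft(Sum.zero)(Sum.add)
}
```

An alternative with the same effect would be to run the whole aggregation 
inside each branch of the if/else, so that each branch only ever sees its 
own concrete element type.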

> GaussianMixtureEM should be faster for SparseVector input
> ---------------------------------------------------------
>
>                 Key: SPARK-5021
>                 URL: https://issues.apache.org/jira/browse/SPARK-5021
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

