Hi Chunnan,

There is currently Scala documentation for the constructor parameters:
https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala#L515

There is one benefit to not checking for validity (ordering) within the
constructor: If you need to translate between SparseVector and some other
library's type (e.g., Breeze), you can do so with a few reference copies,
rather than iterating through or copying the actual data.  It might be good
to provide this check within Vectors.sparse(), but we'd need to check
through MLlib for uses of Vectors.sparse which expect it to be a cheap
operation.  What do you think?
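If we did add a check, it could be a single linear pass over the indices. Here is a sketch of what Vectors.sparse might do (plain Scala, not the actual MLlib code; the helper name is mine). Note it costs one O(nnz) scan per vector, which is exactly the overhead we would need to audit callers for:

```scala
// Sketch of a validation pass Vectors.sparse could perform before
// constructing the SparseVector. Requires strictly increasing indices
// (which also rules out duplicates).
def requireSorted(indices: Array[Int]): Unit = {
  var i = 1
  while (i < indices.length) {
    require(indices(i) > indices(i - 1),
      s"indices must be strictly increasing; found ${indices(i - 1)} followed by ${indices(i)}")
    i += 1
  }
}
```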

It is documented in the programming guide too:
https://github.com/apache/spark/blob/04525c077c638a7e615c294ba988e35036554f5f/docs/mllib-data-types.md
But perhaps that should be more prominent.

If you think it would be helpful, then please do make a JIRA about adding a
check to Vectors.sparse().
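In the meantime, callers can sort the parallel arrays themselves before calling Vectors.sparse. A minimal sketch (plain Scala; the helper name is mine):

```scala
// Sort the (index, value) pairs by index so the resulting arrays can
// be passed safely to Vectors.sparse(size, sortedIndices, sortedValues).
def sortParallel(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
  val pairs = indices.zip(values).sortBy(_._1)
  (pairs.map(_._1), pairs.map(_._2))
}
```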

Joseph

On Wed, Apr 22, 2015 at 8:29 AM, Chunnan Yao <yaochun...@gmail.com> wrote:

> Hi all,
> I am using Spark 1.3.1 to write a spectral clustering algorithm, and this
> really confused me today. At first I thought my implementation was wrong;
> it turned out to be an issue in MLlib. Fortunately, I've figured it out.
>
> I suggest adding a note to the MLlib user documentation (as far as I know,
> there is no such note yet) that the indices of a local SparseVector must
> be in ascending order. Because I was unaware of this, I spent a lot of
> time looking for reasons why computeSVD of RowMatrix did not run correctly
> on sparse data. I don't know how unordered indices affect other functions,
> but I believe it is necessary to let users know, or to fix it. Actually,
> it's very easy to fix: just sort the indices (together with their values)
> when constructing a SparseVector.
>
> Here is an example reproducing the effect of unordered sparse vectors on
> computeSVD.
> ================================================
> // in spark-shell, Spark 1.3.1
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.mllib.linalg.Vectors
>
> val sparseData_ordered = Seq(
>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>   Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
> )
> val sparseMat_ordered =
>   new RowMatrix(sc.parallelize(sparseData_ordered, 2))
>
> val sparseData_not_ordered = Seq(
>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>   Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>   Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
> )
> val sparseMat_not_ordered =
>   new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))
>
> // sparseMat_ordered and sparseMat_not_ordered are mathematically the
> // same matrix, yet computeSVD gives different results for the two.
> // Users should be notified about this situation.
> println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
> println("===================")
> println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
> ======================================================
> The results are:
> ordered:
> [-0.10972870132786407,-0.18850811494220537]
> [-0.44712472003608356,-0.24828866611663725]
> [-0.784520738744303,-0.3080692172910691]
> [-0.4154110101064339,0.8988385762953358]
>
> not ordered:
> [-0.10830447119599484,-0.1559341848984378]
> [-0.4522713511277327,-0.23449829541447448]
> [-0.7962382310594706,-0.3130624059305111]
> [-0.43131320303494614,0.8453864703362308]
>
> Looking into this issue, I can see its cause is in RowMatrix.scala (line
> 629): the sparse dspr implementation there requires ordered indices,
> because it scans the indices consecutively to skip empty columns.
>
>
>
> -----
> Feel the sparking Spark!
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Indices-of-SparseVector-must-be-ordered-while-computing-SVD-tp11731.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
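The failure mode described above is not specific to dspr: any code that assumes sorted indices can silently misread an unsorted SparseVector. A toy illustration of the same class of bug, using binary search (an analogy, not the actual dspr code):

```scala
// Binary search over the index array, as code may do when it assumes
// ascending order. Correct for sorted indices; for unsorted indices it
// can report 0.0 for an entry that is actually present.
def valueAt(indices: Array[Int], values: Array[Double], i: Int): Double = {
  val j = java.util.Arrays.binarySearch(indices, i)
  if (j >= 0) values(j) else 0.0
}
```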
