Having ordered indices is part of the contract of SparseVector: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector. We do not verify it, for performance reasons.

-Xiangrui
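
Since the contract is not checked, callers have to sort the indices themselves before building a vector. The following is only a minimal sketch of the sortBy idea raised in the quoted message below. `sortedSparseArrays` is a hypothetical helper name, not part of MLlib, and the sketch assumes the index array contains no duplicates.

```scala
// Hypothetical helper (not an MLlib API): reorders an index array and its
// value array together so that the indices end up in ascending order.
// Assumes there are no duplicate indices.
def sortedSparseArrays(
    indices: Array[Int],
    values: Array[Double]): (Array[Int], Array[Double]) = {
  require(indices.length == values.length,
    "indices and values must have the same length")
  // Pair each index with its value, sort the pairs by index, then split again.
  indices.zip(values).sortBy(_._1).unzip
}

// Usage in spark-shell (needs org.apache.spark.mllib.linalg.Vectors):
//   val (idx, vals) = sortedSparseArrays(Array(2, 1, 0), Array(5.0, 4.0, 3.0))
//   val v = Vectors.sparse(3, idx, vals)
```

Sorting on the caller's side keeps the cost out of code paths that already produce ordered indices, which fits the performance concern stated above.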

On Wed, Apr 22, 2015 at 8:26 AM, yaochunnan <yaochun...@gmail.com> wrote:
> Hi all,
> I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really
> confused me today. At first I thought my implementation was wrong. It turns
> out it's an issue in MLlib. Fortunately, I've figured it out.
>
> I suggest adding a note to the MLlib user documentation (as far as I know,
> there is no such note yet) that the indices of a local SparseVector must be
> in ascending order. Because I was unaware of this, I spent a lot of time
> looking for the reason why computeSVD of RowMatrix did not run correctly on
> sparse data. I don't know how a SparseVector with unordered indices affects
> other functions, but I believe it is necessary either to let users know or
> to fix it. Actually, it's very easy to fix: just add a sortBy call in the
> internal construction of SparseVector.
>
> Here is an example that reproduces the effect of unordered SparseVector
> indices on computeSVD.
> ================================================
> // in spark-shell, Spark 1.3.1
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, Vectors}
>
> val sparseData_ordered = Seq(
>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>   Vectors.sparse(3, Array(0, 2), Array(9.0, 1.0))
> )
> val sparseMat_ordered = new RowMatrix(sc.parallelize(sparseData_ordered, 2))
>
> val sparseData_not_ordered = Seq(
>   Vectors.sparse(3, Array(1, 2), Array(1.0, 2.0)),
>   Vectors.sparse(3, Array(2, 1, 0), Array(5.0, 4.0, 3.0)),
>   Vectors.sparse(3, Array(0, 1, 2), Array(6.0, 7.0, 8.0)),
>   Vectors.sparse(3, Array(2, 0), Array(1.0, 9.0))
> )
> val sparseMat_not_ordered = new RowMatrix(sc.parallelize(sparseData_not_ordered, 2))
>
> // sparseMat_ordered and sparseMat_not_ordered represent the same matrix;
> // however, their computeSVD results differ. Users should be notified
> // about this situation.
> println(sparseMat_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
> println("===================")
> println(sparseMat_not_ordered.computeSVD(2, true).U.rows.collect.mkString("\n"))
> ======================================================
> The results are:
> ordered:
> [-0.10972870132786407,-0.18850811494220537]
> [-0.44712472003608356,-0.24828866611663725]
> [-0.784520738744303,-0.3080692172910691]
> [-0.4154110101064339,0.8988385762953358]
>
> not ordered:
> [-0.10830447119599484,-0.1559341848984378]
> [-0.4522713511277327,-0.23449829541447448]
> [-0.7962382310594706,-0.3130624059305111]
> [-0.43131320303494614,0.8453864703362308]
>
> Looking into this issue, I can see that its cause is in RowMatrix.scala
> (line 629). The implementation of sparse dspr there requires ordered
> indices, because it scans the indices consecutively to skip empty columns.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/the-indices-of-SparseVector-must-be-ordered-while-computing-SVD-tp22611.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org