zhengruifeng created SPARK-13970:
------------------------------------

             Summary: Add Non-Negative Matrix Factorization to MLlib
                 Key: SPARK-13970
                 URL: https://issues.apache.org/jira/browse/SPARK-13970
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: zhengruifeng
            Priority: Minor
NMF finds two non-negative matrices (W, H) whose product W * H.T approximates a non-negative matrix X. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction.

NMF is already implemented in several packages:
Scikit-Learn (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
R NMF (https://cran.r-project.org/web/packages/NMF/index.html)
LibNMF (http://www.univie.ac.at/rlcta/software/)

I have implemented it in MLlib according to the following papers:
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
Algorithms for Non-negative Matrix Factorization (http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)

It can be used like this:

val m = 4
val n = 3
val data = Seq(
  (0L, Vectors.dense(0.0, 1.0, 2.0)),
  (1L, Vectors.dense(3.0, 4.0, 5.0)),
  (3L, Vectors.dense(9.0, 0.0, 1.0))
).map(x => IndexedRow(x._1, x._2))
val indexedRows = sc.parallelize(data)
val A = new IndexedRowMatrix(indexedRows).toCoordinateMatrix()
val k = 2

// run the NMF algorithm
val r = NMF.solve(A, k, 10)

val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
1.1349295096806706  1.4423101890626953E-5
3.453054133110303   0.46312492493865615
0.0                 0.0
0.3133764134585149  2.70684017255672

val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4184163313845057  3.2719352525149286
1.12188012613645    0.002939823716977737
1.456499371939653   0.18992996116069297

val R = rW.multiply(rH.transpose)
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4749202332761286  1.273254903877907    1.6530268574248572
2.9601290106732367  3.8752743120480346   5.117332475154927
0.0                 0.0                  0.0
8.987727592773672   0.35952840319637736  0.9705425982249293

val AD = A.toBlockMatrix().toLocalMatrix()
>>> org.apache.spark.mllib.linalg.Matrix =
0.0  1.0  2.0
3.0  4.0  5.0
0.0  0.0  0.0
9.0  0.0  1.0

var loss = 0.0
for (i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
  val diff = AD(i, j) - R(i, j)
  loss += diff * diff
}
loss
>>> Double = 0.5817999580912183

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
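As background, the multiplicative update rules from the Lee & Seung paper referenced in the issue can be sketched on local dense matrices. This is a minimal single-machine illustration using plain Scala arrays, not the distributed MLlib implementation; the object name `NmfSketch` and its helpers are hypothetical.

```scala
object NmfSketch {
  type Mat = Array[Array[Double]]

  // Naive dense product of a (m x p) and b (p x n).
  def mul(a: Mat, b: Mat): Mat = {
    val p = a(0).length
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      (0 until p).map(t => a(i)(t) * b(t)(j)).sum
    }
  }

  def transpose(a: Mat): Mat =
    Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

  // Multiplicative-update NMF for X ~= W * H^T, with
  //   H <- H .* (X^T W) ./ (H W^T W)
  //   W <- W .* (X H)   ./ (W H^T H)
  // A small eps in the denominator avoids division by zero.
  def solve(x: Mat, k: Int, iters: Int, seed: Long = 0L): (Mat, Mat) = {
    val rnd = new scala.util.Random(seed)
    val m = x.length
    val n = x(0).length
    var w: Mat = Array.fill(m, k)(rnd.nextDouble())
    var h: Mat = Array.fill(n, k)(rnd.nextDouble())
    val eps = 1e-9
    for (_ <- 0 until iters) {
      val xtw = mul(transpose(x), w)
      val hwtw = mul(h, mul(transpose(w), w))
      h = Array.tabulate(n, k)((i, j) => h(i)(j) * xtw(i)(j) / (hwtw(i)(j) + eps))
      val xh = mul(x, h)
      val whth = mul(w, mul(transpose(h), h))
      w = Array.tabulate(m, k)((i, j) => w(i)(j) * xh(i)(j) / (whth(i)(j) + eps))
    }
    (w, h)
  }

  // Squared Frobenius reconstruction error ||X - W H^T||^2.
  def loss(x: Mat, w: Mat, h: Mat): Double = {
    val r = mul(w, transpose(h))
    (for (i <- x.indices; j <- x(0).indices)
      yield math.pow(x(i)(j) - r(i)(j), 2)).sum
  }
}
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative as long as the initial factors are non-negative, which is the key property of this scheme.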