[ https://issues.apache.org/jira/browse/SPARK-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-13970:
------------------------------------

    Assignee: Apache Spark

> Add Non-Negative Matrix Factorization to MLlib
> ----------------------------------------------
>
>                 Key: SPARK-13970
>                 URL: https://issues.apache.org/jira/browse/SPARK-13970
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: zhengruifeng
>            Assignee: Apache Spark
>            Priority: Minor
>
> NMF finds two non-negative matrices (W, H) whose product W * H.T
> approximates the non-negative matrix X. This factorization can be used, for
> example, for dimensionality reduction, source separation, or topic extraction.
>
> NMF has been implemented in several packages:
> Scikit-Learn
> (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
> R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
> LibNMF (http://www.univie.ac.at/rlcta/software/)
>
> I have implemented it in MLlib according to the following papers:
> Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data
> Analysis on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
> Algorithms for Non-negative Matrix Factorization
> (http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)
>
> It can be used like this:
>
> import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
> import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
>
> val m = 4
> val n = 3
> // Row index 2 is absent, so it is treated as an all-zero row.
> val data = Seq(
>   (0L, Vectors.dense(0.0, 1.0, 2.0)),
>   (1L, Vectors.dense(3.0, 4.0, 5.0)),
>   (3L, Vectors.dense(9.0, 0.0, 1.0))
> ).map(x => IndexedRow(x._1, x._2))
> // Assumes an existing SparkContext named sc.
> val indexedRows = sc.parallelize(data)
> val A = new IndexedRowMatrix(indexedRows, m, n).toCoordinateMatrix()
> val k = 2
> // run the nmf algo
> val r = NMF.solve(A, k, 10)
> val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 1.1349295096806706 1.4423101890626953E-5
> 3.453054133110303 0.46312492493865615
> 0.0 0.0
> 0.3133764134585149 2.70684017255672
> val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4184163313845057 3.2719352525149286
> 1.12188012613645 0.002939823716977737
> 1.456499371939653 0.18992996116069297
> val R = rW.multiply(rH.transpose)
> >>> org.apache.spark.mllib.linalg.DenseMatrix =
> 0.4749202332761286 1.273254903877907 1.6530268574248572
> 2.9601290106732367 3.8752743120480346 5.117332475154927
> 0.0 0.0 0.0
> 8.987727592773672 0.35952840319637736 0.9705425982249293
> val AD = A.toBlockMatrix().toLocalMatrix()
> >>> org.apache.spark.mllib.linalg.Matrix =
> 0.0 1.0 2.0
> 3.0 4.0 5.0
> 0.0 0.0 0.0
> 9.0 0.0 1.0
> // squared Frobenius reconstruction error between A and W * H.T
> var loss = 0.0
> for (i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
>   val diff = AD(i, j) - R(i, j)
>   loss += diff * diff
> }
> loss
> >>> Double = 0.5817999580912183
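
For readers who want to see what the factorization above is doing, here is a minimal local sketch of the multiplicative update rules from the second cited paper (Algorithms for Non-negative Matrix Factorization), using the same convention X ~ W * H.T. It is not the code proposed in this issue and does not use Spark. The names LocalNmfSketch, mul, loss and solve, the random initialization, the fixed seed and the small eps guard in the denominators are all assumptions made for illustration.

```scala
import scala.util.Random

// Hypothetical, single-machine illustration only; not the MLlib patch.
object LocalNmfSketch {
  type Mat = Array[Array[Double]]

  // Dense product C = A * B for row-major matrices.
  def mul(a: Mat, b: Mat): Mat = {
    val inner = b.length
    val cols = b(0).length
    a.map { row =>
      Array.tabulate(cols) { j =>
        var s = 0.0
        var p = 0
        while (p < inner) { s += row(p) * b(p)(j); p += 1 }
        s
      }
    }
  }

  // Squared Frobenius norm of X - W * H.T (same loss as in the message).
  def loss(x: Mat, w: Mat, h: Mat): Double = {
    val r = mul(w, h.transpose)
    var s = 0.0
    for (i <- x.indices; j <- x(i).indices) {
      val d = x(i)(j) - r(i)(j)
      s += d * d
    }
    s
  }

  // Returns (W, H) with W of size m x k and H of size n x k.
  def solve(x: Mat, k: Int, iters: Int, seed: Long = 42L): (Mat, Mat) = {
    val eps = 1e-9 // assumption: guards against division by zero
    val rnd = new Random(seed)
    val m = x.length
    val n = x(0).length
    // Strictly positive random start (an assumption, not from the issue).
    var w: Mat = Array.fill(m, k)(rnd.nextDouble() + 0.1)
    var h: Mat = Array.fill(n, k)(rnd.nextDouble() + 0.1)
    for (_ <- 0 until iters) {
      // W <- W .* (X H) ./ (W (H.T H))
      val xh = mul(x, h)
      val whth = mul(w, mul(h.transpose, h))
      w = Array.tabulate(m, k)((i, j) => w(i)(j) * xh(i)(j) / (whth(i)(j) + eps))
      // H <- H .* (X.T W) ./ (H (W.T W))
      val xtw = mul(x.transpose, w)
      val hwtw = mul(h, mul(w.transpose, w))
      h = Array.tabulate(n, k)((i, j) => h(i)(j) * xtw(i)(j) / (hwtw(i)(j) + eps))
    }
    (w, h)
  }
}
```

Calling LocalNmfSketch.solve on the 4 x 3 matrix from the message with k = 2 returns two non-negative factors, and LocalNmfSketch.loss gives the corresponding squared error. Because every update only multiplies by non-negative ratios, entries that start non-negative stay non-negative, and an all-zero row of X drives the matching row of W to zero, as in the rW output quoted above. The exact numbers will differ from those in the message since the initialization here is arbitrary.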