zhengruifeng created SPARK-13970:
------------------------------------

             Summary: Add Non-Negative Matrix Factorization to MLlib
                 Key: SPARK-13970
                 URL: https://issues.apache.org/jira/browse/SPARK-13970
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: zhengruifeng
            Priority: Minor


NMF finds two non-negative matrices (W, H) whose product W * H.T 
approximates a given non-negative matrix X. This factorization can be used, 
for example, for dimensionality reduction, source separation, or topic extraction.
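
One common way to make "approximates" precise is the squared Frobenius error. This is an assumption, since the issue does not name the objective, but it matches what the loss loop at the end of this message computes. With X of size m x n and a chosen rank k:

```latex
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H^{\top} \rVert_F^2
  = \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl( X_{ij} - (W H^{\top})_{ij} \bigr)^2,
\qquad W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}
```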

NMF is already implemented in several packages:
Scikit-Learn 
(http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF)
R-NMF (https://cran.r-project.org/web/packages/NMF/index.html)
LibNMF (http://www.univie.ac.at/rlcta/software/)

I have implemented it in MLlib, following these papers:
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis 
on MapReduce (http://research.microsoft.com/pubs/119077/DNMF.pdf)
Algorithms for Non-negative Matrix Factorization 
(http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf)
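
For reference, the multiplicative update rules from the second paper (Lee and Seung) can be sketched on small local arrays. This is a minimal single-machine sketch under stated assumptions, not the distributed MLlib implementation proposed in this issue: the names NmfSketch, solve, and loss, the random initialization, and the small eps added to the denominators are all illustrative choices.

```scala
object NmfSketch {
  type Mat = Array[Array[Double]]

  // Plain dense matrix product a * b.
  def matmul(a: Mat, b: Mat): Mat =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      var s = 0.0
      var t = 0
      while (t < b.length) { s += a(i)(t) * b(t)(j); t += 1 }
      s
    }

  def transpose(a: Mat): Mat =
    Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

  // Element-wise multiplicative step: a * num / (den + eps).
  // eps (an assumption of this sketch) guards against division by zero.
  def scale(a: Mat, num: Mat, den: Mat, eps: Double = 1e-9): Mat =
    Array.tabulate(a.length, a(0).length) { (i, j) =>
      a(i)(j) * num(i)(j) / (den(i)(j) + eps)
    }

  // Squared Frobenius error between x and w * h.T.
  def loss(x: Mat, w: Mat, h: Mat): Double = {
    val r = matmul(w, transpose(h))
    var s = 0.0
    for (i <- x.indices; j <- x(0).indices) {
      val d = x(i)(j) - r(i)(j)
      s += d * d
    }
    s
  }

  // Factor x (m x n) into w (m x k) and h (n x k) so that x ~ w * h.T.
  def solve(x: Mat, k: Int, iters: Int, seed: Long = 42L): (Mat, Mat) = {
    val rng = new scala.util.Random(seed)
    // Strictly positive random start so the updates stay non-negative.
    var w: Mat = Array.fill(x.length, k)(rng.nextDouble() + 0.1)
    var h: Mat = Array.fill(x(0).length, k)(rng.nextDouble() + 0.1)
    val xt = transpose(x)
    for (_ <- 0 until iters) {
      // H <- H .* (X.T W) ./ (H (W.T W))
      h = scale(h, matmul(xt, w), matmul(h, matmul(transpose(w), w)))
      // W <- W .* (X H) ./ (W (H.T H))
      w = scale(w, matmul(x, h), matmul(w, matmul(transpose(h), h)))
    }
    (w, h)
  }
}
```

Calling NmfSketch.solve(x, 2, 200) on a small non-negative matrix returns non-negative factors. The exact values depend on the random seed, and any all-zero row of x yields an all-zero row in W.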

It can be used like this:

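import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val m = 4  // number of rows of A
val n = 3  // number of columns of A
// Row index 2 is omitted, so that row of A is all zeros.
val data = Seq(
    (0L, Vectors.dense(0.0, 1.0, 2.0)),
    (1L, Vectors.dense(3.0, 4.0, 5.0)),
    (3L, Vectors.dense(9.0, 0.0, 1.0))
  ).map(x => IndexedRow(x._1, x._2))

// Assumes sc is the active SparkContext (as in spark-shell).
val indexedRows = sc.parallelize(data)
val A = new IndexedRowMatrix(indexedRows).toCoordinateMatrix()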
val k = 2

// run the NMF algorithm
val r = NMF.solve(A, k, 10)

val rW = r.W.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
1.1349295096806706  1.4423101890626953E-5
3.453054133110303   0.46312492493865615
0.0                 0.0
0.3133764134585149  2.70684017255672

val rH = r.H.toBlockMatrix().toLocalMatrix().asInstanceOf[DenseMatrix]
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4184163313845057  3.2719352525149286
1.12188012613645    0.002939823716977737
1.456499371939653   0.18992996116069297


val R = rW.multiply(rH.transpose)
>>> org.apache.spark.mllib.linalg.DenseMatrix =
0.4749202332761286  1.273254903877907    1.6530268574248572
2.9601290106732367  3.8752743120480346   5.117332475154927
0.0                 0.0                  0.0
8.987727592773672   0.35952840319637736  0.9705425982249293

val AD = A.toBlockMatrix().toLocalMatrix()
>>> org.apache.spark.mllib.linalg.Matrix =
0.0  1.0  2.0
3.0  4.0  5.0
0.0  0.0  0.0
9.0  0.0  1.0

var loss = 0.0
for(i <- 0 until AD.numRows; j <- 0 until AD.numCols) {
   val diff = AD(i, j) - R(i, j)
   loss += diff * diff
}
loss
>>> Double = 0.5817999580912183




