[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224871#comment-14224871 ]
Debasish Das commented on SPARK-2426:
-------------------------------------

Actually on the MovieLens dataset I am getting good MAP numbers with the EQUALITY constraint. The formulation is similar to PLSA but not exact; [~akopich], could you please help review whether my understanding is correct here? With rank 25, k \in {1...25}:

Minimize \sum_i \sum_j (r_ij - w_i^T h_j)^2 + lambda (||w_i||^2 + ||h_j||^2)
s.t. \sum_k w_ik = 1, w_ik >= 0
     \sum_k h_kj = 1, h_kj >= 0

This is not quite the stochastic matrix factorization that this paper http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf talks about, since PLSA needs the following constraint (I am reading it more) along with a log-likelihood loss: for each k, \sum_j h_kj = 1.

On the MovieLens dataset I run the EQUALITY version as follows (rank=50, 5 iterations; more iterations do not improve it further):

./bin/spark-submit --total-executor-cores 4 --executor-memory 4g --driver-memory 1g \
  --master spark://TUSCA09LMLVT00C.local:7077 \
  --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar \
  --class org.apache.spark.examples.mllib.MovieLensALS \
  ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar \
  --rank 50 --numIterations 5 \
  --userConstraint EQUALITY --lambdaUser 0.065 \
  --productConstraint EQUALITY --lambdaProduct 0.065 \
  --kryo --validateRecommendation \
  hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization userConstraint EQUALITY productConstraint EQUALITY
Test RMSE = 1.6970509086529808. Test users 6038 MAP 0.09333309533803603

So basically the best MAP results come from this formulation: a 2x improvement over the default of 4.8%. [~mengxr] [~srowen] it would be great if you could review the MAP calculation in https://issues.apache.org/jira/browse/SPARK-4231 and help merge it into mllib. I am keen to understand whether there are bugs in the calculation.
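To make the EQUALITY constraint concrete: each per-row subproblem keeps the factor on the probability simplex (\sum_k w_ik = 1, w_ik >= 0). The solvers above use ADMM/IPM, but as an illustrative sketch (not the MLlib code), here is the standard sort-based Euclidean projection onto the simplex that such a constraint corresponds to:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {x : sum(x) = 1, x >= 0}.

    Classic sort-and-threshold algorithm: find the shift theta such that
    max(v - theta, 0) sums to 1.
    """
    u = np.sort(v)[::-1]                  # sort descending
    css = np.cumsum(u)                    # running sums of sorted entries
    idx = np.arange(1, len(v) + 1)
    # largest index rho with u[rho] * (rho+1) > css[rho] - 1 (0-indexed)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# e.g. an unconstrained least-squares row solution, projected back:
w = project_to_simplex(np.array([0.5, 1.2, -0.3]))
# w sums to 1 and has no negative entries
```

A projected-gradient loop using this projection is one simple (if slower) alternative to the ADMM/IPM subproblem solvers for the equality-plus-bounds case.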
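For anyone reviewing the MAP calculation in SPARK-4231, here is a small, hypothetical reference implementation of mean average precision over ranked recommendations to compare against; the function name and the min(|relevant|, |recommended|) normalization are my assumptions, not necessarily what the patch or MLlib's RankingMetrics uses:

```python
def mean_average_precision(recommended, relevant):
    """MAP over users.

    recommended[u] is a ranked list of item ids for user u;
    relevant[u] is the set of held-out (test) items for user u.
    """
    ap_sum = 0.0
    for recs, rel in zip(recommended, relevant):
        if not rel:
            continue
        hits, precision_sum = 0, 0.0
        for rank, item in enumerate(recs, start=1):
            if item in rel:
                hits += 1
                precision_sum += hits / rank   # precision at this rank
        # normalize by the best achievable number of hits
        ap_sum += precision_sum / min(len(rel), len(recs))
    return ap_sum / len(recommended)
```

Differences in the normalization term (e.g. dividing by |relevant| instead) change the absolute MAP numbers, which is exactly the kind of discrepancy worth ruling out before comparing the 9.3% figure above against the 4.8% baseline.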
This is a bit surprising to me, since I have not finished the PLSA code (I am working on the bi-concave cost the paper points out), which means the results can improve further. Note the degradation in RMSE. I will do runs with the Netflix dataset, but on our internal dataset (2M x 20K) the trends look similar.

> Quadratic Minimization for MLlib ALS
> ------------------------------------
>
>                 Key: SPARK-2426
>                 URL: https://issues.apache.org/jira/browse/SPARK-2426
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Debasish Das
>            Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit.
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS

-- This message was sent by Atlassian JIRA (v6.3.4#6332)