[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360956#comment-14360956 ]

Debasish Das edited comment on SPARK-6323 at 3/15/15 4:29 PM:
--------------------------------------------------------------

g(z) is not regularization; we support constraints such as z >= 0; lb <= z <=
ub; 1'z = s with z >= 0; and L1(z) for now. These are the same constraints I
supported through QuadraticMinimizer for SPARK-2426. I have already migrated
ALS to use QuadraticMinimizer (default) and NNLS (positive), and I am waiting
for the next breeze release.

I call it z since we are using splitting algorithms for the solve
(projection-based or ADMM + proximal).
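
As a sketch of that splitting idea: ADMM solves min f(x) + g(z) subject to
x = z by alternating an f-step on x, a prox step on z, and a dual update.
Below is a minimal, purely illustrative version (not the Breeze solver) for
the quadratic f(x) = 0.5 x'Hx + c'x with diagonal H and g the indicator of
z >= 0; a real implementation would factorize H + rho*I once and back-solve
each iteration:

    object AdmmSketch {
      def solve(h: Array[Double], c: Array[Double],
                rho: Double = 1.0, iters: Int = 200): Array[Double] = {
        val n = h.length
        val x = new Array[Double](n)
        val z = new Array[Double](n)
        val u = new Array[Double](n)    // scaled dual variable
        for (_ <- 0 until iters) {
          var i = 0
          while (i < n) {
            // x-update: argmin 0.5*h_i*x^2 + c_i*x + (rho/2)*(x - z_i + u_i)^2
            x(i) = (rho * (z(i) - u(i)) - c(i)) / (h(i) + rho)
            // z-update: prox of g at x + u; for g = indicator of z >= 0
            // this is projection onto the nonnegative orthant
            z(i) = math.max(x(i) + u(i), 0.0)
            // dual update on the x = z consensus constraint
            u(i) += x(i) - z(i)
            i += 1
          }
        }
        z
      }
    }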

Sure, for papers on the global objective, refer to any PLSA paper with matrix
factorization. I personally like the following and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for
Stochastic Matrix Factorization, Equations (2) and (3)

2. The original PLSA paper from Hofmann

3. Collaborative filtering using PLSA from Hofmann: Latent Semantic Models
for Collaborative Filtering

4. Industry-specific application:
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large-rank matrix factorization the requirement also comes from sparse
topics, where the rank can easily reach ~10K.
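
A hypothetical sketch of the sparsity control this implies (the description
below mentions forcing 100-sparse factors at 10K ranks): after each factor
solve, keep only the k largest-magnitude entries as a sparse (index, value)
representation:

    // keep the k largest-magnitude entries of a dense factor, index-sorted
    def truncateTopK(factor: Array[Double], k: Int): Array[(Int, Double)] =
      factor.zipWithIndex
        .sortBy { case (v, _) => -math.abs(v) }
        .take(k)
        .map { case (v, i) => (i, v) }
        .sortBy(_._1)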

The idea can be implemented in the Sparse LDA JIRA as well
(https://issues.apache.org/jira/browse/SPARK-5564), and I asked [~josephkb]
whether he thinks we should do it in the LDA framework, but I don't think we
know yet which flow will scale better as the data moves from sparse to dense.

With the factorization flow I will start to see results next week, as the
flow is designed to handle these ideas. I will start to look into the
GraphX-based flow after that.


> Large rank matrix factorization with Nonlinear loss and constraints
> -------------------------------------------------------------------
>
>                 Key: SPARK-6323
>                 URL: https://issues.apache.org/jira/browse/SPARK-6323
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.4.0
>            Reporter: Debasish Das
>             Fix For: 1.4.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Currently ml.recommendation.ALS is optimized for Gram matrix generation,
> which scales to modest ranks. The problems that we can solve are in the
> normal equation/quadratic form: 0.5 x'Hx + c'x + g(z)
> g(z) can be one of the constraints from the Breeze proximal library:
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
> In this PR we will re-use the ml.recommendation.ALS design and come up with
> ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s
> recent changes, it's straightforward to do now!
> ALM will be capable of solving the following problem: min f(x) + g(z)
> 1. The loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss, or
> HingeLoss. Most likely we will re-use the Gradient interfaces already
> defined and implement LoglikelihoodLoss.
> 2. The constraints g(z) supported are the same as above, except that we
> don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely
> we don't need that for ML applications.
> 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer,
> which in turn uses a projection-based solver (SPG) or proximal solvers
> (ADMM) based on convergence speed.
> https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
> 4. The factors will be SparseVectors so that we keep the shuffle size in
> check. For example, we will run with 10K ranks but force the factors to be
> 100-sparse.
> This is closely related to Sparse LDA
> (https://issues.apache.org/jira/browse/SPARK-5564), with the difference
> that we are not using a graph representation here.
> As we do scaling experiments, we will understand which flow is better
> suited as the ratings get denser (my understanding is that since we have
> already scaled ALS to 2 billion ratings and we will keep sparsity in check,
> the same 2-billion-rating flow will scale to 10K ranks as well).
> This JIRA is intended to extend the capabilities of ml.recommendation to
> generalized loss functions.


