[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2018-12-23 Thread Debasish Das (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728020#comment-16728020
 ] 

Debasish Das commented on SPARK-24374:
--

Hi [~mengxr], with barrier mode available, is it not possible to use the native TF 
parameter server instead of MPI? Although we are offloading compute from Spark to 
TF workers/PS, if an exception comes out of the training, tracking it with the 
native TF API might be easier than tracking an MPI exception. Great work, by the 
way. I was looking for a Cloud ML alternative using Spark over AWS/Azure/GCP, and 
barrier mode looks like it should help a lot, although I am still not clear on the 
limitations of the TensorFlowOnSpark project from Yahoo 
[https://github.com/yahoo/TensorFlowOnSpark], which tried to add barrier-like 
syntax. If a few partitions fail on tfrecord reads or communication exceptions, it 
is not clear whether it re-runs the full job or only the failed partitions; I guess 
the exceptions from those partitions can be thrown back to the Spark driver, and the 
driver can decide how to re-run. When multiple TF training jobs get scheduled on the 
same Spark cluster, I suspect TFoS might have issues as well.
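For reference, a minimal sketch of what a barrier stage looks like from the RDD API 
(assuming Spark 2.4's barrier()/BarrierTaskContext; the external worker launch is 
only a placeholder):

{code}
import org.apache.spark.{BarrierTaskContext, SparkConf, SparkContext}

// run every task of the stage together and rendezvous before handing
// control to an external training framework (TF worker/PS is a placeholder)
object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("barrier-sketch"))
    val result = sc.parallelize(1 to 100, numSlices = 4).barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      val hosts = ctx.getTaskInfos().map(_.address) // addresses of all tasks in the stage
      // launch the external worker/PS here using ctx.partitionId() and hosts ...
      ctx.barrier() // wait until every task in the stage reaches this point
      iter
    }.collect()
    println(result.length)
    sc.stop()
  }
}
{code}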

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}






[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15809876#comment-15809876
 ] 

Debasish Das commented on SPARK-10078:
--

I looked into the code, and I see we are replicating the Breeze BFGS and OWLQN core 
logic in this PR:
https://github.com/yanboliang/spark-vlbfgs/tree/master/src/main/scala/org/apache/spark/ml/optim.

We could instead provide a DiffFunction interface that works on feature partitions and 
add the VL-BFGS paper's logic as a refactoring of the current Breeze BFGS code.

The DiffFunction could then run against either a DistributedVector or a local Vector. 
That way, even with fewer than 100M features we can run multi-core VL-BFGS by using 
multiple partitions, and no if-else switch between solvers is necessary.
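For illustration, a self-contained Breeze sketch of the DiffFunction contract on a 
local DenseVector (a toy quadratic objective; a distributed-vector backend would 
implement the same calculate signature):

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// the solver only sees calculate(w) = (loss, gradient), so the same interface
// can be backed by a local vector or, in principle, a partitioned one
object DiffFunctionSketch {
  def main(args: Array[String]): Unit = {
    val f = new DiffFunction[DenseVector[Double]] {
      // f(w) = ||w - 3||^2 with gradient 2 * (w - 3)
      def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val d = w - 3.0
        (d dot d, d * 2.0)
      }
    }
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
    println(lbfgs.minimize(f, DenseVector.zeros[Double](5))) // converges to all 3.0s
  }
}
{code}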

I can propose Breeze interfaces based on your PR if you agree with the idea. BFGS and 
OWLQN are just two variants; Breeze has several constrained solvers that reuse the 
BFGS code as well.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing






[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793770#comment-15793770
 ] 

Debasish Das commented on SPARK-10078:
--

[~mengxr] [~dlwh] Is it possible to implement VL-BFGS as part of Breeze so that 
OWLQN, LBFGS, LBFGS-B and proximal.NonlinearMinimizer all benefit from it? We could 
bring it in the same way we bring in LBFGS/OWLQN right now. If that makes sense, I can 
look at the design doc and propose a Breeze interface that abstracts the RDD details.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing






[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793760#comment-15793760
 ] 

Debasish Das edited comment on SPARK-10078 at 1/3/17 12:26 AM:
---

Ideally, feature partitioning should be tuned automatically. At 100M features, the 
master-only processing we do today with Breeze LBFGS/OWLQN would also benefit from 
VL-BFGS. Ideally it should be part of Breeze, with a proper interface defined so that 
the Breeze VL-BFGS solver can be called from Spark ML. There is also a bounded BFGS in 
Breeze; all of these solvers would benefit from this change. The solver could then be 
used in other frameworks as well and need not be constrained to RDDs.


was (Author: debasish83):
Ideally, feature partitioning should be tuned automatically. At 100M features, the 
master-only processing we do today with Breeze LBFGS/OWLQN would also benefit from 
VL-BFGS. Ideally it should be part of Breeze, with a proper interface defined so that 
the Breeze VL-BFGS solver can be called from Spark ML.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing






[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-02 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793760#comment-15793760
 ] 

Debasish Das commented on SPARK-10078:
--

Ideally, feature partitioning should be tuned automatically. At 100M features, the 
master-only processing we do today with Breeze LBFGS/OWLQN would also benefit from 
VL-BFGS. Ideally it should be part of Breeze, with a proper interface defined so that 
the Breeze VL-BFGS solver can be called from Spark ML.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15777650#comment-15777650
 ] 

Debasish Das edited comment on SPARK-13857 at 12/26/16 5:57 AM:


item->item and user->user recommendations were done in an old PR of mine; if there is 
interest I can resend it. It would be nice to see how it compares with the approximate 
nearest neighbor work from Uber:
https://github.com/apache/spark/pull/6213


was (Author: debasish83):
item->item and user->user recommendations were done in an old PR of mine; if there is 
interest I can resend it. It would be nice to see how it compares with the approximate 
nearest neighbor work from Uber:
https://github.com/apache/spark/pull/6213

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-12-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15777650#comment-15777650
 ] 

Debasish Das commented on SPARK-13857:
--

item->item and user->user recommendations were done in an old PR of mine; if there is 
interest I can resend it. It would be nice to see how it compares with the approximate 
nearest neighbor work from Uber:
https://github.com/apache/spark/pull/6213

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581366#comment-15581366
 ] 

Debasish Das commented on SPARK-5992:
-

Also, do you have a hash function for Euclidean distance? We use cosine, Jaccard and 
Euclidean with SPARK-4823. For the kNN comparison we can use an overlap metric: pick a 
k and then compare the overlap between the LSH-based approximate kNN and brute-force 
kNN (a small sketch follows). Let me know if you need help running the benchmarks.
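Something along these lines (an illustrative helper, not an existing MLlib API):

{code}
// fraction of the top-k exact neighbors that the approximate top-k recovers
def overlapAtK(approx: Seq[Long], exact: Seq[Long], k: Int): Double = {
  if (k == 0) 0.0
  else approx.take(k).toSet.intersect(exact.take(k).toSet).size.toDouble / k
}
{code}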

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581361#comment-15581361
 ] 

Debasish Das commented on SPARK-5992:
-

Did you compare with brute-force kNN? Normally LSH does not work well for 
nearest-neighbor queries, which is why hybrid spill trees and other ideas came 
along... I can run some comparisons using SPARK-4823.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Commented] (SPARK-4823) rowSimilarities

2016-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581359#comment-15581359
 ] 

Debasish Das commented on SPARK-4823:
-

We use it in multiple use cases internally but did not get time to refactor the PR 
into 3 smaller PRs... I will update the PR to 2.0.

> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf, 
> SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.






[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2016-08-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411023#comment-15411023
 ] 

Debasish Das commented on SPARK-6932:
-

[~rxin] [~sowen] Do we have any other active parameter server effort going on other 
than the glint project from Rolf? I have started to look into glint to scale 
Spark-as-a-Service for query processing (the idea is to keep the Spark master as a 
coordinator, with zero compute happening on the master other than coordination 
through messages; in our implementation right now compute happens on the master, 
which is a major drawback). More details will be covered in the talk 
https://spark-summit.org/eu-2016/events/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building/
 but I believe a parameter server (or something similar) will be needed to scale 
query processing further, for example to a Cassandra ring architecture. We will 
provide our implementation of the Spark-Lucene integration as open source in our 
framework (Trapezium).


> A Prototype of Parameter Server
> ---
>
> Key: SPARK-6932
> URL: https://issues.apache.org/jira/browse/SPARK-6932
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib, Spark Core
>Reporter: Qiping Li
>
>  h2. Introduction
> As specified in 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
> very helpful to integrate a parameter server into Spark for machine learning 
> algorithms, especially for those with ultra-high-dimensional features. 
> After carefully studying the design doc of [Parameter 
> Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing], and
>  the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
> we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with 
> several key design concerns:
> * *User friendly interface*
>   A careful investigation was done of most existing Parameter Server 
> systems (including [petuum|http://petuum.github.io], [parameter 
> server|http://parameterserver.org], 
> [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
> designed by absorbing the essence of all these systems. 
> * *Prototype of distributed array*
> IndexRDD (see 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
> to be a good option for a distributed array, because in most cases the number of 
> key updates per second is not very high. 
> So we implement a distributed HashMap to store the parameters, which can 
> be easily extended to get better performance.
> 
> * *Minimal code change*
>   Quite a lot of effort is spent to avoid changing Spark core. Tasks 
> which need the parameter server are still created and scheduled by Spark's 
> scheduler. Tasks communicate with the parameter server through a client object, 
> over *akka* or *netty*.
> With all these concerns we propose the following architecture:
> h2. Architecture
> !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
> Data is stored in RDD and is partitioned across workers. During each 
> iteration, each worker gets parameters from parameter server then computes 
> new parameters based on old parameters and data in the partition. Finally 
> each worker pushes its parameter updates back to the parameter server. Each worker 
> communicates with the parameter server through a parameter server client, which is 
> initialized in the `TaskContext` of that worker.
> The current implementation is based on YARN cluster mode, 
> but it should not be a problem to port it to other modes. 
> h3. Interface
> We refer to existing parameter server systems (petuum, parameter server, 
> paracel) when designing the interface of the parameter server. 
> *`PSClient` provides the following interface for workers to use:*
> {code}
> //  get parameter indexed by key from parameter server
> def get[T](key: String): T
> // get multiple parameters from parameter server
> def multiGet[T](keys: Array[String]): Array[T]
> // add parameter indexed by `key` by `delta`, 
> // if there are multiple `delta`s to apply to the same parameter,
> // use `reduceFunc` to reduce these `delta`s first.
> def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
> // update multiple parameters at the same time, use the same `reduceFunc`.
> def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => 
> T): Unit
> 
> // advance clock to indicate that current iteration is finished.
> def clock(): Unit
>  
> // block until all workers have reached this line of code.
> def sync(): Unit
> {code}
> *`PSContext` provides following functions to use on driver:*
> {code}
> // load parameters from existing rdd.
> def loadPSModel[T](model: RDD[String, T]) 
> // fetch parameters from parameter server to construct 

[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM:
-

Do you have runtime comparisons showing that when features <= 4096, OLS using normal 
equations is faster than BFGS? I am extending OLS to sparse features, and it would be 
great if you could point to the runtime experiments you have done.


was (Author: debasish83):
Do you have runtime comparisons showing that when features <= 4096, OLS using normal 
equations is faster than BFGS?

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.
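A minimal Breeze sketch of that idea (accumulate AtA and Atb in one pass, then a 
dense solve on the driver); illustrative only, not Spark's actual implementation:

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

// rows: (featureVector, label); numFeatures must stay small (e.g. <= 4096)
def normalEquationSolve(rows: Iterator[(DenseVector[Double], Double)],
                        numFeatures: Int): DenseVector[Double] = {
  val ata = DenseMatrix.zeros[Double](numFeatures, numFeatures)
  val atb = DenseVector.zeros[Double](numFeatures)
  rows.foreach { case (a, b) =>
    ata += a * a.t   // accumulate A^T A as rank-1 updates
    atb += a * b     // accumulate A^T b
  }
  ata \ atb          // solve the normal equations on the driver
}
{code}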






[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das commented on SPARK-9834:
-

Do you have runtime comparisons showing that when features <= 4096, OLS using normal 
equations is faster than BFGS?

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735706#comment-14735706
 ] 

Debasish Das commented on SPARK-10408:
--

[~avulanov] In MLP, can we change BFGS to OWLQN and get L1 regularization? That way I 
can get sparse weights and prune the network to avoid overfitting (see the sketch 
below). For the autoencoder, did you experiment with a GraphX-based design? I would 
like to work on it. Basically, the idea is to come up with an N-layer deep autoencoder 
that can support prediction APIs similar to matrix factorization.
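In Breeze the swap is roughly the following (a toy smooth objective; the actual MLP 
gradient and topology in Spark are untouched here, and the l1reg value is only an 
example):

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

// OWLQN = L-BFGS plus an L1 penalty handled inside the solver,
// so the smooth loss/gradient code stays the same as for BFGS
object OwlqnSketch {
  def main(args: Array[String]): Unit = {
    val smoothLoss = new DiffFunction[DenseVector[Double]] {
      def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val d = w - 3.0
        (d dot d, d * 2.0) // toy quadratic loss and gradient
      }
    }
    val owlqn = new OWLQN[Int, DenseVector[Double]](100, 7, 0.5) // maxIter, m, l1reg
    println(owlqn.minimize(smoothLoss, DenseVector.zeros[Double](10)))
  }
}
{code}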

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers: 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf






[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares

2015-09-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734170#comment-14734170
 ] 

Debasish Das edited comment on SPARK-9834 at 9/8/15 3:18 PM:
-

[~mengxr] If you are open to using breeze.optimize.proximal.QuadraticMinimizer, we can 
support elastic net in this variant as well. The flow will be very similar to the 
QuadraticMinimizer integration into ALS. I have done runtime benchmarks against OWLQN, 
and if we can afford a dense Cholesky, QuadraticMinimizer converges faster than OWLQN.

There are two new QuadraticMinimizer features I am working on which will further 
improve the solver:
1. Sparse LDL through Tim Davis's LGPL code, using Breeze sparse matrices for sparse 
gram and conic formulations. The plan is to add it to breeze-native under LGPL, 
similar to the netlib-java integration.
2. ADMM acceleration using Nesterov's method. ADMM can then be run with the same 
complexity as FISTA (implemented in TFOCS). Reference: 
http://www.optimization-online.org/DB_FILE/2009/12/2502.pdf

In practice I found that even the ADMM implemented right now in QuadraticMinimizer 
converges faster than OWLQN. Tom demonstrated in his paper faster ADMM convergence 
compared to FISTA for quadratic problems: 
ftp://ftp.math.ucla.edu/pub/camreport/cam12-35.pdf. 

Because X^T X is available in these problems (ALS and linear regression), I also 
compute the min and max eigenvalues using power iteration 
(breeze.optimize.linear.PowerMethod), which gives the Lipschitz estimate L, so there 
is no line-search overhead. This trick did not work for the nonlinear variant because 
the Hessian estimates are not close to the gram matrix.
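Roughly what that power-iteration estimate looks like (a hand-rolled, illustrative 
sketch; not the breeze.optimize.linear.PowerMethod code itself):

{code}
import breeze.linalg.{DenseMatrix, DenseVector, norm}

// estimate the largest eigenvalue of a gram matrix H = X^T X; usable as a
// Lipschitz constant so no line search is needed
def maxEigenvalue(h: DenseMatrix[Double], iters: Int = 50): Double = {
  var v = DenseVector.rand(h.cols)
  v /= norm(v)
  var lambda = 0.0
  for (_ <- 0 until iters) {
    val hv = h * v
    lambda = norm(hv)
    v = hv / lambda
  }
  lambda
}
{code}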

QuadraticMinimizer is optimized to run on par with BLAS dposv when there are no 
constraints, while BFGS/OWLQN still have a lot of overhead from iterators etc. That 
might also be why I see QuadraticMinimizer running faster than BFGS/OWLQN.

It might also be the right time to do the micro-benchmark you asked for on 
QuadraticMinimizer. Let me know what you think. I can finish up the micro-benchmark, 
bring QuadraticMinimizer's runtime into the ALS NormalEquationSolver comparison, and 
then start the L1 experiments.


was (Author: debasish83):
If you are open to using breeze.optimize.proximal.QuadraticMinimizer, we can support 
elastic net in this variant as well. I can add it on top of your PR; it will be very 
similar to the QuadraticMinimizer integration into ALS. I have done runtime benchmarks 
against OWLQN, and if we can afford a dense Cholesky, QuadraticMinimizer converges 
faster than OWLQN. There are two new features I am working on: sparse LDL through Tim 
Davis's LGPL code, using Breeze sparse matrices for sparse gram and conic 
formulations, and ADMM acceleration using Nesterov's method. ADMM can also be run with 
the same complexity as FISTA, as Goldfarb proved.

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares

2015-09-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734170#comment-14734170
 ] 

Debasish Das commented on SPARK-9834:
-

If you are open to using breeze.optimize.proximal.QuadraticMinimizer, we can support 
elastic net in this variant as well. I can add it on top of your PR; it will be very 
similar to the QuadraticMinimizer integration into ALS. I have done runtime benchmarks 
against OWLQN, and if we can afford a dense Cholesky, QuadraticMinimizer converges 
faster than OWLQN. There are two new features I am working on: sparse LDL through Tim 
Davis's LGPL code, using Breeze sparse matrices for sparse gram and conic 
formulations, and ADMM acceleration using Nesterov's method. ADMM can also be run with 
the same complexity as FISTA, as Goldfarb proved.

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2015-09-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734130#comment-14734130
 ] 

Debasish Das commented on SPARK-10078:
--

[~mengxr] Will this be a Breeze LBFGS modification or part of mllib.optimization? Is 
someone looking into it?

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).






[jira] [Updated] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4823:

Attachment: SparkMeetup2015-Experiments2.pdf
SparkMeetup2015-Experiments1.pdf

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf, 
 SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Commented] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648340#comment-14648340
 ] 

Debasish Das commented on SPARK-4823:
-

We did a more detailed experiment for the July 2015 Spark Meetup to understand the 
effect of shuffle on runtime. I attached the data for the experiments to the JIRA. I 
will update the PR as discussed with Reza. I am targeting one PR for Spark 1.5.


 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583886#comment-14583886
 ] 

Debasish Das commented on SPARK-2336:
-

Very cool idea, Sen. Did you also look into FLANN for randomized KD-trees and k-means 
trees? We have a PR for rowSimilarities which we will use to compare the QoR of your 
PR as soon as you open up a stable version.


 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.






[jira] [Comment Edited] (SPARK-2336) Approximate k-NN Models for MLLib

2015-06-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583886#comment-14583886
 ] 

Debasish Das edited comment on SPARK-2336 at 6/12/15 6:51 PM:
--

Very cool idea, Sen. Did you also look into FLANN for randomized KD-trees and k-means 
trees? We have a PR for rowSimilarities, 
https://github.com/apache/spark/pull/6213, for brute-force kNN generation, which we 
will use to compare the QoR of your PR as soon as you open up a stable version.



was (Author: debasish83):
Very cool idea, Sen. Did you also look into FLANN for randomized KD-trees and k-means 
trees? We have a PR for rowSimilarities which we will use to compare the QoR of your 
PR as soon as you open up a stable version.


 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1&language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.






[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-05-28 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Affects Version/s: (was: 1.4.0)

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Debasish Das
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
 likely we will re-use the Gradient interfaces already defined and implement 
 LoglikelihoodLoss
 2. Constraints g ( z ) supported are the same as above, except that we don't 
 support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications.
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.






[jira] [Updated] (SPARK-4823) rowSimilarities

2015-05-23 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4823:

Attachment: MovieLensSimilarity Comparisons.pdf

The attached file shows the runtime comparison of the row-based and column-based 
flows on all items of the MovieLens dataset, on my local MacBook with 8 cores, 1 GB 
driver memory, and 4 GB executor memory.

1e-2 is the threshold set for both the row-based kernel flow and the column-based 
DIMSUM flow.

Stages 24-35 are the row similarity flow. Total runtime ~ 20 s.

Stage 64 is the column similarity mapPartitions. Total runtime ~ 4.6 min.

This shows the power of blocking in Spark, and I have not yet moved to gemv, which 
will decrease the runtime further.

I updated the driver code in examples.mllib.MovieLensSimilarity.




 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
 Attachments: MovieLensSimilarity Comparisons.pdf


 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-05-23 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557416#comment-14557416
 ] 

Debasish Das commented on SPARK-2426:
-

[~mengxr] Should I add the PR to Spark Packages and close the JIRA? The main 
contribution was to add sparsity constraints (L1 and probability simplex) to the user 
and product factors in implicit and explicit feedback factorization, and interested 
users can pick up the features from Spark Packages if they need them. Later, if there 
is community interest, we can pull it into master ALS.
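For context, the L1 constraint boils down to a soft-thresholding proximal step along 
these lines (a minimal illustrative helper; the actual operators live in Breeze's 
proximal library):

{code}
// prox of lambda * ||x||_1: shrink each coordinate toward zero by lambda
def softThreshold(v: Array[Double], lambda: Double): Array[Double] =
  v.map(x => math.signum(x) * math.max(math.abs(x) - lambda, 0.0))
{code}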

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-05-19 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551294#comment-14551294
 ] 

Debasish Das commented on SPARK-6323:
-

The Petuum paper released today mentions going to larger topic sizes (~10-100K): 
http://arxiv.org/pdf/1412.1576v1.pdf

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
 likely we will re-use the Gradient interfaces already defined and implement 
 LoglikelihoodLoss
 2. Constraints g ( z ) supported are the same as above, except that we don't 
 support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications.
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.






[jira] [Commented] (SPARK-4823) rowSimilarities

2015-05-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547318#comment-14547318
 ] 

Debasish Das commented on SPARK-4823:
-

I opened up a PR that worked well for our datasets. It is still a brute-force 
computation, although we use a blocked cartesian product and user-defined kernels to 
cut computation and shuffle. There are straightforward ideas to go from BLAS-1 to 
BLAS-2 and BLAS-3 as more sparse operations are added to MLlib BLAS, although I don't 
think that will give us the runtime boost we are looking for.

We are looking into the approximate kNN family of algorithms to improve the runtime 
further. KD-trees are good for dense vectors with few features, but for sparse 
vectors in higher dimensions researchers did not find them useful.

LSH seems to be the most commonly used and that's the direction we are looking into. 
I looked into papers, and the one that showed good recall values in its experiments 
compared to brute-force kNN is Google Correlate, so that's the validation strategy we 
will focus on: https://www.google.com/trends/correlate/nnsearch.pdf. Please point to 
any other references that you deem fit. There are Twitter papers as well using LSH, 
and the implementation is available in Algebird. We will start with Algebird LSH, but 
ideally we don't want a distance metric hardcoded in LSH.

If we get good recall with the LSH-based method compared to the rowSimilarities code 
from the PR, we will use the LSH-based method to approximately compute similarities 
between dense/sparse rows using the cosine kernel, between the dense userFactor and 
productFactor from factorization using the product kernel, and between dense 
user/product factors using the cosine kernel.

The kernel abstraction is part of the current PR, and right now we support Cosine, 
Product, Euclidean and RBF kernels; Pearson is of interest but not added yet (an 
illustrative sketch follows). For approximate row similarity I will open up a 
separate JIRA.
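The kernel abstraction amounts to something like the following (the trait and object 
names here are assumptions, not the PR's actual types):

{code}
import breeze.linalg.{DenseVector, norm}

// pluggable similarity kernels over row vectors
trait RowKernel extends Serializable {
  def compute(x: DenseVector[Double], y: DenseVector[Double]): Double
}

object CosineKernel extends RowKernel {
  def compute(x: DenseVector[Double], y: DenseVector[Double]): Double = {
    val denom = norm(x) * norm(y)
    if (denom == 0.0) 0.0 else (x dot y) / denom
  }
}

class RbfKernel(gamma: Double) extends RowKernel {
  def compute(x: DenseVector[Double], y: DenseVector[Double]): Double = {
    val d = x - y
    math.exp(-gamma * (d dot d))
  }
}
{code}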

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.






[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-05-02 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-4231:

Affects Version/s: (was: 1.2.0)
   1.4.0

 Add RankingMetrics to examples.MovieLensALS
 ---

 Key: SPARK-4231
 URL: https://issues.apache.org/jira/browse/SPARK-4231
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.4.0
Reporter: Debasish Das
   Original Estimate: 24h
  Remaining Estimate: 24h

 examples.MovieLensALS computes RMSE for the MovieLens dataset, but after the 
 addition of RankingMetrics and enhancements to ALS, it is important to look at not 
 only the RMSE but also measures like prec@k and MAP.
 In this JIRA we added RMSE and MAP computation to examples.MovieLensALS and also 
 added a flag indicating whether user or product recommendations are being validated.
  






[jira] [Reopened] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-05-02 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das reopened SPARK-4231:
-

The code was not part of SPARK-3066, so I am reopening this.

 Add RankingMetrics to examples.MovieLensALS
 ---

 Key: SPARK-4231
 URL: https://issues.apache.org/jira/browse/SPARK-4231
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.2.0
Reporter: Debasish Das
   Original Estimate: 24h
  Remaining Estimate: 24h

 examples.MovieLensALS computes RMSE for the MovieLens dataset, but after the 
 addition of RankingMetrics and enhancements to ALS, it is important to look at not 
 only the RMSE but also measures like prec@k and MAP.
 In this JIRA we added RMSE and MAP computation to examples.MovieLensALS and also 
 added a flag indicating whether user or product recommendations are being validated.
  






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-25 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512843#comment-14512843
 ] 

Debasish Das commented on SPARK-5992:
-

Did someone compare the Algebird LSH with the Spark MinHash linked above? Unless 
Algebird is slow (which I found to be the case for the TopK monoid), should we use it 
the same way HLL is used in Spark Streaming? Is it OK to add Algebird as a dependency 
of MLlib?

 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.






[jira] [Commented] (SPARK-3987) NNLS generates incorrect result

2015-04-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646
 ] 

Debasish Das commented on SPARK-3987:
-

@mengxr For this test case it was fixed, but I remember there was someone on the user 
list who mentioned that he got an incorrect result compared to some other tool... 
maybe it's a good idea to ask for test cases.

 NNLS generates incorrect result
 ---

 Key: SPARK-3987
 URL: https://issues.apache.org/jira/browse/SPARK-3987
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Debasish Das
Assignee: Shuo Xiang
 Fix For: 1.1.1, 1.2.0


 Hi,
 Please see the example gram matrix and linear term:
 val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 
 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, 
 -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, 
 -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, 
 -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, 
 -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 
 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 
 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 
 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 
 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, 
 -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, 
 -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, 
 -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, 
 -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 
 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 
 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 
 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, 
 -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 
 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 
 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 
 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 
 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, 
 -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, 
 -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 
 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 
 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, 
 -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, 
 -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, 
 -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 
 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, 
 -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, 
 -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, 
 -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, 
 -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 
 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 
 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 
 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, 
 -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 
 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 
 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 
 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, 
 -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 
 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 
 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 
 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 
 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, 
 -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, 
 -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, 
 -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 
 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, 
 -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, 
 -5281.636812, 39298.355527, -3440.450858, 0.00, 13717.870243, 
 -8471.405582, 2071.812204, 53928.060360, -8283.992464, 33507.951918, 
 -24087.375367, -5408.929644, 4089.672804, -37103.267653, -33839.612565, 
 18537.392481, 7026.518692, 54636.778527, -57375.986301, -5281.636812, 
 9735.061160, -45360.674033, 10634.633559, 0.00, 

[jira] [Comment Edited] (SPARK-3987) NNLS generates incorrect result

2015-04-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646
 ] 

Debasish Das edited comment on SPARK-3987 at 4/8/15 3:31 AM:
-

[~mengxr] for this testcase it was fixed but I remember there was someone in 
user list who mentioned that he got incorrect result compared to some other 
tool...maybe it's a good idea to ask for testcases...


was (Author: debasish83):
@mengxr for this testcase it was fixed but I remember there was someone in user 
list who mentioned that he got incorrect result compared to some other 
tool...may be it's a good idea to ask for testcases...

 NNLS generates incorrect result
 ---

 Key: SPARK-3987
 URL: https://issues.apache.org/jira/browse/SPARK-3987
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Debasish Das
Assignee: Shuo Xiang
 Fix For: 1.1.1, 1.2.0


 Hi,
 Please see the example gram matrix and linear term:
 val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 
 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, 
 -225245.957922, 104700.445881, 32430.845099, 336378.693135, -373497.970207, 
 -41147.159621, 53928.060360, -293517.883778, 53105.278068, 0.00, 
 -85257.781696, 84913.970469, -10584.080103, -60814.043975, 13826.806664, 
 -38032.612640, 33475.833875, 10791.916809, -1040.950810, 48106.552472, 
 45390.073380, -16310.282190, -2861.455903, -60790.833191, 73109.516544, 
 9826.614644, -8283.992464, 56991.742991, -6171.366034, 0.00, 
 19152.382499, -13218.721710, 2793.734234, 207935.829941, -38032.612640, 
 129661.677608, -101682.098412, -27401.299347, 10787.713362, -151803.006149, 
 -140563.601672, 65067.935324, 20031.263383, 209521.268600, -232958.054688, 
 -25764.179034, 33507.951918, -183046.845592, 32884.782835, 0.00, 
 -53315.811196, 52770.762546, -6642.187643, -162881.367739, 33475.833875, 
 -101682.098412, 85094.407608, 25422.850782, -5437.646141, 124197.166330, 
 116206.265909, -47093.484134, -11420.168521, -163429.436848, 189574.783900, 
 23447.172314, -24087.375367, 148311.355507, -20848.385466, 0.00, 
 46835.814559, -38180.352878, 6415.873901, -43730.396770, 10791.916809, 
 -27401.299347, 25422.850782, 8882.869799, 15.638084, 35933.473986, 
 34186.371325, -10745.330690, -974.314375, -43537.709621, 54371.010558, 
 7894.453004, -5408.929644, 42231.381747, -3192.010574, 0.00, 
 15058.753110, -8704.757256, 2316.581535, 17511.428983, -1040.950810, 
 10787.713362, -5437.646141, 15.638084, 2794.949847, -9681.950987, 
 -8258.171646, 7754.358930, 4193.359412, 18052.143842, -15456.096769, 
 -253.356253, 4089.672804, -12524.380088, 5651.579348, 0.00, -1513.302547, 
 6296.461898, 152.427321, -243340.496449, 48106.552472, -151803.006149, 
 124197.166330, 35933.473986, -9681.950987, 182931.600236, 170454.352953, 
 -72361.174145, -19270.461728, -244518.179729, 279551.060579, 33340.452802, 
 -37103.267653, 219025.288975, -33687.141423, 0.00, 67347.950443, 
 -58673.009647, 8957.800259, -225245.957922, 45390.073380, -140563.601672, 
 116206.265909, 34186.371325, -8258.171646, 170454.352953, 159322.942894, 
 -66074.960534, -16839.743193, -226173.967766, 260421.044094, 31624.194003, 
 -33839.612565, 203889.695169, -30034.828909, 0.00, 63525.040745, 
 -53572.741748, 8575.071847, 104700.445881, -16310.282190, 65067.935324, 
 -47093.484134, -10745.330690, 7754.358930, -72361.174145, -66074.960534, 
 35869.598076, 13378.653317, 106033.647837, -111831.682883, -10455.465743, 
 18537.392481, -88370.612394, 20344.288488, 0.00, -22935.482766, 
 29004.543704, -2409.461759, 32430.845099, -2861.455903, 20031.263383, 
 -11420.168521, -974.314375, 4193.359412, -19270.461728, -16839.743193, 
 13378.653317, 6802.081898, 33256.395091, -30421.985199, -1296.785870, 
 7026.518692, -24443.378205, 9221.982599, 0.00, -4088.076871, 
 10861.014242, -25.092938, 336378.693135, -60790.833191, 209521.268600, 
 -163429.436848, -43537.709621, 18052.143842, -244518.179729, -226173.967766, 
 106033.647837, 33256.395091, 339200.268106, -375442.716811, -41027.594509, 
 54636.778527, -295133.248586, 54177.278365, 0.00, -85237.666701, 
 85996.957056, -10503.209968, -373497.970207, 73109.516544, -232958.054688, 
 189574.783900, 54371.010558, -15456.096769, 279551.060579, 260421.044094, 
 -111831.682883, -30421.985199, -375442.716811, 427793.208465, 50528.074431, 
 -57375.986301, 335203.382015, -52676.385869, 0.00, 102368.307670, 
 -90679.792485, 13509.390393, -41147.159621, 9826.614644, -25764.179034, 
 23447.172314, 7894.453004, -253.356253, 33340.452802, 31624.194003, 
 -10455.465743, -1296.785870, -41027.594509, 50528.074431, 7255.977434, 
 -5281.636812, 39298.355527, -3440.450858, 0.00, 

[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-31 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388128#comment-14388128
 ] 

Debasish Das commented on SPARK-5564:
-

[~sparks] we are trying to access the EC2 dataset but it is giving an error:

[ec2-user@ip-172-31-38-56 ~]$ aws s3 ls 
s3://files.sparks.requester.pays/enwiki_category_text/

A client error (AccessDenied) occurred when calling the ListObjects operation: 
Access Denied

Could you please take a look if it is still available for use ?

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.
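
As a rough illustration of what such an M-step projection could look like, here is a sketch of the standard sort-based Euclidean projection onto the probability simplex; the names are illustrative and this is not taken from the cited paper or from MLlib:

{code:scala}
// Sketch: project a vector onto the probability simplex (entries >= 0, sum = 1)
// using the usual sort-and-threshold algorithm.
object SimplexProjection {
  def project(v: Array[Double]): Array[Double] = {
    val u = v.sorted(Ordering[Double].reverse)       // sort descending
    val cssv = u.scanLeft(0.0)(_ + _).tail           // cumulative sums of u
    // rho = largest j (1-based) with u(j-1) - (cssv(j-1) - 1) / j > 0
    val rho = (1 to u.length).reverse
      .find(j => u(j - 1) - (cssv(j - 1) - 1.0) / j > 0.0)
      .getOrElse(1)
    val theta = (cssv(rho - 1) - 1.0) / rho
    v.map(x => math.max(x - theta, 0.0))
  }

  def main(args: Array[String]): Unit = {
    // Result is nonnegative and sums to 1: here (0.0, 1.0, 0.0).
    println(project(Array(0.5, 2.0, -0.3)).mkString(", "))
  }
}
{code}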



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-31 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973
 ] 

Debasish Das edited comment on SPARK-3066 at 4/1/15 4:28 AM:
-

Also unless the raw flow runs there is no way to validate how good a LSH based 
flow is doing...I updated the PR today with [~mengxr] reviews...I am working on 
level 3 BLAS routines for item-item similarity calculation from matrix factors 
and the same optimization can be applied here...I will open up the PR for that 
in coming weeks...we already have a JIRA for rowSimilarities...


was (Author: debasish83):
Also unless the raw flow runs there is no way to validate how good a LSH based 
flow is doing since users...I updated the PR today with [~mengxr] reviews...I 
am working on level 3 BLAS routines for item-item similarity calculation from 
matrix factors and the same optimization can be applied here...I will open up 
the PR for that in coming weeks...we already have a JIRA for rowSimilarities...

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Debasish Das

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k
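
A minimal sketch of steps 1-3 using Breeze for the level-3 GEMM (Breeze on the classpath is assumed). In Spark the item-factor matrix would be broadcast and this would run inside mapPartitions over blocks of user factors; the local sort below is only a stand-in for Utils.takeOrdered, and all names are illustrative:

{code:scala}
import breeze.linalg.DenseMatrix

object RecommendAllSketch {
  // userBlock: usersInBlock x rank, itemFactors: numItems x rank
  def topKForBlock(userBlock: DenseMatrix[Double],
                   itemFactors: DenseMatrix[Double],
                   k: Int): Array[Array[(Int, Double)]] = {
    // Level-3 BLAS: all user-item scores for this block in one GEMM.
    val scores = userBlock * itemFactors.t   // usersInBlock x numItems
    Array.tabulate(scores.rows) { u =>
      val row = (0 until scores.cols).map(i => (i, scores(u, i)))
      row.sortBy(-_._2).take(k).toArray      // stand-in for Utils.takeOrdered
    }
  }

  def main(args: Array[String]): Unit = {
    val users = DenseMatrix((1.0, 0.0), (0.0, 1.0))             // 2 users, rank 2
    val items = DenseMatrix((0.9, 0.1), (0.2, 0.8), (0.5, 0.5)) // 3 items
    topKForBlock(users, items, k = 2).zipWithIndex.foreach { case (rec, u) =>
      println(s"user $u -> ${rec.mkString(", ")}")
    }
  }
}
{code}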



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-31 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973
 ] 

Debasish Das commented on SPARK-3066:
-

Also unless the raw flow runs there is no way to validate how good a LSH based 
flow is doing since users...I updated the PR today with [~mengxr] reviews...I 
am working on level 3 BLAS routines for item-item similarity calculation from 
matrix factors and the same optimization can be applied here...I will open up 
the PR for that in coming weeks...we already have a JIRA for rowSimilarities...

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Debasish Das

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180
 ] 

Debasish Das commented on SPARK-5564:
-

Cool...I will run my experiments on the same dataset as well and report 
results...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180
 ] 

Debasish Das edited comment on SPARK-5564 at 3/30/15 6:52 PM:
--

Cool...I will run my experiments on the same dataset as well and report 
results...By the way my plan is to run 1000 sparse topics here...K will be 1000 
but sparse and so we never shuffle more than 100 sparse vectors...For sparsity 
experiments did you also add something specific ?


was (Author: debasish83):
Cool...I will run my experiments on the same dataset as well and report 
results...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das commented on SPARK-5564:
-

[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases from your LDA JIRA...For 
recommendation, I know how to construct the testcases...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM:
---

[~josephkb] could you please point me to the datasets that are used for 
benchmarking LDA and how do they scale as we start scaling the topics? I have 
started testing loglikelihood loss for recommendation and since I already added 
the constraints, this is the right time to test it on LDA benchmarks as 
well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss


was (Author: debasish83):
[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049
 ] 

Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM:
---

[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases based on document and word 
matrix...For recommendation, I know how to construct the testcases with 
loglikelihood loss


was (Author: debasish83):
[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing loglikelihood loss for recommendation and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears 
it...

I am looking into LDA test-cases but since I am optimizing log-likelihood 
directly, I am looking to add more testcases from your LDA JIRA...For 
recommendation, I know how to construct the testcases...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0. It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-28 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-2426:

Affects Version/s: (was: 1.3.0)
   1.4.0

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS
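
For illustration, here is a toy projected-gradient solver for the bounded subproblem min 0.5 x'Hx + c'x s.t. lb <= x <= ub, the kind of per-user/per-item solve that would plug into ALS. It is only a sketch under those assumptions (fixed step size from the trace of H), not the actual QuadraticMinimizer or Breeze code:

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

object BoxQpSketch {
  def solve(H: DenseMatrix[Double], c: DenseVector[Double],
            lb: Double, ub: Double, iters: Int = 500): DenseVector[Double] = {
    // Safe step size: 1 / trace(H) <= 1 / lambda_max(H) for PSD H.
    val step = 1.0 / (0 until H.rows).map(i => H(i, i)).sum
    var x = DenseVector.zeros[Double](c.length)
    for (_ <- 0 until iters) {
      val grad = H * x + c                                       // gradient of 0.5 x'Hx + c'x
      x = (x - grad * step).map(v => math.min(ub, math.max(lb, v))) // project onto the box
    }
    x
  }

  def main(args: Array[String]): Unit = {
    val H = DenseMatrix((2.0, 0.0), (0.0, 2.0))
    val c = DenseVector(-2.0, 4.0)            // unconstrained optimum is (1, -2)
    println(solve(H, c, lb = 0.0, ub = 1.0))  // box-constrained optimum ~ (1.0, 0.0)
  }
}
{code}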



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM:
--

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.", could you please point 
me to a good reference with application to collaborative filtering/topic 
modeling ? Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221 Isn't there a log term that 
multiplies with R to make it a KL divergence loss ? Maybe the log term can be 
removed under non-negative and normalization constraints ? @mengxr any ideas 
here ? If we can do that we can target KL divergence loss from Lee's paper: 
http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS. MAP loss 
looks like a strong contender to LDA and can natively handle counts (does not 
need regression style datasets which is difficult to get in practical setup 
where people normally don't give any rating and satisfaction should be inferred 
from viewing time etc)


was (Author: debasish83):
[~acopich] From your comment before Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
., could you please point me to a good reference with application to 
collaborative filtering/topic modeling ? Stochastic matrix decomposition is 
what we can do in this PR now https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS. MAP loss 
looks like a strong contender to LDA and can natively handle counts (does not 
need regression style datasets which is difficult to get in practical setup 
where people normally don't give any rating and satisfaction should be infered 
from viewing time etc)

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM:
--

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.", could you please point 
me to a good reference with application to collaborative filtering/topic 
modeling ? Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221 Isn't there a log term that 
multiplies with R to make it a KL divergence loss ? Maybe the log term can be 
removed under non-negative and normalization constraints ? @mengxr any ideas 
here ? If we can do that we can target KL divergence loss 
from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS. MAP loss 
looks like a strong contender to LDA and can natively handle counts (does not 
need regression style datasets which is difficult to get in practical setup 
where people normally don't give any rating and satisfaction should be inferred 
from viewing time etc)


was (Author: debasish83):
[~acopich] From your comment before Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
., could you please point me to a good reference with application to 
collaborative filtering/topic modeling ? Stochastic matrix decomposition is 
what we can do in this PR now https://github.com/apache/spark/pull/3221Is 
not there is log term that multiplies with R to make it a KL divergence loss ? 
May be the log term can removed under non-negative and normalization 
constraints ? @mengxr any ideas here ? If we can do that we can target KL 
divergence loss from Lee's paper: 
http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS. MAP loss 
looks like a strong contender to LDA and can natively handle counts (does not 
need regression style datasets which is difficult to get in practical setup 
where people normally don't give any rating and satisfaction should be infered 
from viewing time etc)

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378062#comment-14378062
 ] 

Debasish Das commented on SPARK-6323:
-

I did some more reading and realized that even for small ranks, the current 
least square approach won't be able to handle both KL divergence loss (unless 
there is an approximation that I am missing, more discussions on 
https://issues.apache.org/jira/browse/SPARK-2426) and PLSA formulation (KL 
divergence loss with additional constraints)...Even for collaborative filtering 
with small ranks, this code will be useful...

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
 likely we will re-use the Gradient interfaces already defined and implement 
 LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.
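
To make the min f ( x ) + g ( z ) split concrete, here is a toy proximal-gradient (ISTA) sketch for least squares plus an L1 term, which also shows how an L1 g keeps factors sparse in the spirit of the 100-sparse factors above. It is only a stand-in written for this note, not breeze.optimize.proximal.NonlinearMinimizer:

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

object ProximalGradientSketch {
  // Proximal operator of t * ||.||_1 (soft-thresholding).
  private def softThreshold(x: DenseVector[Double], t: Double): DenseVector[Double] =
    x.map(v => math.signum(v) * math.max(math.abs(v) - t, 0.0))

  def solve(A: DenseMatrix[Double], b: DenseVector[Double],
            lambda: Double, step: Double, iters: Int = 1000): DenseVector[Double] = {
    var x = DenseVector.zeros[Double](A.cols)
    for (_ <- 0 until iters) {
      val grad = A.t * (A * x - b)                  // gradient of f = 0.5 ||Ax - b||^2
      x = softThreshold(x - grad * step, lambda * step) // prox step for g = lambda ||x||_1
    }
    x                                               // many entries end up exactly 0
  }

  def main(args: Array[String]): Unit = {
    val A = DenseMatrix((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    val b = DenseVector(3.0, 0.05, -2.0)
    println(solve(A, b, lambda = 0.1, step = 0.5))  // the small coefficient is zeroed out
  }
}
{code}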



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das commented on SPARK-2426:
-

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
||.|| stands for Frobenius norm (or l1).", could you please point me to a good 
reference with application to collaborative filtering/topic modeling ? 
Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM:
--

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
|| . || stands for Frobenius norm (or l1).", could you please point me to a 
good reference with application to collaborative filtering/topic modeling ? 
Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS


was (Author: debasish83):
[~acopich] From your comment before Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
||.|| stands for Frobenius norm (or l1)., could you please point me to a good 
reference with application to collaborative filtering/topic modeling ? 
Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM:
--

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.", could you please point 
me to a good reference with application to collaborative filtering/topic 
modeling ? Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS


was (Author: debasish83):
[~acopich] From your comment before Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
|| . || stands for Frobenius norm (or l1)., could you please point me to a 
good reference with application to collaborative filtering/topic modeling ? 
Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-24 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357
 ] 

Debasish Das edited comment on SPARK-2426 at 3/24/15 6:13 AM:
--

[~acopich] From your comment before "Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.", could you please point 
me to a good reference with application to collaborative filtering/topic 
modeling ? Stochastic matrix decomposition is what we can do in this PR now 
https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS. MAP loss 
looks like a strong contender to LDA and can natively handle counts (does not 
need regression style datasets which is difficult to get in practical setup 
where people normally don't give any rating and satisfaction should be inferred 
from viewing time etc)


was (Author: debasish83):
[~acopich] From your comment before Anyway, l2 regularized stochastic matrix 
decomposition problem is defined as follows
Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints.
., could you please point me to a good reference with application to 
collaborative filtering/topic modeling ? Stochastic matrix decomposition is 
what we can do in this PR now https://github.com/apache/spark/pull/3221

For MAP loss, I will open up a PR in a week through JIRA 
https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much 
slower we get compared to stochastic matrix decomposition using ALS

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS

2015-03-23 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046
 ] 

Debasish Das commented on SPARK-3735:
-

We might want to consider doing some of these things through an indexed RDD 
exposed through an API...right now ALS is completely join-based...can we do 
something nicer if we have access to an efficient read-only cache from ALS 
mapPartitions...The idea here is to think about zeros explicitly and not add the 
implicit heuristic, which is generally hard to tune... 

 Sending the factor directly or AtA based on the cost in ALS
 ---

 Key: SPARK-3735
 URL: https://issues.apache.org/jira/browse/SPARK-3735
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 It is common to have some super popular products in the dataset. In this 
 case, sending many user factors to the target product block could be more 
 expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i 
 r_ij` to the product block. The cost of sending a single factor is `k`, while 
 the cost of sending a normal equation is much more expensive, `k * (k + 3) / 
 2`. However, if we use normal equation for all products associated with a 
 user, we don't need to send this user factor.
 Determining the optimal assignment is hard. But we could use a simple 
 heuristic. Inside any rating block,
 1) order the product ids by the number of user ids associated with them in 
 desc order
 2) starting from the most popular product, mark popular products as use 
 normal eq and calculate the cost
 Remember the best assignment that comes with the lowest cost and use it for 
 computation.
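
A rough sketch of the cost comparison behind this heuristic, using the per-product costs k and k * (k + 3) / 2 from the description; the names and data structures are made up for this example and this is not the actual block-assignment code:

{code:scala}
object FactorVsNormalEqCost {
  // usersPerProduct(p) = distinct users in this rating block that rated product p.
  // normalEqProducts   = products for which we send a normal equation instead of user factors.
  def cost(usersPerProduct: Map[Int, Set[Int]], normalEqProducts: Set[Int], k: Int): Long = {
    val normalEqCost = normalEqProducts.size.toLong * (k.toLong * (k + 3) / 2)
    // A user factor must still be sent if the user rated at least one product
    // that is NOT covered by a normal equation.
    val usersNeedingFactor = usersPerProduct
      .filter { case (p, _) => !normalEqProducts.contains(p) }
      .values
      .foldLeft(Set.empty[Int])(_ ++ _)
    normalEqCost + usersNeedingFactor.size.toLong * k
  }

  def main(args: Array[String]): Unit = {
    val k = 10
    // One very popular product (0) and two niche ones.
    val block = Map(0 -> (1 to 100).toSet, 1 -> Set(1, 2), 2 -> Set(3))
    println(cost(block, normalEqProducts = Set.empty, k)) // 1000: send every user factor
    println(cost(block, normalEqProducts = Set(0), k))    // 95: normal eq for the popular product
  }
}
{code}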



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375325#comment-14375325
 ] 

Debasish Das commented on SPARK-2426:
-

[~acopich] "There's a completely different loss... BTW, we've used a 
factorisation with the loss you've described as an initial approximation for 
PLSA. It gave a significant speed-up." Could you help add some testcases and a 
driver for the PLSA approximation ? The PR 
https://github.com/apache/spark/pull/3221 now has the LSA constraints and least 
square loss...

Idea here is to do probability simplex on user side, bounds on the item side 
and normalization on item columns at each ALS iteration...The MAP loss is 
tracked through https://issues.apache.org/jira/browse/SPARK-6323 but the solve 
idea will be very similar as I mentioned before and so we can re-use the flow 
test-cases...We can discuss more on the PR...It will be great if you can help 
add examples.mllib.PLSA as well that will drive both PLSA through ALS and ALM 
(alternating MAP loss optimization)...

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-22 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most 
likely we will re-use the Gradient interfaces already defined and implement 
LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and we will keep sparsity in check, the same 2 billion flow 
will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and we will keep sparsity in check, the same 2 billion flow 
will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-16 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956
 ] 

Debasish Das edited comment on SPARK-6323 at 3/16/15 6:30 PM:
--

g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 

2. The original PLSA paper from Hofmann et al.

3. Collaborative filtering using PLSA from Hofmann et al. Latent Semantic 
Models for Collaborative Filtering

4. Industry specific application: 
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large rank matrix factorization the requirements also come from sparse 
topics, which can now easily range to ~10K...

The idea can be implemented in the Sparse LDA JIRA as well 
https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he 
thinks we should do it in the LDA framework, but I don't think we know yet which 
flow will scale better as the data moves from sparse to dense.

With the factorization flow I will start to see results next week. These are 
the algorithm steps:

1. minimize f(w,h*)
s.t. 1'w = 1, w >= 0 (row constraints)

2. minimize f(w*,h)
s.t. 0 <= h <= 1

3. Normalize each column in h (a small local-matrix sketch of this step 
follows the note below)

Note that 2 and 3 are an approximation to the original matrix formulation, but 
the column normalization makes the factors probabilistically well defined for 
the next PLSA iteration.
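
A small local-matrix sketch of that normalization step; a Breeze DenseMatrix 
stands in for the distributed factor blocks purely for illustration, and the 
helper name is hypothetical:

{code:scala}
import breeze.linalg._

// Step 3: rescale every column of the item factor matrix h so its entries sum
// to 1, keeping the factors probabilistically well defined for the next PLSA
// iteration.
def normalizeColumns(h: DenseMatrix[Double]): DenseMatrix[Double] = {
  val out = h.copy
  for (j <- 0 until out.cols) {
    val colSum = sum(out(::, j))
    if (colSum > 0.0) out(::, j) := out(::, j) / colSum   // divide the column by its sum
  }
  out
}
{code}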

f(w,h*) is loglikelihood loss from PLSA paper.

I will start to look into the graphx based flow after that because, in general, 
that flow makes more sense for distributed nets, where the objective is no longer 
separable the way it is in ALS-WR / PLSA.


was (Author: debasish83):
g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 

2. The original PLSA paper from Hoffman et al.

3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic 
Models for Collaborative Filtering

4. Industry specific application: 
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large rank matrix factorization the requirement also come from sparse 
topics now which can easily range in ~ 10K...

The idea can be implemented in the Sparse LDA JIRA as well 
https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he 
thinks we should do it in LDA framework but I don't think we know which flow 
will scale better yet as the data moves from sparse from dense.

With the factorization flow I will start to see results next week as the flow 
is designed to handle these ideas. I will start to look into graphx based flow 
after that. 

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( 

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-15 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956
 ] 

Debasish Das edited comment on SPARK-6323 at 3/15/15 4:29 PM:
--

g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 

2. The original PLSA paper from Hoffman et al.

3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic 
Models for Collaborative Filtering

4. Industry specific application: 
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large rank matrix factorization the requirement also come from sparse 
topics now which can easily range in ~ 10K...

The idea can be implemented in the Sparse LDA JIRA as well 
https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he 
thinks we should do it in LDA framework but I don't think we know which flow 
will scale better yet as the data moves from sparse from dense.

With the factorization flow I will start to see results next week as the flow 
is designed to handle these ideas. I will start to look into graphx based flow 
after that. 


was (Author: debasish83):
g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 

2. The original PLSA paper from Hoffman et al.

3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic 
Models for Collaborative Filtering

4. Industry specific application: 
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large rank matrix factorization the requirement also come from sparse 
topics now which can easily range in ~ 10K...

The idea can be implemented in the Sparse LDA JIRA as well 
https://issues.apache.org/jira/browse/SPARK-5564 and I asked Joseph if he 
thinks we should do it in LDA framework but I don't think we know which flow 
will scale better yet. 

With the factorization flow I will start to see results next week as the flow 
is designed to handle these ideas. I will start to look into graphx based flow 
after that. 

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver 

[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-15 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956
 ] 

Debasish Das edited comment on SPARK-6323 at 3/15/15 4:26 PM:
--

g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 

2. The original PLSA paper from Hoffman et al.

3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic 
Models for Collaborative Filtering

4. Industry specific application: 
http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818

For large rank matrix factorization the requirement also come from sparse 
topics now which can easily range in ~ 10K...

The idea can be implemented in the Sparse LDA JIRA as well 
https://issues.apache.org/jira/browse/SPARK-5564 and I asked Joseph if he 
thinks we should do it in LDA framework but I don't think we know which flow 
will scale better yet. 

With the factorization flow I will start to see results next week as the flow 
is designed to handle these ideas. I will start to look into graphx based flow 
after that. 


was (Author: debasish83):
g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 
2. The original PLSA paper from Hoffman et al.

For large rank matrix factorization I think the requirements come from sparse 
topics now which can easily range in ~ 10K...


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion 

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-14 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361981#comment-14361981
 ] 

Debasish Das commented on SPARK-6323:
-

By the way I can close the JIRA if it is not related to broader interest of the 
community...

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956
 ] 

Debasish Das commented on SPARK-6323:
-

g(z) is not regularization...we support constraints like z >= 0; lb <= z <= 
ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported 
through QuadraticMinimizer for 2426. I already migrated ALS to use 
QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze 
release.

I call it z since we are using splitting algorithms here for the solve 
(projection based or admm + proximal)...

Sure for papers on global objective refer to any PLSA paper with matrix 
factorization. I personally like these 2 and I am focused on them:

1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization Equation (2) and (3) 
2. The original PLSA paper from Hoffman et al.

For large rank matrix factorization I think the requirements come from sparse 
topics now which can easily range in ~ 10K...


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361005#comment-14361005
 ] 

Debasish Das edited comment on SPARK-6323 at 3/13/15 7:48 PM:
--

There are some other interesting cases for large rank non-convex functions, but 
we will come to them once PLSA using factorization is fixed. But yes, in all these 
formulations things break up like ALS, and that's why we can distribute the solve 
to spark workers...If the objective function does not break up that way, like a 
neural net (which is the natural extension of ALS), then we need parameter server 
type ideas for the solver...


was (Author: debasish83):
There are some other interesting cases for large rank non-convex function but 
we will come to it once fixing PLSA using factorization...

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361005#comment-14361005
 ] 

Debasish Das commented on SPARK-6323:
-

There are some other interesting cases for large rank non-convex function but 
we will come to it once fixing PLSA using factorization...

 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: min f ( x ) + g ( z )
 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
 HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
 and implement LoglikelihoodLoss
 2. Constraints g ( z ) supported are same as above except that we don't 
 support affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we 
 don't need that for ML applications
 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which 
 in turn uses projection based solver (SPG) or proximal solvers (ADMM) based 
 on convergence speed.
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala
 4. The factors will be SparseVector so that we keep shuffle size in check. 
 For example we will run with 10K ranks but we will force factors to be 
 100-sparse.
 This is closely related to Sparse LDA 
 https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
 are not using graph representation here.
 As we do scaling experiments, we will understand which flow is more suited as 
 ratings get denser (my understanding is that since we already scaled ALS to 2 
 billion ratings and we will keep sparsity in check, the same 2 billion flow 
 will scale to 10K ranks as well)...
 This JIRA is intended to extend the capabilities of ml recommendation to 
 generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f (x) + g (z) 

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine constraint Aeq x = beq , lb <= x <= ub yet. But most likely we don't 
need that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following problems: 

[jira] [Created] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)
Debasish Das created SPARK-6323:
---

 Summary: Large rank matrix factorization with Nonlinear loss and 
constraints
 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0


Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z) 

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine constraint Aeq x = beq , lb <= x <= ub yet. But most likely we don't 
need that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and 

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal 

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f (x) + g (z) 

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine constraint Aeq x = beq , lb <= x <= ub yet. But most likely we don't 
need that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z) 

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine constraint Aeq x = beq , lb <= x <= ub yet. But most likely we don't 
need that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand the underlying architecture.

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
 In this PR we will re-use ml.recommendation.ALS design and come up with 
 ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
 changes, it's straightforward to do it now !
 ALM will be capable of solving the following 

[jira] [Resolved] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das resolved SPARK-4231.
-
Resolution: Duplicate

 Add RankingMetrics to examples.MovieLensALS
 ---

 Key: SPARK-4231
 URL: https://issues.apache.org/jira/browse/SPARK-4231
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.2.0
Reporter: Debasish Das
   Original Estimate: 24h
  Remaining Estimate: 24h

 examples.MovieLensALS computes RMSE for movielens dataset but after addition 
 of RankingMetrics and enhancements to ALS, it is critical to look at not only 
 the RMSE but also measures like prec@k and MAP.
 In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and 
 also added a flag that takes an input whether user/product recommendation is 
 being validated.
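
 For reference, a hedged sketch of how those ranking measures can be computed 
 with MLlib's RankingMetrics; the RDD of per-user (recommended, relevant) 
 product arrays is assumed to be built elsewhere from the ALS output, and the 
 names here are illustrative:

{code:scala}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// Each element pairs a user's ranked product recommendations with the products
// that user actually liked (the ground truth).
def rankingScores(recommendedAndRelevant: RDD[(Array[Int], Array[Int])]): (Double, Double) = {
  val metrics = new RankingMetrics(recommendedAndRelevant)
  (metrics.precisionAt(10), metrics.meanAveragePrecision)   // (prec@10, MAP)
}
{code}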
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq , lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.
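
To make the per-block subproblem concrete, here is a small Breeze-only sketch 
(illustrative names, not the proposed ml.recommendation.ALM API): the smooth 
quadratic part, soft thresholding as one example of a proximal step for g(z), 
and a crude truncation that keeps a factor roughly 100-sparse.
{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

// Smooth part of the subproblem: 0.5 x'Hx + c'x
def quadObjective(h: DenseMatrix[Double], c: DenseVector[Double],
                  x: DenseVector[Double]): Double =
  0.5 * (x dot (h * x)) + (c dot x)

// One example of g(z): L1, whose proximal operator is soft thresholding
def proxL1(x: DenseVector[Double], t: Double): DenseVector[Double] =
  x.map(v => math.signum(v) * math.max(math.abs(v) - t, 0.0))

// Force a factor to stay roughly `sparsity`-sparse (ties may keep a few extra entries)
def keepTopK(x: DenseVector[Double], sparsity: Int): DenseVector[Double] = {
  val sortedAbs = x.toArray.map(math.abs).sorted(Ordering[Double].reverse)
  val threshold =
    if (sparsity > 0 && sparsity <= sortedAbs.length) sortedAbs(sparsity - 1) else 0.0
  x.map(v => if (math.abs(v) >= threshold) v else 0.0)
}
{code}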

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds yet Aeq x = beq, lb <= x <= ub yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of Spark's collaborative 
filtering toolkit to generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal 

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent 
changes, it's straightforward to do now!

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
only scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. 
Most likely we will re-use the Gradient interfaces already defined and 
implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds yet Aeq x = beq, lb <= x <= ub yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 only scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 

[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2015-03-13 Thread Debasish Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Debasish Das updated SPARK-6323:

Description: 
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent 
changes, it's straightforward to do now!

ALM will be capable of solving the following problems: min f(x) + g(z)

1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g(z) supported are same as above except that we don't support 
affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and we will keep sparsity in check, the same 2 billion flow 
will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.

  was:
Currently ml.recommendation.ALS is optimized for gram matrix generation which 
scales to modest ranks. The problems that we can solve are in the normal 
equation/quadratic form: 0.5x'Hx + c'x + g(z)

g(z) can be one of the constraints from Breeze proximal library:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala

In this PR we will re-use ml.recommendation.ALS design and come up with 
ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent 
changes, it's straightforward to do it now !

ALM will be capable of solving the following problems: min f ( x ) + g ( z )

1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and 
HingeLoss. Most likely we will re-use the Gradient interfaces already defined 
and implement LoglikelihoodLoss

2. Constraints g ( z ) supported are same as above except that we don't support 
affine + bounds yet Aeq x = beq, lb <= x <= ub yet. Most likely we don't need 
that for ML applications

3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in 
turn uses projection based solver (SPG) or proximal solvers (ADMM) based on 
convergence speed.

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala

4. The factors will be SparseVector so that we keep shuffle size in check. For 
example we will run with 10K ranks but we will force factors to be 100-sparse.

This is closely related to Sparse LDA 
https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we 
are not using graph representation here.

As we do scaling experiments, we will understand which flow is more suited as 
ratings get denser (my understanding is that since we already scaled ALS to 2 
billion ratings and since we will keep sparsity in check, the same 2 billion 
flow will scale to 10K ranks as well)...

This JIRA is intended to extend the capabilities of ml recommendation to 
generalized loss function.


 Large rank matrix factorization with Nonlinear loss and constraints
 ---

 Key: SPARK-6323
 URL: https://issues.apache.org/jira/browse/SPARK-6323
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Debasish Das
 Fix For: 1.4.0

   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently ml.recommendation.ALS is optimized for gram matrix generation which 
 scales to modest ranks. The problems that we can solve are in the normal 
 equation/quadratic form: 0.5x'Hx + c'x + g(z)
 g(z) can be one of the constraints from Breeze proximal library:
 

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359892#comment-14359892
 ] 

Debasish Das commented on SPARK-3066:
-

We use the non-level 3 BLAS code in our internal flows with ~ 60M x 3M 
datasets...Runtime is decent...I am moving to level 3 BLAS for SPARK-4823 and I think 
the speed will improve further 

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Debasish Das

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k
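 Roughly, the blocked computation could look like the local sketch below (Breeze
 only, names assumed): the broadcast side held as a dense matrix, one level-3
 multiply per user block, then top-k per row; Utils.takeOrdered would replace
 the sortBy in the real code.
{code:scala}
import breeze.linalg.DenseMatrix

// userBlock:      numUsers x rank (one partition of the non-broadcast side)
// productFactors: numProducts x rank (the broadcast side, collected as a matrix)
def topKPerUser(userBlock: DenseMatrix[Double],
                productFactors: DenseMatrix[Double],
                k: Int): Array[Array[(Int, Double)]] = {
  val scores = userBlock * productFactors.t   // one BLAS-3 gemm per block
  Array.tabulate(scores.rows) { i =>
    val row = Array.tabulate(scores.cols)(j => (j, scores(i, j)))
    row.sortBy { case (_, s) => -s }.take(k)  // stand-in for Utils.takeOrdered
  }
}
{code}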



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351839#comment-14351839
 ] 

Debasish Das commented on SPARK-2426:
-

[~mengxr] NNLS and QuadraticMinimizer are both merged to Breeze...I will 
migrate ml.recommendation.ALS accordingly...

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351945#comment-14351945
 ] 

Debasish Das commented on SPARK-3066:
-

[~josephkb] do you mean knn? For recommendation, until you do the dot product I 
am not sure how you can find topk...level 3 BLAS will definitely give a big boost 
since it's all blocked dense with dense multiplication...For 
https://issues.apache.org/jira/browse/SPARK-4823 I am looking into dense-dense 
BLAS and dense-sparse BLAS...ideally there we can add in a knn based 
optimization followed by row similarity calculation 

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Debasish Das

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4823) rowSimilarities

2015-03-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948
 ] 

Debasish Das commented on SPARK-4823:
-

[~mengxr] I need level 3 BLAS for this JIRA as well as 
https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking 
for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze 
CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some 
code with breeze dot and it was extremely slow...I will migrate the code to 
netlib java BLAS from mllib and update the results on the JIRA...

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4823) rowSimilarities

2015-03-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948
 ] 

Debasish Das edited comment on SPARK-4823 at 3/8/15 6:42 AM:
-

[~mengxr] I need level 3 BLAS for this JIRA as well as 
https://issues.apache.org/jira/browse/SPARK-4675. Specifically I am looking for 
dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze 
CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some 
code with breeze dot and it was extremely slow...I will migrate the code to 
netlib java BLAS from mllib and update the results on the JIRA...


was (Author: debasish83):
[~mengxr] I need level 3 BLAS for this JIRA as well as 
https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking 
for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze 
CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some 
code with breeze dot and it was extremely slow...I will migrate the code to 
netlib java BLAS from mllib and update the results on the JIRA...

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311
 ] 

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:41 PM:
-

I am right now using the following PR to do large rank matrix factorization 
with various constraints...https://github.com/scalanlp/breeze/pull/364

I am not sure if the current ALS will scale to large ranks (we want to go ~ 10K 
range) and so I am keen to compare the exact formulation in graphx based LDA 
flow...

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:

minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good 
improvement in MAP statistics for recommendation workloads...Here also I expect 
Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

In fact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA flow and add both the least square and MAP losses...

I am assuming here that the LDA architecture is a bipartite graph with nodes as 
docs/words and there are counts on each edge...The solver will be run once on 
every node of each partition after it collects the ratings and factors from 
its edges...
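
For the row constraint 1'w = 1, w >= 0, one standard building block is Euclidean 
projection onto the probability simplex (Duchi et al.); a plain-Scala sketch of 
that projection is below. This is only illustrative and is not the proximal 
operator used in the Breeze PR above.
{code:scala}
// Project v onto { w : sum(w) = 1, w >= 0 } (sort-based simplex projection)
def projectSimplex(v: Array[Double]): Array[Double] = {
  val u = v.sorted(Ordering[Double].reverse)
  val cumsum = u.scanLeft(0.0)(_ + _).tail
  // rho = largest 1-based j with u(j) - (cumsum(j) - 1) / j > 0 (always holds for j = 1)
  val rho = (1 to u.length).filter(j => u(j - 1) - (cumsum(j - 1) - 1.0) / j > 0).max
  val theta = (cumsum(rho - 1) - 1.0) / rho
  v.map(x => math.max(x - theta, 0.0))
}
{code}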


was (Author: debasish83):
I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I am keen to compare the exact formulation in graphx based LDA flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

Infact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA flow and add both the least square and MAP losses...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311
 ] 

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM:
-

I am right now using the following PR to do large rank matrix factorization 
with various constraints...https://github.com/scalanlp/breeze/pull/364

I am not sure if the current ALS will scale to large ranks (we want to go ~ 10K 
range sparse) and so I am keen to compare the exact formulation in graphx based 
LDA flow...

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:

minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, 

Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good 
improvement in MAP statistics for recommendation workloads...Here also I expect 
Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

In fact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA flow and add both the least square and MAP losses...

I am assuming here that LDA architecture is a bipartite graph with nodes as 
docs/words and there are counts on each edge...The solver will be run once on 
every node of each partition after it collects the ratings and factors from 
it's edges..


was (Author: debasish83):
I am right now using the following PR to do large rank matrix factorization 
with various constraints...https://github.com/scalanlp/breeze/pull/364

I am not sure if the current ALS will scale to large ranks (we want to go ~ 10K 
range) and so I am keen to compare the exact formulation in graphx based LDA 
flow...

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:

minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good 
improvement in MAP statistics for recommendation workloads...Here also I expect 
Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

Infact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA flow and add both the least square and MAP losses...

I am assuming here that LDA architecture is a bipartite graph with nodes as 
docs/words and there are counts on each edge...The solver will be run once on 
every node of each partition after it collects the ratings and factors from 
it's edges..

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311
 ] 

Debasish Das commented on SPARK-5564:
-

I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I am keen to compare the exact formulation in graphx based LDA 
flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

In fact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA and add both the least square and MAP losses...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311
 ] 

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:20 PM:
-

I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I am keen to compare the exact formulation in graphx based LDA flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

In fact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA flow and add both the least square and MAP losses...


was (Author: debasish83):
I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I am keen to compare the exact formulation in graphx based LDA flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

Infact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA and add both the least square and MAP losses...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311
 ] 

Debasish Das edited comment on SPARK-5564 at 3/1/15 4:19 PM:
-

I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I am keen to compare the exact formulation in graphx based LDA flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

In fact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA and add both the least square and MAP losses...


was (Author: debasish83):
I am right now using the following PR to do large rank matrix factorization 
with various constraints...I am not sure if the current ALS will scale to large 
ranks but I will keen to compare the exact formulation in graphx based LDA 
flow...

https://github.com/scalanlp/breeze/pull/364

Idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:
minimize f(w,h*)
s.t 1'w = 1, w >= 0 (row constraints)

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h

Here I want f(w,h) to be MAP loss but I already solved the least square variant 
in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in 
MAP statistics...Here also I expect Perplexity will improve...

If no one else is looking into it I would like to compare join based 
factorization based flow (ml.recommendation.ALS) with the graphx based LDA 
flow...

Infact if you think for large ranks, LDA based flow will be more efficient than 
join based factorization flow, I can implement stochastic matrix factorization 
directly on top of LDA and add both the least square and MAP losses...

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342312#comment-14342312
 ] 

Debasish Das commented on SPARK-5564:
-

By the way the following step is an approximation to the real constraint but if 
we get good results over Gibbs Sampling based approaches, there are ways to 
solve the real problem as well...

minimize f(w*,h)
s.t 0 <= h <= 1, Normalize each column in h 
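
A minimal sketch of that approximate step (clip h into [0, 1] entry-wise, then 
normalize each column to sum to 1); the Breeze DenseMatrix here is assumed only 
for illustration:
{code:scala}
import breeze.linalg.DenseMatrix

def clipAndNormalizeColumns(h: DenseMatrix[Double]): DenseMatrix[Double] = {
  val clipped = h.map(v => math.min(math.max(v, 0.0), 1.0))  // enforce 0 <= h <= 1
  for (j <- 0 until clipped.cols) {
    var colSum = 0.0
    for (i <- 0 until clipped.rows) colSum += clipped(i, j)
    if (colSum > 0.0)
      for (i <- 0 until clipped.rows) clipped(i, j) = clipped(i, j) / colSum
  }
  clipped
}
{code}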


 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be > 1.0.  It should support values > 0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-02-03 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302932#comment-14302932
 ] 

Debasish Das commented on SPARK-2426:
-

[~mengxr] [~coderxiang] David is out in Feb and I am not sure if we can cut a 
breeze release with the code. I refactored NNLS to breeze.optimize.linear due 
to its similarity to CG core. Proximal algorithms and QuadraticMinimizer are 
refactored to breeze.optimize.proximal. It will be great if you could also 
review the PR https://github.com/scalanlp/breeze/pull/321. 

With this solver added to Breeze I am ready to add in ALS modifications to 
Spark. The test-cases for default ALS and NNLS run fine with my Spark PR. I 
need to add appropriate test-cases for sparse coding and least square loss with 
lsa constraints as explained above. 

Should I add them to ml.als or mllib.als, since we now have two codebases? My 
current PR will merge fine with mllib.als but not with ml.als. I see there is a 
CholeskySolver but all those features are supported in 
breeze.optimize.proximal.QuadraticMinimizer.

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026
 ] 

Debasish Das commented on SPARK-4675:
-

Is there a metric like MAP / AUC kind of measure that can help us validate 
similarUsers and similarProducts ? 

Right now if I run column similarities with sparse vector on matrix 
factorization datasets for product similarities, it will assume all unvisited 
entries (which should be ?) as 0 and compute column similarities for...If the 
sparse vector has ? in place of 0 then basically all similarity calculation is 
incorrect...so in that sense it makes more sense to compute the similarities on 
the matrix factors...

But then we are back to map-reduce calculation of rowSimilarities.

 Find similar products and similar users in MatrixFactorizationModel
 ---

 Key: SPARK-4675
 URL: https://issues.apache.org/jira/browse/SPARK-4675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Steven Bourke
Priority: Trivial
  Labels: mllib, recommender

 Using the latent feature space that is learnt in MatrixFactorizationModel, I 
 have added 2 new functions to find similar products and similar users. A user 
 of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048
 ] 

Debasish Das commented on SPARK-4823:
-

[~srowen] did you implement map-reduce row similarities for user factors ? 
What's the algorithm that you used ? Any pointers will be really helpful...

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149
 ] 

Debasish Das commented on SPARK-2426:
-

[~mengxr] as per our discussion, QuadraticMinimizer and NNLS are both added to 
breeze and updated with breeze DenseMatrix and DenseVector...Inside breeze I 
did some interesting comparisons and that motivated me to port NNLS to breeze 
as well...I added all the testcases for QuadraticMinimizer and NNLS as well 
based on my experiments with MovieLens dataset...

Here is the PR: https://github.com/scalanlp/breeze/pull/321

To run the Quadratic programming variants in breeze:
runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99
regParam = 0.1, beta = 0.99 is Elastic Net parameter
 
It will randomly generate quadratic problems with 100 variables, 1 equality 
constraint and lower/upper bounds. This format is similar to PDCO QP generator 
(please look into my Matlab examples)

0.5x'Hx + c'x
s.t Ax = B, 
lb <= x <= ub
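
For readers without the Matlab scripts, a rough Breeze sketch of generating one 
random problem of this shape is below (the actual generator lives in the PR / 
PDCO scripts; the sizes and the ridge term are arbitrary choices):
{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

// Returns (H, c, Aeq, beq, lb, ub) for 0.5 x'Hx + c'x s.t Aeq x = beq, lb <= x <= ub
def randomBoxQp(n: Int): (DenseMatrix[Double], DenseVector[Double],
                          DenseMatrix[Double], DenseVector[Double],
                          DenseVector[Double], DenseVector[Double]) = {
  val a0 = DenseMatrix.rand(n, n)
  val h  = a0.t * a0 + DenseMatrix.eye[Double](n) * 1e-3  // PSD gram plus a small ridge
  val c  = DenseVector.rand(n)
  val aeq = DenseMatrix.ones[Double](1, n)                // single equality row: 1'x = beq
  val beq = DenseVector(1.0)
  val lb  = DenseVector.zeros[Double](n)
  val ub  = DenseVector.fill(n)(1.0)
  (h, c, aeq, beq, lb, ub)
}
{code}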

1. Unconstrained minimization: breeze luSolve, cg and qp(dposv added to breeze 
through this PR)

Minimize 0.5x'Hx + c'x

||qp - lu|| norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10
||cg - lu|| norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7
dim 100 lu 86.007 qp 41.56 cg 102.627

||qp - lu|| norm 4.267891623199082E-8 max-norm 6.681460718027665E-9
||cg - lu|| norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8
dim 500 lu 169.993 qp 78.001 cg 443.044

qp is faster than cg for smaller dimensions as expected. I also tried 
unconstrained BFGS but the results were not good. We are looking into it.

2. Elastic Net formulation: 0.5 x'Hx + c'x + (1-beta)*L2(x) + 
beta*regParam*L1(x)

beta = 0.99 Strong L1 regParam=0.1
||owlqn - sparseqp|| norm 0.1653200701235298 inf-norm 0.051855911945906996
sparseQp 61.948 ms iters 227 owlqn 928.11 ms

beta = 0.5 average L1 regParam=0.1
||owlqn - sparseqp|| norm 0.15823773098501168 inf-norm 0.035153837685728107
sparseQp 69.934 ms iters 353 owlqn 882.104 ms

beta = 0.01 mostly BFGS regParam=0.1

||owlqn - sparseqp|| norm 0.17950035092790165 inf-norm 0.04718697692014828
sparseQp 80.411 ms iters 580 owlqn 988.313 ms

ADMM based proximal formulation is faster for smaller dimension. Even as I 
scale dimension, I notice similar behavior that owlqn is taking longer to 
converge and results are not same. Look for example in dim = 500 case:

||owlqn - sparseqp|| norm 10.946326189397649 inf-norm 1.412726586317294
sparseQp 830.593 ms iters 2417 owlqn 19848.932 ms

I validated ADMM through Matlab scripts so there is something funky going on in 
OWLQN.

3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0 (see the projected-gradient sketch after this comment)

Here I compared the ADMM based proximal formulation with the CG based projected 
gradient in NNLS. NNLS converges much more nicely, but the convergence criterion 
does not look the same as breeze CG's, although they should be the same. 

For now I ported it to breeze and we can call NNLS for x >= 0 and 
QuadraticMinimizer for other formulations

dim = 100 posQp 16.367 ms iters 284 nnls 8.854 ms iters 107
dim = 500 posQp 303.184 ms iters 950 nnls 183.543 ms iters 517

NNLS on average looks 2X faster !

4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub
Validated through Matlab scripts above. Here are the runtime numbers:

dim = 100 boundsQp 15.654 ms iters 284 converged true
dim= 500 boundsQp 311.613 ms iters 950 converged true

5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >= 0
Validated through Matlab scripts above. Here are the runtime numbers:

dim = 100 Qp Equality 13.64 ms iters 184 converged true
dim = 500 Qp Equality 278.525 ms iters 890 converged true

With this change all copyrights are moved to breeze. Once it merges, I will 
update the Spark PR. With this change we can move ALS code to Breeze 
DenseMatrix and DenseVector as well 

My focus next will be to get a Truncated Newton running for convex cost since 
convex cost is required for PLSA, SVM and Neural Net formulations...

I am still puzzled that why BFGS/OWLQN is not working well for the 
unconstrained case/L1 optimization. If TRON works well for unconstrained case, 
that's what I will use for NonlinearMinimizer. I am looking more into it.
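
For item 3 above, a minimal projected-gradient sketch of the NNLS form 
0.5 x'Hx + c'x, x >= 0 (the NNLS ported to Breeze is CG-based; the fixed step 
size here is only for brevity):
{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

def projectedGradientNnls(h: DenseMatrix[Double], c: DenseVector[Double],
                          iters: Int = 500, step: Double = 1e-2): DenseVector[Double] = {
  var x = DenseVector.zeros[Double](c.length)
  for (_ <- 0 until iters) {
    val grad = h * x + c                              // gradient of 0.5 x'Hx + c'x
    x = (x - grad * step).map(v => math.max(v, 0.0))  // gradient step, then project onto x >= 0
  }
  x  // a 1/L step or line search is needed in practice instead of the fixed step
}
{code}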

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime 

[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207
 ] 

Debasish Das commented on SPARK-4823:
-

Even for matrix factorization userFactors are user x rank...with modest ranks 
of 50..and users at 10M, I don't think it is possible to transpose the matrix 
and run column similarities...doing it on the fly complexity wise is still 
O(n*n) right...

 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456
 ] 

Debasish Das commented on SPARK-2426:
-

[~akopich] I got good MAP results on recommendation datasets with the 
approximated PLSA formulation. I did not get time to compare that formulation 
with Gibbs sampling based LDA PR: 
https://issues.apache.org/jira/browse/SPARK-1405 yet. Did you compare them ?

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241535#comment-14241535
 ] 

Debasish Das commented on SPARK-4675:
-

There are a few issues:

1. Batch API for topK similar users and topK similar products
2. Comparison of product x product similarities generated with 
columnSimilarities and compared with topK similar products

I added batch APIs for topK product recommendation for each user and topK user 
recommendation for each product in SPARK-4231...similar batch API will be very 
helpful for topK similar users and topK similar products...

I agree with Cosine Similarity...you should be able to re-use column similarity 
calculations...I think a better idea is to add rowMatrix.similarRows and re-use 
that code to generate product similarities and user similarities...

But my question is more on validation. We can compute product similarities on 
raw features and we can compute product similarities on matrix product 
factor...which one is better ?

 Find similar products and similar users in MatrixFactorizationModel
 ---

 Key: SPARK-4675
 URL: https://issues.apache.org/jira/browse/SPARK-4675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Steven Bourke
Priority: Trivial
  Labels: mllib, recommender

 Using the latent feature space that is learnt in MatrixFactorizationModel, I 
 have added 2 new functions to find similar products and similar users. A user 
 of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242031#comment-14242031
 ] 

Debasish Das commented on SPARK-4823:
-

I am considering coming up with a baseline version that's very close to brute 
force but we cut the computation with a topK number...for each user come up 
with topK users where K is defined by the client..this will take care of matrix 
factorization use-case...

Basically on master we collect a set of user factors, broadcast it to every 
node and do a reduceByKey to generate topK users for each user from this user 
block...We send a kernel function (cosine / polynomial / rbf) in this 
calculation...

But this idea does not work for raw features right...If we do map features to a 
lower dimension using factorization then this approach should run fine...but I 
am not sure if we can ask users to map their data into a lower dimension

Is it possible to bring in ideas from fastfood and kitchen sink to do this ?
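
A small local sketch of that idea (broadcast block + pluggable kernel + top-K 
per user); the cosine kernel and the row ids are illustrative, and the 
cross-block reduceByKey is left out:
{code:scala}
import breeze.linalg.{DenseVector, norm}

def cosineKernel(a: DenseVector[Double], b: DenseVector[Double]): Double =
  (a dot b) / math.max(norm(a) * norm(b), 1e-12)

// For each local user row, score it against one broadcast block and keep top-K
def topKNeighbours(localRows: Array[(Long, DenseVector[Double])],
                   broadcastBlock: Array[(Long, DenseVector[Double])],
                   kernel: (DenseVector[Double], DenseVector[Double]) => Double,
                   k: Int): Array[(Long, Array[(Long, Double)])] =
  localRows.map { case (userId, vec) =>
    val scored = broadcastBlock.collect {
      case (otherId, other) if otherId != userId => (otherId, kernel(vec, other))
    }
    (userId, scored.sortBy { case (_, s) => -s }.take(k))
  }
{code}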


 rowSimilarities
 ---

 Key: SPARK-4823
 URL: https://issues.apache.org/jira/browse/SPARK-4823
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh

 RowMatrix has a columnSimilarities method to find cosine similarities between 
 columns.
 A rowSimilarities method would be useful to find similarities between rows.
 This JIRA is to investigate which algorithms are suitable for such a 
 method, better than brute-forcing it. Note that when there are many rows (> 
 10^6), it is unlikely that brute-force will be feasible, since the output 
 will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242034#comment-14242034
 ] 

Debasish Das commented on SPARK-4675:
-

[~josephkb] how do we validate that the low-dimensional space gives more 
meaningful similarities than the (sparse) raw feature space?

 Find similar products and similar users in MatrixFactorizationModel
 ---

 Key: SPARK-4675
 URL: https://issues.apache.org/jira/browse/SPARK-4675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Steven Bourke
Priority: Trivial
  Labels: mllib, recommender

 Using the latent feature space that is learnt in MatrixFactorizationModel, I 
 have added 2 new functions to find similar products and similar users. A user 
 of the API can for example pass a product ID, and get the closest products. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024
 ] 

Debasish Das commented on SPARK-1405:
-

We need a larger dataset as well, where the topics go to the range of 1+...That 
range will stress the factorization-based LSA formulations, since there is a broadcast 
of factors at each step...The NIPS dataset is small...would you guys be willing to 
test a Wikipedia dataset, for example? If there is a pre-processed version from 
either Mahout or scikit-learn, could we use that?

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM:
---

We need a larger dataset as well, where the topics go to the range of 1+...That 
range will stress the factorization-based LSA formulations, since there is a broadcast 
of factors at each step...The NIPS dataset is small...Let's start with that...But 
we should test a large dataset like Wikipedia as well...If there is a 
pre-processed version from either Mahout or scikit-learn, could we use that?


was (Author: debasish83):
We need a larger dataset as well where topics go to the range of 1+...That 
range will stress factorization based LSA formulations since there is broadcast 
of factors at each stepNIPS dataset is small...you guy's will be willing to 
test a wikipedia dataset for example ? If there is a pre-processed version from 
either mahout or scikit-learn we can use that ?

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222027#comment-14222027
 ] 

Debasish Das commented on SPARK-1405:
-

[~pedrorodriguez] did you write the metric in your repo as well? That way I 
don't have to code it up again...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222089#comment-14222089
 ] 

Debasish Das commented on SPARK-1405:
-

The NIPS dataset is common for PLSA and additive-regularization-based matrix 
factorization formulations as well, since the experiments in this paper also 
focused on it: 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

I will be using the NIPS dataset for quality experiments, but for scaling 
experiments the wiki data is good...the wiki data was demoed by Databricks at the 
last Spark Summit...it would be great if we can get it from that demo

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108
 ] 

Debasish Das commented on SPARK-1405:
-

@sparks that will be awesome...I should be fine running experiments on EC2...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM:
---

[~sparks] that will be awesome...I should be fine running experiments on EC2...


was (Author: debasish83):
@sparks that will be awesome...I should be fine running experiments on EC2...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221379#comment-14221379
 ] 

Debasish Das commented on SPARK-3066:
-

I ran experiments on the MovieLens dataset with varying rank on my localhost Spark 
with 4 GB RAM and 4 cores, to see how much MAP improvement we get as the rank is 
scaled:

All runs: got 1000209 ratings from 6040 users on 3706 movies; rank=10 is the default.

rank | Training ratings | Test ratings | Test RMSE          | Test users | MAP                  | Runtime
  10 |           799747 |       200462 | 0.8528377625407709 |       6036 | 0.03851426277536059  | 30 s
  25 |           800417 |       199792 | 0.8518001349769724 |       6037 | 0.04508057348514959  | 30 s
  50 |           800823 |       199386 | 0.8487416471685229 |       6038 | 0.05145126538369158  | 42 s
 100 |           800720 |       199489 | 0.8508095863317275 |       6033 | 0.0561225429735388   | 1.5 m
 150 |           800257 |       199952 | 0.8435902056186158 |       6035 | 0.05855252471878818  | 3.6 m
 200 |           800356 |       199853 | 0.8452385688272382 |       6037 | 0.059176892052172934 | 7.4 m

I will run through the MovieLens 10M and Netflix datasets and generate the numbers 
for them with varying ranks as well. I need to run those on a cluster.

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-21 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221505#comment-14221505
 ] 

Debasish Das edited comment on SPARK-1405 at 11/21/14 10:28 PM:


[~witgo] where can I access your dataset? I got the NIPS dataset from Pedro, 
but the runtimes reported here are on a different dataset...also, should we use 
the same accuracy measure that Pedro is using?


was (Author: debasish83):
[~witgo] where can I access your dataset ? I got the NIPS dataset from Pedro 
but here the runtimes reported are on a different dataset...also should be use 
the same accuracy measure that Pedro is using...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS

2014-11-20 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219799#comment-14219799
 ] 

Debasish Das commented on SPARK-4231:
-

[~srowen] I added batch predict APIs for user and product recommendation in the 
PR (both batch and mini-batch APIs that take a per-user topK for products).

For the MAP calculation code, I have still kept it in examples.MovieLensALS in 
this PR, but I feel it should be part of MatrixFactorizationModel as well, where the 
client can send an RDD of userIDs and the list of products they don't want to be 
recommended in the topK...I will add this API to MatrixFactorizationModel if it 
makes sense...
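
For reference, a minimal sketch of what the MAP computation looks like with mllib's 
RankingMetrics, assuming the recommended products and the held-out test products are 
already grouped per user (the method and variable names here are mine):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.evaluation.RankingMetrics

// predicted: per-user product ids, ranked by the recommender (the topK list)
// relevant:  per-user product ids observed in the held-out test set
def meanAveragePrecision(
    predicted: RDD[(Int, Array[Int])],
    relevant: RDD[(Int, Array[Int])]): Double = {
  val predictionAndLabels = predicted.join(relevant).map { case (_, (pred, labels)) => (pred, labels) }
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}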

 Add RankingMetrics to examples.MovieLensALS
 ---

 Key: SPARK-4231
 URL: https://issues.apache.org/jira/browse/SPARK-4231
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.2.0
Reporter: Debasish Das
 Fix For: 1.2.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 examples.MovieLensALS computes RMSE for movielens dataset but after addition 
 of RankingMetrics and enhancements to ALS, it is critical to look at not only 
 the RMSE but also measures like prec@k and MAP.
 In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and 
 also added a flag that takes an input whether user/product recommendation is 
 being validated.
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-19 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667
 ] 

Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM:


[~mengxr] as per our discussions, I added APIs for batch user and product 
recommendation and MAP computation for recommending topK products for users...

Note that I don't use reservoir sampling; I used your idea of filtering out the 
test-set users for which no model was built...I thought reservoir sampling should 
be part of a separate PR.

APIs added:

recommendProductsForUsers(num: Int) : topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every 
user

recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product

I used mllib BLAS for all the computation and cleaned up the DoubleMatrix code from 
MatrixFactorizationModel...I have not used level-3 BLAS yet...I can add that as 
well if the rest of the flow makes sense...

On examples.MovieLensALS we can activate the user MAP calculation using the 
--validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar 
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class 
org.apache.spark.examples.mllib.MovieLensALS 
./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 
--validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. 


Training: 799617, test: 200592.
Test RMSE = 0.8495476608536306. 


Test users 6032 MAP 0.03798337814233403 

I will run the Netflix dataset and report the MAP measures for that...

On our internal datasets, I have tested with 1M users, 10K products, 120 cores and 
240 GB, computing the topK users for each product, and that takes around 5 mins...on 
average I generate a ranked list of 6000 users for each product...Basically, 
internally we are using the batch API:

recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product


was (Author: debasish83):
@mengxr as per our discussions, I added APIs for batch user and product 
recommendation and MAP computation for recommending topK products for users...

Note that I don't use reservoir sampling and used your idea of filtering the 
test set users for which there are no model being built...I thought reservoir 
sampling should be part of a separate PR

APIs added:

recommendProductsForUsers(num: Int) : topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every 
user

recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product

I used mllib BLAS for all the computation and cleaned up DoubleMatrix code from 
MatrixFactorizationModel...I have not used level 3 BLAS yet...I can add that as 
well if rest of the flow makes sense...

On examples.MovieLensALS we can activate the user map calculation using 
--validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar 
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class 
org.apache.spark.examples.mllib.MovieLensALS 
./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 
--validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. 


Training: 799617, test: 200592.
Test RMSE = 0.8495476608536306. 


Test users 6032 MAP 0.03798337814233403 

I will run the netflix dataset and report the MAP measures for that..

On our internal datasets, I have tested for 1M users, 10K products, 120 cores, 
240GB for topK users for each product and that takes around 5 mins...on an 
average I generate a ranked list of 6000 users for each product...Basically 
internally we are using the batch API:

recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
 

[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2014-11-19 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667
 ] 

Debasish Das commented on SPARK-3066:
-

@mengxr as per our discussions, I added APIs for batch user and product 
recommendation and MAP computation for recommending topK products for users...

Note that I don't use reservoir sampling; I used your idea of filtering out the 
test-set users for which no model was built...I thought reservoir sampling should 
be part of a separate PR.

APIs added:

recommendProductsForUsers(num: Int) : topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every 
user

recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product

I used mllib BLAS for all the computation and cleaned up the DoubleMatrix code from 
MatrixFactorizationModel...I have not used level-3 BLAS yet...I can add that as 
well if the rest of the flow makes sense...

On examples.MovieLensALS we can activate the user MAP calculation using the 
--validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar 
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class 
org.apache.spark.examples.mllib.MovieLensALS 
./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 
--validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. 


Training: 799617, test: 200592.
Test RMSE = 0.8495476608536306. 


Test users 6032 MAP 0.03798337814233403 

I will run the Netflix dataset and report the MAP measures for that...

On our internal datasets, I have tested with 1M users, 10K products, 120 cores and 
240 GB, computing the topK users for each product, and that takes around 5 mins...on 
average I generate a ranked list of 6000 users for each product...Basically, 
internally we are using the batch API:

recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for 
every product
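
A small end-to-end usage sketch of the two batch APIs above (the 
recommendProductsForUsers / recommendUsersForProducts calls are the signatures 
proposed in this PR, so they are not in a released MLlib yet; the toy data and 
parameters are just for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(new SparkConf().setAppName("batch-recommend").setMaster("local[2]"))
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0), Rating(1, 11, 3.0),
  Rating(2, 10, 5.0), Rating(2, 12, 2.0),
  Rating(3, 11, 1.0), Rating(3, 12, 4.0)))
val model = ALS.train(ratings, 8, 5, 0.065)   // rank, iterations, lambda

// Fixed topK for every user (API proposed in this PR)
val topProductsPerUser = model.recommendProductsForUsers(2)

// Variable topK per product (API proposed in this PR)
val perProductK = sc.parallelize(Seq((10, 2), (11, 1), (12, 2)))
val topUsersPerProduct = model.recommendUsersForProducts(perProductK)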

 Support recommendAll in matrix factorization model
 --

 Key: SPARK-3066
 URL: https://issues.apache.org/jira/browse/SPARK-3066
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 ALS returns a matrix factorization model, which we can use to predict ratings 
 for individual queries as well as small batches. In practice, users may want 
 to compute top-k recommendations offline for all users. It is very expensive 
 but a common problem. We can do some optimization like
 1) collect one side (either user or product) and broadcast it as a matrix
 2) use level-3 BLAS to compute inner products
 3) use Utils.takeOrdered to find top-k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-19 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218845#comment-14218845
 ] 

Debasish Das commented on SPARK-1405:
-

I would like to compare the LSA formulations (sparse coding and PLSA with 
least-square loss) from https://issues.apache.org/jira/browse/SPARK-2426 with LDA.

I added a MAP metric for examples.MovieLensALS in 
https://issues.apache.org/jira/browse/SPARK-4231, but I am not sure how MAP can 
be used for topic modeling datasets...we need some perplexity measure...

Could you guys please point me to the dataset and the quality measures that are 
being benchmarked on the LDA PR, so that I can also test the LSA formulations in 
parallel?
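
On the perplexity point, the usual held-out measure is just the exponential of the 
negative per-token log-likelihood; a one-line sketch for reference (the 
log-likelihood itself has to come from whichever model, LDA or LSA, is being 
evaluated):

// perplexity = exp(-logLikelihood / tokenCount); lower is better
def perplexity(heldOutLogLikelihood: Double, heldOutTokenCount: Long): Double =
  math.exp(-heldOutLogLikelihood / heldOutTokenCount)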


 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from text corpus. Different with current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient desent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-11-19 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218891#comment-14218891
 ] 

Debasish Das commented on SPARK-2426:
-

With the MAP measures added to examples.MovieLensALS through 
https://issues.apache.org/jira/browse/SPARK-4231, I compared the quality and 
runtime of the matrix completion formulations on the MovieLens 1M dataset:

Default: userConstraint L2, productConstraint L2, lambdaUser=lambdaProduct=0.065, 
rank=100, 10 iterations

Test RMSE = 0.8436480113821955.
Test users 6038 MAP 0.05860164548002782

Solver: Cholesky decomposition followed by forward-backward solves
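
For reference, a small Breeze-based sketch of that baseline solve on the per-factor 
normal equations (my own illustration of the technique, not the actual ALS / 
QuadraticMinimizer code):

import breeze.linalg.{DenseMatrix, DenseVector, cholesky}

// Solve (A^T A + lambda I) x = A^T b via Cholesky, then forward and backward substitution.
def solveNormalEquations(ata: DenseMatrix[Double], atb: DenseVector[Double], lambda: Double): DenseVector[Double] = {
  val n = ata.rows
  val reg = ata + DenseMatrix.eye[Double](n) * lambda
  val L = cholesky(reg)                      // reg = L * L.t, L lower triangular

  // forward solve: L y = atb
  val y = DenseVector.zeros[Double](n)
  for (i <- 0 until n) {
    var s = atb(i)
    for (j <- 0 until i) s -= L(i, j) * y(j)
    y(i) = s / L(i, i)
  }
  // backward solve: L^T x = y
  val x = DenseVector.zeros[Double](n)
  for (i <- (n - 1) to 0 by -1) {
    var s = y(i)
    for (j <- i + 1 until n) s -= L(j, i) * x(j)
    x(i) = s / L(i, i)
  }
  x
}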

Per iteration runtime for baseline (solveTime in ms)

14/11/19 17:37:06 INFO ALS: usersOrProducts 924 slowConvergence 0 
QuadraticMinimizer solveTime 362.813 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 910 slowConvergence 0 
QuadraticMinimizer solveTime 314.527 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 927 slowConvergence 0 
QuadraticMinimizer solveTime 265.75 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 918 slowConvergence 0 
QuadraticMinimizer solveTime 271.513 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1510 slowConvergence 0 
QuadraticMinimizer solveTime 370.177 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1512 slowConvergence 0 
QuadraticMinimizer solveTime 467.994 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1507 slowConvergence 0 
QuadraticMinimizer solveTime 511.894 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1511 slowConvergence 0 
QuadraticMinimizer solveTime 481.189 Iters 0

NMF: userConstraint POSITIVE, productConstraint POSITIVE, 
userLambda=productLambda=0.065 L2 regularization

Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800670, test: 199539.
Quadratic minimization userConstraint POSITIVE productConstraint POSITIVE
Test RMSE = 0.8435335132641906.
Test users 6038 MAP 0.056361816590625446

ALS iteration 1 runtime:

QuadraticMinimizer convergence profile:

14/11/19 17:46:46 INFO ALS: usersOrProducts 918 slowConvergence 0 
QuadraticMinimizer solveTime 1936.281 Iters 73132
14/11/19 17:46:46 INFO ALS: usersOrProducts 927 slowConvergence 0 
QuadraticMinimizer solveTime 1871.364 Iters 75219
14/11/19 17:46:46 INFO ALS: usersOrProducts 910 slowConvergence 0 
QuadraticMinimizer solveTime 2067.735 Iters 73180
14/11/19 17:46:46 INFO ALS: usersOrProducts 924 slowConvergence 0 
QuadraticMinimizer solveTime 2127.161 Iters 75546
14/11/19 17:46:53 INFO ALS: usersOrProducts 1507 slowConvergence 0 
QuadraticMinimizer solveTime 3813.923 Iters 193207
14/11/19 17:46:54 INFO ALS: usersOrProducts 1511 slowConvergence 0 
QuadraticMinimizer solveTime 3894.068 Iters 196882
14/11/19 17:46:54 INFO ALS: usersOrProducts 1510 slowConvergence 0 
QuadraticMinimizer solveTime 3875.915 Iters 193987
14/11/19 17:46:54 INFO ALS: usersOrProducts 1512 slowConvergence 0 
QuadraticMinimizer solveTime 3939.765 Iters 192471

NNLS convergence profile:

14/11/19 17:46:46 INFO ALS: NNLS solveTime 252.909 iters 7381
14/11/19 17:46:46 INFO ALS: NNLS solveTime 256.803 iters 7740
14/11/19 17:46:46 INFO ALS: NNLS solveTime 274.352 iters 7491
14/11/19 17:46:46 INFO ALS: NNLS solveTime 272.971 iters 7664
14/11/19 17:46:53 INFO ALS: NNLS solveTime 1487.262 iters 60338
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1472.742 iters 61321
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1489.863 iters 62228
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1494.192 iters 60489

ALS iteration 10

Quadratic Minimizer convergence profile:

14/11/19 17:48:17 INFO ALS: usersOrProducts 924 slowConvergence 0 
QuadraticMinimizer solveTime 1082.056 Iters 53724
14/11/19 17:48:17 INFO ALS: usersOrProducts 910 slowConvergence 0 
QuadraticMinimizer solveTime 1180.601 Iters 50593
14/11/19 17:48:17 INFO ALS: usersOrProducts 927 slowConvergence 0 
QuadraticMinimizer solveTime 1106.131 Iters 53069
14/11/19 17:48:17 INFO ALS: usersOrProducts 918 slowConvergence 0 
QuadraticMinimizer solveTime 1108.478 Iters 50895
14/11/19 17:48:23 INFO ALS: usersOrProducts 1510 slowConvergence 0 
QuadraticMinimizer solveTime 2262.193 Iters 116818
14/11/19 17:48:23 INFO ALS: usersOrProducts 1512 slowConvergence 0 
QuadraticMinimizer solveTime 2293.64 Iters 116026
14/11/19 17:48:23 INFO ALS: usersOrProducts 1507 slowConvergence 0 
QuadraticMinimizer solveTime 2241.491 Iters 116293
14/11/19 17:48:23 INFO ALS: usersOrProducts 1511 slowConvergence 0 
QuadraticMinimizer solveTime 2372.957 Iters 118391

NNLS convergence profile:

14/11/19 17:48:17 INFO ALS: NNLS solveTime 623.031 iters 21611
14/11/19 17:48:17 INFO ALS: NNLS solveTime 553.493 iters 21732
14/11/19 17:48:17 INFO ALS: NNLS solveTime 559.9 iters 22511
14/11/19 17:48:17 INFO ALS: NNLS solveTime 556.654 iters 21330
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1672.582 iters 86006
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1703.221 iters 85824
14/11/19 17:48:23 INFO ALS: NNLS 
