[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728020#comment-16728020 ] Debasish Das commented on SPARK-24374: -- Hi [~mengxr], with barrier mode available, is it not possible to use the native TF parameter server in place of MPI? Although we are offloading compute from Spark to TF workers/ps, if an exception comes out, tracking it with the native TF API might be easier than an MPI exception...great work by the way...I was looking for a cloud-ml alternative using Spark over AWS/Azure/GCP, and it looks like barrier should help a lot, although I am still not clear on the limitations of the TensorFlowOnSpark project from Yahoo [https://github.com/yahoo/TensorFlowOnSpark], which tried to put in barrier-like syntax; I am not sure whether it re-runs the full job or only the failed partitions when a few partitions fail on tfrecord read / communication exceptions...I guess the exceptions from the failed partitions can be thrown back to the Spark driver, and the driver can take the action to re-run...when multiple TF training jobs get scheduled on the same Spark cluster, I suspect TFoS might have issues as well... > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
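For concreteness, the barrier API this SPIP produced (Spark 2.4: RDD.barrier() and BarrierTaskContext) lets a stage host a gang-scheduled training job roughly as in the sketch below; startTraining is a hypothetical hook for the user's TF/MPI worker code, not a Spark API.

{code}
import org.apache.spark.{BarrierTaskContext, SparkContext}

def startTraining(rank: Int, peers: Array[String]): Unit = {
  // hypothetical: launch the TF worker / parameter server for this rank here
}

def runBarrierTraining(sc: SparkContext, numWorkers: Int): Unit = {
  sc.parallelize(0 until numWorkers, numWorkers)
    .barrier()                                      // all tasks launch together or not at all
    .mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      val peers = ctx.getTaskInfos().map(_.address) // addresses of every task in the stage
      ctx.barrier()                                 // global sync point before training starts
      startTraining(ctx.partitionId(), peers)
      iter
    }
    .count()  // if any task fails, Spark aborts and retries the whole stage
}
{code}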
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809876#comment-15809876 ] Debasish Das commented on SPARK-10078: -- I looked into the code and I see we are replicating the Breeze BFGS and OWLQN core logic in this PR: https://github.com/yanboliang/spark-vlbfgs/tree/master/src/main/scala/org/apache/spark/ml/optim. We can provide a DiffFunction interface that works on a feature partition and add the VL-BFGS paper logic as a refactoring of the current Breeze BFGS code... A DiffFunction can then run with either a DistributedVector or a Vector. That helps because even with features < 100M we can run multi-core VL-BFGS by using multiple partitions, and an if-else switch is not necessary. I can provide breeze interfaces based on your PR if you agree with the idea. BFGS and OWLQN are a few variants, but Breeze has several constrained solvers that use the BFGS code... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
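To make the proposed refactoring concrete, here is a hedged sketch (not the PR's code) of a breeze DiffFunction whose vector type is generic, so the same LBFGS call could in principle be instantiated with a local DenseVector or a DistributedVector, given the implicit vector-space operations breeze requires; the quadratic loss is a toy.

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy loss: 0.5 * ||x - target||^2; calculate returns (objective, gradient).
val target = DenseVector(1.0, 2.0)
val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val r = x - target
    (0.5 * (r dot r), r)
  }
}

// A VL-BFGS variant would instantiate LBFGS[DistributedVector] instead,
// provided the distributed type supplies the algebra breeze needs.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
val xOpt = lbfgs.minimize(loss, DenseVector.zeros[Double](2))
{code}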
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793770#comment-15793770 ] Debasish Das commented on SPARK-10078: -- [~mengxr] [~dlwh] is it possible to implement VL-BFGS as part of breeze, so that OWLQN, LBFGS, LBFGS-B and proximal.NonlinearMinimizer all benefit from it? We can bring it in the way we bring in LBFGS/OWLQN right now...If it makes sense, I can look at the design doc and propose a breeze interface to abstract the RDD details... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793760#comment-15793760 ] Debasish Das edited comment on SPARK-10078 at 1/3/17 12:26 AM: --- Ideally feature partitioning should be automatically tuned...at 100M features, the master-only processing we do today with Breeze LBFGS / OWLQN will also benefit from VL-BFGS...Ideally it should be part of breeze, and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML...There are bounded BFGS variants in breeze as well...all of them will benefit from this change. The solver can then be used in other frameworks as well and need not be constrained to RDDs if possible... was (Author: debasish83): Ideally feature partitioning should be automatically tuned...at 100M features master only processing what we do with Breeze LBFGS / OWLQN will also get benefitted by VL-BFGS...Ideally it should be part of breeze and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793760#comment-15793760 ] Debasish Das commented on SPARK-10078: -- Ideally feature partitioning should be automatically tuned...at 100M features, the master-only processing we do today with Breeze LBFGS / OWLQN will also benefit from VL-BFGS...Ideally it should be part of breeze and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15777650#comment-15777650 ] Debasish Das edited comment on SPARK-13857 at 12/26/16 5:57 AM: item->item and user->user were done in an old PR I had...if there is interest I can resend it...nice to see how it compares with the approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 was (Author: debasish83): item->item and user->user was done in an old PR I had...if there is interested I can resend it...nice to see how it compares with approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15777650#comment-15777650 ] Debasish Das commented on SPARK-13857: -- item->item and user->user were done in an old PR I had...if there is interest I can resend it...nice to see how it compares with the approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
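As an illustration of the kind of computation the old PR covered (not its actual code), a brute-force item->item cosine similarity over MatrixFactorizationModel factors can be sketched as below; it broadcasts the factors, so it only suits modest item counts.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Top-k most cosine-similar items for every item, from the learned factors.
def itemSimilarities(model: MatrixFactorizationModel, topK: Int)
  : RDD[(Int, Array[(Int, Double)])] = {
  def norm(v: Array[Double]) = math.sqrt(v.map(x => x * x).sum)
  val items = model.productFeatures.mapValues(f => (f, norm(f)))
  val bcItems = items.sparkContext.broadcast(items.collect())
  items.map { case (i, (fi, ni)) =>
    val sims = bcItems.value.iterator
      .filter(_._1 != i)
      .map { case (j, (fj, nj)) =>
        val dot = fi.zip(fj).map { case (a, b) => a * b }.sum
        (j, dot / (ni * nj))                 // cosine similarity between factor vectors
      }
    (i, sims.toArray.sortBy(-_._2).take(topK))
  }
}
{code}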
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581366#comment-15581366 ] Debasish Das commented on SPARK-5992: - Also, do you have a hash function for euclidean distance? We use cosine, jaccard and euclidean with SPARK-4823...for knn comparison we can use an overlap metric...pick a k and then compare the overlap between LSH-based approximate knn and brute-force knn...let me know if you need help in running the benchmarks... > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
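For euclidean distance, the standard choice is the p-stable hash family of Datar et al.: h(v) = floor((a.v + b) / w) with a drawn from N(0,1)^d and b uniform in [0, w). A minimal sketch (parameters d and w are the caller's choice, names are illustrative):

{code}
import scala.util.Random

class EuclideanHash(d: Int, w: Double, seed: Long) {
  private val rng = new Random(seed)
  private val a = Array.fill(d)(rng.nextGaussian())  // random projection direction
  private val b = rng.nextDouble() * w               // random offset in [0, w)

  def hash(v: Array[Double]): Int = {
    val proj = a.zip(v).map { case (ai, vi) => ai * vi }.sum
    math.floor((proj + b) / w).toInt                 // nearby points tend to share a bucket
  }
}
{code}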
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581361#comment-15581361 ] Debasish Das commented on SPARK-5992: - Did you compare with brute-force knn? Normally LSH does not work well for nn queries, and that's why hybrid spill trees and other ideas came along...I can run some comparisons using SPARK-4823 > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581359#comment-15581359 ] Debasish Das commented on SPARK-4823: - We use it in multiple use cases internally but did not get time to refactor the PR into 3 smaller PRs...I will update the PR to 2.0 > rowSimilarities > --- > > Key: SPARK-4823 > URL: https://issues.apache.org/jira/browse/SPARK-4823 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > Attachments: MovieLensSimilarity Comparisons.pdf, > SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf > > > RowMatrix has a columnSimilarities method to find cosine similarities between > columns. > A rowSimilarities method would be useful to find similarities between rows. > This JIRA is to investigate which algorithms are suitable for such a > method, better than brute-forcing it. Note that when there are many rows (> > 10^6), it is unlikely that brute-force will be feasible, since the output > will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411023#comment-15411023 ] Debasish Das commented on SPARK-6932: - [~rxin] [~sowen] Do we have any other active parameter server effort going on other than the glint project from Rolf? I have started to look into glint to scale Spark-as-a-Service to process queries (the idea is to keep the Spark master as a coordinator where zero compute happens other than coordination through messages; in our impl right now compute happens on the master, which is a major con). More details will be covered in the talk https://spark-summit.org/eu-2016/events/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building/ but I believe a parameter server (or something similar) will be needed to scale query-processing further, to a Cassandra ring architecture for example...We will provide our implementation of the spark-lucene integration as part of our open source framework (Trapezium). > A Prototype of Parameter Server > --- > > Key: SPARK-6932 > URL: https://issues.apache.org/jira/browse/SPARK-6932 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, Spark Core >Reporter: Qiping Li > > h2. Introduction > As specified in > [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be > very helpful to integrate parameter server into Spark for machine learning > algorithms, especially for those with ultra-high-dimensional features. > After carefully studying the design doc of [Parameter > Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing], and > the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], > we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with > several key design concerns: > * *User friendly interface* > Careful investigation is done of most existing Parameter Server > systems (including: [petuum|http://petuum.github.io], [parameter > server|http://parameterserver.org], > [paracel|https://github.com/douban/paracel]) and a user friendly interface is > designed by absorbing the essence of all these systems. > * *Prototype of distributed array* > IndexRDD (see > [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem > to be a good option for a distributed array, because in most cases, the #key > updates/second is not very high. > So we implement a distributed HashMap to store the parameters, which can > be easily extended to get better performance. > > * *Minimal code change* > Quite a lot of effort is done to avoid code changes to Spark core. Tasks > which need the parameter server are still created and scheduled by Spark's > scheduler. Tasks communicate with the parameter server through a client object, > over *akka* or *netty*. > With all these concerns we propose the following architecture: > h2. Architecture > !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! > Data is stored in RDD and is partitioned across workers. During each > iteration, each worker gets parameters from the parameter server, then computes > new parameters based on the old parameters and the data in its partition. Finally > each worker pushes the updated parameters to the parameter server. A worker communicates with > the parameter server through a parameter server client, which is initialized in the > `TaskContext` of this worker. 
> The current implementation is based on YARN cluster mode, > but it should not be a problem to transplant it to other modes. > h3. Interface > We refer to existing parameter server systems (petuum, parameter server, > paracel) when designing the interface of the parameter server. > *`PSClient` provides the following interface for workers to use:* > {code} > // get parameter indexed by key from parameter server > def get[T](key: String): T > // get multiple parameters from parameter server > def multiGet[T](keys: Array[String]): Array[T] > // add parameter indexed by `key` by `delta`, > // if multiple `delta` to update on the same parameter, > // use `reduceFunc` to reduce these `delta`s first. > def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit > // update multiple parameters at the same time, use the same `reduceFunc`. > def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit > > // advance clock to indicate that current iteration is finished. > def clock(): Unit > > // block until all workers have reached this line of code. > def sync(): Unit > {code} > *`PSContext` provides the following functions to use on the driver:* > {code} > // load parameters from existing rdd. > def loadPSModel[T](model: RDD[String, T]) > // fetch parameters from parameter server to construct
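As a usage illustration of the PSClient interface quoted above (a sketch only: PSClient is the proposed interface, not an existing Spark API, and localGradient is a stub for the user's per-partition gradient code), a worker task's training loop could look like this:

{code}
// Placeholder for the real gradient computed on this partition's data.
def localGradient(w: Array[Double], data: Array[(Double, Array[Double])]): Array[Double] =
  new Array[Double](w.length)

def workerLoop(ps: PSClient, data: Array[(Double, Array[Double])], iterations: Int): Unit = {
  for (_ <- 0 until iterations) {
    val w = ps.get[Array[Double]]("weights")               // pull current parameters
    val delta = localGradient(w, data)                     // compute local update
    ps.update[Array[Double]]("weights", delta,
      (a, b) => a.zip(b).map { case (x, y) => x + y })     // deltas summed server-side
    ps.clock()                                             // this iteration is finished
    ps.sync()                                              // wait for all workers
  }
}
{code}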
[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315935#comment-15315935 ] Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM: - Do you have runtime comparisons showing that when features <= 4096, OLS using Normal Equations is faster than BFGS? I am extending OLS to sparse features, and it would be great if you could point to the runtime experiments you have done... was (Author: debasish83): Do you have runtime comparisons that when features <= 4096, OLS using Normal Equations is faster than BFGS ? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315935#comment-15315935 ] Debasish Das commented on SPARK-9834: - Do you have runtime comparisons showing that when features <= 4096, OLS using Normal Equations is faster than BFGS? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
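A sketch of the one-pass approach the description outlines (not Spark's actual implementation): aggregate A^T A and A^T b across partitions, then solve the d x d system locally on the driver; breeze's \ performs the dense factorization.

{code}
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

def olsNormalEquations(data: RDD[(Double, Array[Double])], d: Int): DenseVector[Double] = {
  val (ata, atb) = data.treeAggregate(
    (DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
    seqOp = { case ((m, v), (y, x)) =>
      val xv = DenseVector(x)
      (m += xv * xv.t, v += xv * y)          // rank-1 update of A^T A, accumulate A^T b
    },
    combOp = { case ((m1, v1), (m2, v2)) => (m1 += m2, v1 += v2) }
  )
  ata \ atb                                   // solve (A^T A) w = A^T b on the driver
}
{code}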
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735706#comment-14735706 ] Debasish Das commented on SPARK-10408: -- [~avulanov] In MLP, can we change BFGS to OWLQN and get L1 regularization? That way I can get sparse weights and clean up the network to avoid overfitting...For the autoencoder, did you experiment with a graphx based design? I would like to work on it. Basically the idea is to come up with an N-layer deep autoencoder that can support similar prediction APIs to matrix factorization. > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1], real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
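The optimizer swap the comment asks about is mechanical in breeze, since OWLQN shares LBFGS's interface and adds per-coordinate L1 strength; a toy sketch (the loss here is illustrative, not MLP's actual objective):

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val r = w - DenseVector(3.0, 0.05)       // stand-in for the data-fit gradient
    (0.5 * (r dot r), r)
  }
}
// L1 strength 0.1 per coordinate; small weights are driven to exactly 0,
// which is the sparsity the comment wants for pruning the network.
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, (_: Int) => 0.1, 1e-6)
val w = owlqn.minimize(loss, DenseVector.zeros[Double](2))
{code}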
[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734170#comment-14734170 ] Debasish Das edited comment on SPARK-9834 at 9/8/15 3:18 PM: - [~mengxr] If you are open to using breeze.proximal.QuadraticMinimizer, we can support elastic net in this variant as well...The flow will be very similar to the QuadraticMinimizer integration into ALS...I have done runtime benchmarks compared to OWLQN, and if we can afford to do a dense cholesky, QuadraticMinimizer converges faster than OWLQN. There are two new QuadraticMinimizer features I am working on which will further improve the solver: 1. Sparse LDL through Tim Davis's LGPL code, using breeze sparse matrices for sparse gram and conic formulations. The plan is to add it in breeze-native under LGPL, similar to the netlib-java integration. 2. ADMM acceleration using Nesterov's method. ADMM can be run at the same complexity as FISTA (implemented in TFOCS). Reference: http://www.optimization-online.org/DB_FILE/2009/12/2502.pdf Although in practice I found even the ADMM implemented right now in QuadraticMinimizer converges faster than OWLQN. Tom in his paper demonstrated faster ADMM convergence compared to FISTA for quadratic problems: ftp://ftp.math.ucla.edu/pub/camreport/cam12-35.pdf. Since X^TX is available in these problems (ALS and linear regression), I also compute the min and max eigenvalues using power iteration (breeze.optimize.linear.PowerMethod) in the code, which gives the Lipschitz estimate L, so there is no line-search overhead. This trick did not work for the nonlinear variant, as the hessian estimates are not close to the gram matrix ! QuadraticMinimizer is optimized to run at par with blas dposv when there are no constraints, while BFGS/OWLQN both still have a lot of overhead from iterators etc. That might also be the reason that I see QuadraticMinimizer being faster than BFGS/OWLQN. It might be the right time to do the micro-benchmark for QuadraticMinimizer that you asked for as well. Let me know what you think. I can finish up the micro-benchmark, bring the runtime of QuadraticMinimizer to the ALS NormalEquationSolver, and then start the L1 experiments. was (Author: debasish83): If you are open to use breeze.proximal.QuadraticMinimizer we can support elastic net in this variant as well...I can add it on top of your PR...it will be very similar to quadraticminimizer integration to ALS...I have done runtime benchmarks compared to OWLQN and if we can afford to do dense cholesky QuadraticMinimizer converges faster than OWLQN...there are two new features I am working on...sparse ldl through tim davis lgpl code and using breeze sparse matrix for sparse gram and conic formulations and admm acceleration using nesterov method...admm can also be run in the same complexity as FISTA...David Goldfarb proved it. > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
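The power-iteration trick mentioned above can be sketched in a few lines: with the gram matrix H = X^T X in hand, its largest eigenvalue is the Lipschitz constant L of the quadratic's gradient, so accelerated (Nesterov/FISTA-style) steps of size 1/L need no line search. A plain sketch, not breeze's PowerMethod itself:

{code}
import breeze.linalg.{DenseMatrix, DenseVector, norm}

// Estimate the largest eigenvalue of a symmetric PSD matrix H by power iteration.
def maxEigen(h: DenseMatrix[Double], iters: Int = 100): Double = {
  var v = DenseVector.rand(h.cols)
  var lambda = 0.0
  for (_ <- 0 until iters) {
    val hv = h * v
    lambda = norm(hv)          // converges to lambda_max as v aligns with the top eigenvector
    v = hv / lambda
  }
  lambda                       // use 1.0 / lambda as the gradient step size
}
{code}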
[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734170#comment-14734170 ] Debasish Das commented on SPARK-9834: - If you are open to using breeze.proximal.QuadraticMinimizer we can support elastic net in this variant as well...I can add it on top of your PR...it will be very similar to the QuadraticMinimizer integration into ALS...I have done runtime benchmarks compared to OWLQN, and if we can afford to do a dense cholesky, QuadraticMinimizer converges faster than OWLQN...there are two new features I am working on...sparse LDL through Tim Davis's LGPL code, using breeze sparse matrix for sparse gram and conic formulations, and ADMM acceleration using Nesterov's method...ADMM can also be run at the same complexity as FISTA...David Goldfarb proved it. > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734130#comment-14734130 ] Debasish Das commented on SPARK-10078: -- [~mengxr] will it be a Breeze LBFGS modification or part of mllib.optimization? Is someone looking into it? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: SparkMeetup2015-Experiments2.pdf SparkMeetup2015-Experiments1.pdf rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648340#comment-14648340 ] Debasish Das commented on SPARK-4823: - We did a more detailed experiment for the July 2015 Spark Meetup to understand the shuffle effects on runtime. I attached the data from the experiments to the JIRA. I will update the PR as discussed with Reza. I am targeting 1 PR for Spark 1.5. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583886#comment-14583886 ] Debasish Das commented on SPARK-2336: - Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree? We have a PR for rowSimilarities which we will use to compare the QoR of your PR as soon as you open up a stable version. Approximate k-NN Models for MLLib - Key: SPARK-2336 URL: https://issues.apache.org/jira/browse/SPARK-2336 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: clustering, features After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583886#comment-14583886 ] Debasish Das edited comment on SPARK-2336 at 6/12/15 6:51 PM: -- Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree? We have a PR for rowSimilarities https://github.com/apache/spark/pull/6213 for brute force KNN generation which we will use to compare the QoR of your PR as soon as you open up a stable version. was (Author: debasish83): Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree. We have a PR for rowSimilarities which we will use to compare the QoR of your PR as soon as you open up a stable version. Approximate k-NN Models for MLLib - Key: SPARK-2336 URL: https://issues.apache.org/jira/browse/SPARK-2336 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: clustering, features After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Affects Version/s: (was: 1.4.0) Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Debasish Das Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from the Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now ! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are the same as above, except that we don't support affine + bounds, Aeq x = beq, lb <= x <= ub, yet. Most likely we don't need that for ML applications 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using a graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
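To illustrate the g(z) piece of the min f(x) + g(z) formulation above: for the L1 constraint, the proximal operator is elementwise soft-thresholding, prox_{lambda||.||_1}(v)_i = sign(v_i) * max(|v_i| - lambda, 0). A minimal standalone sketch (not the breeze Proximal implementation itself):

{code}
import breeze.linalg.DenseVector

// Soft-thresholding: shrinks each coordinate toward zero by lambda and
// zeroes out anything smaller, which is what induces sparse factors.
def softThreshold(v: DenseVector[Double], lambda: Double): DenseVector[Double] =
  v.map(x => math.signum(x) * math.max(math.abs(x) - lambda, 0.0))
{code}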
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: MovieLensSimilarity Comparisons.pdf The attached file shows the runtime comparison of the row and column based flows on all items from the MovieLens dataset on my local Macbook with 8 cores, 1 GB driver and 4 GB executor memory. 1e-2 is the threshold that's set for both the row based kernel flow and the column based dimsum flow. Stages 24 - 35 are the row similarity flow. Total runtime ~ 20 s. Stage 64 is the col similarity mapPartitions. Total runtime ~ 4.6 mins. This shows the power of blocking in Spark, and I have not yet gone to gemv, which will decrease the runtime further. I updated the driver code in examples.mllib.MovieLensSimilarity rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557416#comment-14557416 ] Debasish Das commented on SPARK-2426: - [~mengxr] Should I add the PR to spark packages and close the JIRA ? The main contribution was to add sparsity constraints (L1 and probability simplex) to user and product factors in implicit and explicit feedback factorization and interested users can use the features from spark packages if they need...Later if there is community interest, we can pull it in to master ALS ? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551294#comment-14551294 ] Debasish Das commented on SPARK-6323: - The Petuum paper that got released today mentions going to larger topic sizes (~10-100K): http://arxiv.org/pdf/1412.1576v1.pdf Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from the Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now ! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are the same as above, except that we don't support affine + bounds, Aeq x = beq, lb <= x <= ub, yet. Most likely we don't need that for ML applications 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using a graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318 ] Debasish Das commented on SPARK-4823: - I opened up a PR that worked well for our datasets. It is still brute-force computation, although we use blocked cartesian and user defined kernels to cut computation and shuffle...There are trivial ideas to go from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to mllib BLAS, although I don't think that will give us the runtime boost we are looking for... We are looking into the approximate KNN family of algorithms to improve the runtime further...KDTree is good for dense vectors with few features, but for sparse vectors in higher dimensions researchers did not find it useful.. LSH seems to be most commonly used and that's the direction we are looking into. I looked into papers, but the one that showed good recall values in its experiments compared to brute force KNN is Google Correlate, and that's the validation strategy we will focus on: https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other references that you deem fit. There are twitter papers as well using LSH, and the implementation is available in algebird. We will start with algebird LSH, but ideally we don't want to have a distance metric hardcoded in LSH. If we get good recall using the LSH based method compared to the rowSimilarities code from the PR, we will use the LSH based method to approximately compute similarities between dense/sparse rows using the cosine kernel, dense userFactor/productFactor from factorization using the product kernel, and dense user/product factor similarities using the cosine kernel. The kernel abstraction is part of the current PR, and right now we support Cosine, Product, Euclidean and RBF. Pearson is something that's of interest but it's not added yet. For approximate row similarity I will open up a separate JIRA. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
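The kernel abstraction described in that comment could look roughly like the sketch below (names are illustrative, not the PR's actual API): similarity is computed through a pluggable kernel, so the brute-force flow and a future LSH flow can share the same metrics.

{code}
trait Kernel extends Serializable {
  def compute(x: Array[Double], y: Array[Double]): Double
}

object CosineKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    var dot = 0.0; var nx = 0.0; var ny = 0.0; var i = 0
    while (i < x.length) { dot += x(i) * y(i); nx += x(i) * x(i); ny += y(i) * y(i); i += 1 }
    dot / math.sqrt(nx * ny)
  }
}

class RBFKernel(gamma: Double) extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    var d2 = 0.0; var i = 0
    while (i < x.length) { val d = x(i) - y(i); d2 += d * d; i += 1 }
    math.exp(-gamma * d2)                    // exp(-gamma * ||x - y||^2)
  }
}
{code}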
[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4231: Affects Version/s: (was: 1.2.0) 1.4.0 Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.4.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das reopened SPARK-4231: - The code was not part of SPARK-3066 and so reopening... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
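A sketch of the proposed validation (not the reopened PR's code) using mllib's RankingMetrics: compare each user's top-k recommendations against the held-out items they actually interacted with, and report MAP.

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// heldOut: userId -> the item ids withheld for evaluation.
def meanAveragePrecision(model: MatrixFactorizationModel,
                         heldOut: RDD[(Int, Array[Int])], k: Int): Double = {
  val recs = model.recommendProductsForUsers(k)
    .mapValues(_.map(_.product))             // keep only the recommended item ids
  val predictionAndLabels = recs.join(heldOut).values
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
{code}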
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512843#comment-14512843 ] Debasish Das commented on SPARK-5992: - Did someone compare the algebird LSH with the spark minhash link above? Unless algebird is slow (which I found to be the case for the TopK monoid), should we use it the same way HLL is being used in Spark streaming? Is it ok to add algebird to mllib? Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
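For reference, the minhash scheme both libraries implement is small enough to sketch independently of algebird: signature(S)_i = min over x in S of h_i(x), with h_i(x) = (a_i * x + b_i) mod p; two sets collide on coordinate i with probability equal to their jaccard similarity.

{code}
import scala.util.Random

class MinHash(numHashes: Int, seed: Long) {
  private val p = 2147483647L                     // Mersenne prime 2^31 - 1
  private val rng = new Random(seed)
  private val ab = Array.fill(numHashes)(
    ((1 + rng.nextInt(Int.MaxValue - 1)).toLong,  // a_i != 0
     rng.nextInt(Int.MaxValue).toLong))           // b_i

  // One signature coordinate per hash function: the minimum hash over the set.
  def signature(set: Iterable[Int]): Array[Long] =
    ab.map { case (a, b) => set.map(x => (a * x + b) % p).min }
}
{code}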
[jira] [Commented] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484646#comment-14484646 ] Debasish Das commented on SPARK-3987: - @mengxr for this testcase it was fixed, but I remember there was someone on the user list who mentioned that he got an incorrect result compared to some other tool...maybe it's a good idea to ask for testcases... NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, ... (the remaining entries of the 20x20 gram matrix and the linear term are truncated in this digest)
[jira] [Comment Edited] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646 ] Debasish Das edited comment on SPARK-3987 at 4/8/15 3:31 AM: - [~mengxr] for this testcase it was fixed, but I remember someone on the user list mentioned that he got an incorrect result compared to some other tool...maybe it's a good idea to ask for testcases... was (Author: debasish83): @mengxr for this testcase it was fixed but I remember there was someone in user list who mentioned that he got incorrect result compared to some other tool...may be it's a good idea to ask for testcases... NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, ...) [the remaining entries of the 20x20 gram matrix, and the linear term, are truncated in this digest]
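For anyone trying to reproduce this class of NNLS failure outside Spark, here is a minimal projected-gradient sketch (illustrative only; it is not the mllib.optimization.NNLS implementation, and P/q stand in for the gram matrix and linear term above):

  import breeze.linalg.{DenseMatrix, DenseVector, trace}

  // Solve min 0.5 x'Px + q'x subject to x >= 0 by projected gradient.
  // P is assumed symmetric positive semi-definite (a gram matrix A'A),
  // so trace(P) upper-bounds its largest eigenvalue and 1/trace(P) is a
  // safe step size.
  def nnlsProjectedGradient(P: DenseMatrix[Double], q: DenseVector[Double],
                            iters: Int = 10000): DenseVector[Double] = {
    val step = 1.0 / trace(P)
    var x = DenseVector.zeros[Double](q.length)
    var i = 0
    while (i < iters) {
      val grad = P * x + q                         // gradient of the quadratic
      x = (x - grad * step).map(math.max(_, 0.0))  // step, then project onto x >= 0
      i += 1
    }
    x
  }

Running a reference solver like this next to mllib's NNLS on user-submitted gram matrices is one way to collect the testcases asked for above.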
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388128#comment-14388128 ] Debasish Das commented on SPARK-5564: - [~sparks] we are trying to access the EC2 dataset but it is giving an error: [ec2-user@ip-172-31-38-56 ~]$ aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/ A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied Could you please take a look at whether it is still available for use? Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das edited comment on SPARK-3066 at 4/1/15 4:28 AM: - Also unless the raw flow runs there is no way to validate how well an LSH based flow is doing...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in the coming weeks...we already have a JIRA for rowSimilarities... was (Author: debasish83): Also unless the raw flow runs there is no way to validate how good a LSH based flow is doing since users...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in coming weeks...we already have a JIRA for rowSimilarities... Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das commented on SPARK-3066: - Also unless the raw flow runs there is no way to validate how well an LSH based flow is doing...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in the coming weeks...we already have a JIRA for rowSimilarities... Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
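The description's recipe (broadcast one side, score with level-3 BLAS, take top-k) can be sketched roughly as follows; this is a hedged illustration with assumed shapes and names, not the eventual recommendAll implementation, and a production version would stack a block of users into a matrix so the multiply becomes a true gemm:

  import breeze.linalg.{DenseMatrix, DenseVector}
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Broadcast the (smaller) item side as a rank x numItems matrix and
  // score each user against all items with one matrix-vector multiply.
  def recommendAll(sc: SparkContext,
                   userFactors: RDD[(Int, Array[Double])],
                   itemFactors: Array[(Int, Array[Double])],
                   rank: Int, k: Int): RDD[(Int, Array[(Int, Double)])] = {
    val itemIds = itemFactors.map(_._1)
    val itemMat = new DenseMatrix(rank, itemFactors.length, itemFactors.flatMap(_._2))
    val bc = sc.broadcast((itemIds, itemMat))
    userFactors.mapPartitions { iter =>
      val (ids, items) = bc.value
      iter.map { case (userId, f) =>
        val scores = items.t * DenseVector(f)   // inner products with every item
        val topK = scores.toArray.zipWithIndex
          .map { case (s, j) => (ids(j), s) }
          .sortBy(-_._2).take(k)                // Utils.takeOrdered stands in for this sort
        (userId, topK)
      }
    }
  }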
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das commented on SPARK-5564: - Cool...I will run my experiments on the same dataset as well and report results... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 6:52 PM: -- Cool...I will run my experiments on the same dataset as well and report results...By the way my plan is to run 1000 sparse topics here...K will be 1000 but sparse, so we never shuffle more than 100-sparse vectors...For the sparsity experiments did you also add something specific? was (Author: debasish83): Cool...I will run my experiments on the same dataset as well and report results... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das commented on SPARK-5564: - [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking LDA and how they scale as we start scaling the topics? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.3.0) 1.4.0 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
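To make problem 2 above (ALS with L1 regularization) concrete, here is a rough ADMM sketch for min 0.5x'Hx + c'x + lambda*||x||_1; it is a toy illustration of the splitting idea, not the QuadraticMinimizer/Proximal API proposed for mllib.optimization:

  import breeze.linalg.{DenseMatrix, DenseVector, inv}

  // ADMM on the split min f(x) + g(z), x = z, with f the quadratic and
  // g(z) = lambda*||z||_1 handled by its proximal operator (soft threshold).
  def admmL1Qp(H: DenseMatrix[Double], c: DenseVector[Double],
               lambda: Double, rho: Double = 1.0,
               iters: Int = 200): DenseVector[Double] = {
    val n = c.length
    val hInv = inv(H + DenseMatrix.eye[Double](n) * rho)  // cache across solves
    def soft(v: DenseVector[Double], t: Double) =
      v.map(a => math.signum(a) * math.max(math.abs(a) - t, 0.0))
    var z = DenseVector.zeros[Double](n)
    var u = DenseVector.zeros[Double](n)
    for (_ <- 0 until iters) {
      val x = hInv * ((z - u) * rho - c)  // exact minimizer of the x-subproblem
      z = soft(x + u, lambda / rho)       // prox of lambda*||.||_1
      u = u + x - z                       // scaled dual update
    }
    z
  }

Swapping the soft threshold for a box projection gives problem 1 (ALS with bounds) under the same splitting.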
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 Isn't there a log term that multiplies with R to make it a KL divergence loss? Maybe the log term can be removed under non-negativity and normalization constraints? @mengxr any ideas here? If we can do that we can target the KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (does not need regression style datasets which is difficult to get in practical setup where people normally don't give any rating and satisfaction should be infered from viewing time etc) Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
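For reference, the loss in question from Lee and Seung's paper (my transcription; the "log term that multiplies with R" discussed above is the first term) is the generalized KL divergence between R and the reconstruction WH:

  D(R \| WH) = \sum_{ij} \left( R_{ij} \log \frac{R_{ij}}{(WH)_{ij}} - R_{ij} + (WH)_{ij} \right)

with the corresponding multiplicative update H_{aj} \leftarrow H_{aj} \frac{\sum_i W_{ia} R_{ij} / (WH)_{ij}}{\sum_i W_{ia}} keeping H nonnegative.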
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 Isn't there a log term that multiplies with R to make it a KL divergence loss? Maybe the log term can be removed under non-negativity and normalization constraints? @mengxr any ideas here? If we can do that we can target the KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221Is not there is log term that multiplies with R to make it a KL divergence loss ? May be the log term can removed under non-negative and normalization constraints ? @mengxr any ideas here ? If we can do that we can target KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (does not need regression style datasets which is difficult to get in practical setup where people normally don't give any rating and satisfaction should be infered from viewing time etc) Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378062#comment-14378062 ] Debasish Das commented on SPARK-6323: - I did some more reading and realized that even for small ranks, the current least square approach won't be able to handle either the KL divergence loss (unless there is an approximation that I am missing, more discussion on https://issues.apache.org/jira/browse/SPARK-2426) or the PLSA formulation (KL divergence loss with additional constraints)...Even for collaborative filtering with small ranks, this code will be useful... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
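The "10K ranks but 100-sparse factors" point in the description boils down to truncating each factor to its largest-magnitude entries before shuffling; a hedged sketch (the helper name and the use of mllib's SparseVector are my assumptions):

  import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

  // Keep only the s largest-magnitude entries of a dense factor
  // (s = 100 in the plan above) so shuffle size stays in check.
  def sparsify(factor: Array[Double], s: Int): SparseVector = {
    val topIdx = factor.zipWithIndex
      .sortBy { case (v, _) => -math.abs(v) }
      .take(s).map(_._2).sorted            // SparseVector wants ordered indices
    val values = topIdx.map(factor(_))
    Vectors.sparse(factor.length, topIdx, values).asInstanceOf[SparseVector]
  }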
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das commented on SPARK-2426: - [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ||.|| stands for Frobenius norm (or l1).", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
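Spelling out the quoted objective once in clean notation (the squared Frobenius norm and the row-wise normalization are my reading of the comment, not something stated explicitly):

  \min_{W, H \ge 0} \; \|R - WH\|_F^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)
  \quad \text{s.t.} \quad \textstyle\sum_k W_{ik} = 1 \;\; \forall i

i.e. l2 regularized least squares under non-negativity, with the factors normalized so they stay stochastic.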
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. || . || stands for Frobenius norm (or l1).", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ||.|| stands for Frobenius norm (or l1)., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. || . || stands for Frobenius norm (or l1)., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:13 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046 ] Debasish Das commented on SPARK-3735: - We might want to consider doing some of these things through an indexed RDD exposed through an API...right now ALS is completely join based...can we do something nicer if we have access to an efficient read-only cache from ALS mapPartitions...The idea here is to treat zeros explicitly rather than adding the implicit heuristic, which is generally hard to tune... Sending the factor directly or AtA based on the cost in ALS --- Key: SPARK-3735 URL: https://issues.apache.org/jira/browse/SPARK-3735 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is common to have some super popular products in the dataset. In this case, sending many user factors to the target product block could be more expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_ij` to the product block. The cost of sending a single factor is `k`, while the cost of sending a normal equation is much more expensive, `k * (k + 3) / 2`. However, if we use normal equation for all products associated with a user, we don't need to send this user factor. Determining the optimal assignment is hard. But we could use a simple heuristic. Inside any rating block, 1) order the product ids by the number of user ids associated with them in desc order 2) starting from the most popular product, mark popular products as "use normal eq" and calculate the cost Remember the best assignment that comes with the lowest cost and use it for computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
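A toy version of the cost comparison in the description (deliberately simplified: it scores each product independently by its user count and ignores the shared-factor savings that make the real assignment problem hard):

  // Shipping one user factor costs k doubles; shipping a normal equation
  // costs k * (k + 3) / 2 doubles (upper-triangular k x k gram plus a
  // k-vector), as stated above. Popular products should use the normal eq.
  def chooseNormalEq(usersPerProduct: Map[Int, Int], k: Int): Set[Int] = {
    val normalEqCost = k * (k + 3) / 2
    usersPerProduct.collect {
      case (product, numUsers) if numUsers.toLong * k > normalEqCost => product
    }.toSet
  }

For k = 10 the normal equation costs 65 doubles, so any product rated by more than 6 users in the block would flip to the normal-equation path under this crude rule.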
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375325#comment-14375325 ] Debasish Das commented on SPARK-2426: - [~acopich] From your comment "There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up.", could you help by adding some testcases and a driver for the PLSA approximation? The PR https://github.com/apache/spark/pull/3221 now has the LSA constraints and least square loss... The idea here is to enforce the probability simplex on the user side, bounds on the item side and normalization of the item columns at each ALS iteration...The MAP loss is tracked through https://issues.apache.org/jira/browse/SPARK-6323 but the solve idea will be very similar, as I mentioned before, so we can re-use the flow test-cases...We can discuss more on the PR...It will be great if you can help add examples.mllib.PLSA as well that will drive both PLSA through ALS and ALM (alternating MAP loss optimization)... Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
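The "probability simplex on the user side" step maps to the standard sort-based Euclidean projection onto { x : x >= 0, 1'x = 1 } (Duchi et al.); a minimal self-contained sketch:

  // Project v onto the probability simplex: find the threshold theta such
  // that x_i = max(v_i - theta, 0) sums to 1, via one descending sort.
  def projectSimplex(v: Array[Double]): Array[Double] = {
    val u = v.sorted(Ordering[Double].reverse)
    var cumSum = 0.0
    var theta = 0.0
    for (j <- u.indices) {
      cumSum += u(j)
      val t = (cumSum - 1.0) / (j + 1)
      if (u(j) - t > 0) theta = t  // last index where the support stays positive
    }
    v.map(x => math.max(x - theta, 0.0))
  }

The bounds on the item side are a plain clip to [lb, ub] and the column normalization is a rescale, so this projection is the only nontrivial proximal step in the iteration described above.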
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library:
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/16/15 6:30 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equation (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics now, which can easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet as the data moves from sparse to dense. With the factorization flow I will start to see results next week. These are the algorithm steps: 1. minimize f(w,h*) s.t. 1'w = 1, w >= 0 (row constraints) 2. minimize f(w*,h) s.t. 0 <= h <= 1 3. Normalize each column in h. Note that 2 and 3 are an approximation to the original matrix formulation but the column normalization makes the factor probabilistically well defined for the next PLSA iteration. f(w,h*) is the loglikelihood loss from the PLSA paper. I will start to look into the graphx based flow after that because in general that flow makes more sense for distributed nets where the objective is no longer separable like ALS-WR / PLSA. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure for papers on global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization Equation (2) and (3) 2. The original PLSA paper from Hoffman et al. 3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also come from sparse topics now which can easily range in ~ 10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in LDA framework but I don't think we know which flow will scale better yet as the data moves from sparse from dense. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into graphx based flow after that. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f (
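Steps 2 and 3 of the scheme above are simple enough to sketch in isolation (a hedged illustration on a Breeze matrix; solving the f(w*,h) subproblem itself is elided):

  import breeze.linalg.{DenseMatrix, sum}

  // Apply step 2's box constraint 0 <= h <= 1 and then step 3's column
  // normalization, so each column of h stays a probability distribution
  // for the next PLSA iteration.
  def boxAndNormalize(h: DenseMatrix[Double]): DenseMatrix[Double] = {
    val clipped = h.map(v => math.min(math.max(v, 0.0), 1.0))
    for (j <- 0 until clipped.cols) {
      val colSum = sum(clipped(::, j))
      if (colSum > 0.0) clipped(::, j) :/= colSum
    }
    clipped
  }

Step 1's simplex constraint on w is the sort-based projection shown earlier in this thread.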
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:29 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equation (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics now, which can easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet as the data moves from sparse to dense. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into the graphx based flow after that. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure for papers on global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization Equation (2) and (3) 2. The original PLSA paper from Hoffman et al. 3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also come from sparse topics now which can easily range in ~ 10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked Joseph if he thinks we should do it in LDA framework but I don't think we know which flow will scale better yet. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into graphx based flow after that.
Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:26 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics, which can now easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well (https://issues.apache.org/jira/browse/SPARK-5564) and I asked Joseph whether he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet. With the factorization flow I will start to see results next week, as the flow is designed to handle these ideas. I will start to look into the graphx based flow after that. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. For large rank matrix factorization I think the requirements come from sparse topics, which can now easily range to ~10K... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1.
Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)...
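The 100-sparse factors amount to keeping only the k largest-magnitude entries of each rank-10K factor before it is shuffled. A small sketch of that truncation (topKSparse is a hypothetical helper; Vectors.sparse is the real mllib factory):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Keep only the k largest-magnitude entries of a dense factor so the
    // shuffled representation stays k-sparse (e.g. rank 10K, k = 100).
    def topKSparse(factor: Array[Double], k: Int): Vector = {
      val topIdx: Array[Int] = factor.indices
        .sortBy(i => -math.abs(factor(i)))    // indices by |value|, descending
        .take(k)
        .sorted                               // Vectors.sparse wants increasing indices
        .toArray
      val values = topIdx.map(factor(_))
      Vectors.sparse(factor.length, topIdx, values)
    }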
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361981#comment-14361981 ] Debasish Das commented on SPARK-6323: - By the way, I can close the JIRA if it is not of broader interest to the community... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360956#comment-14360956 ] Debasish Das commented on SPARK-6323: - g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. For large rank matrix factorization I think the requirements come from sparse topics, which can now easily range to ~10K... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
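For reference, the proximal operators behind two of the constraints listed above reduce to one-liners; a sketch of the math they implement (plain Scala, not the Breeze Proximal classes themselves):

    // prox of lambda*||x||_1: soft-thresholding, shrinks each coordinate toward 0
    def proxL1(x: Array[Double], lambda: Double): Array[Double] =
      x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

    // prox of the indicator of {x : x >= 0}: projection onto the nonnegative orthant
    def proxPositive(x: Array[Double]): Array[Double] =
      x.map(v => math.max(v, 0.0))

The box and simplex constraints are handled the same way, with the corresponding projections substituted for these maps.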
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361005#comment-14361005 ] Debasish Das edited comment on SPARK-6323 at 3/13/15 7:48 PM: -- There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed. But yes, in all these formulations things break up like ALS, and that's why we can distribute the solves to spark workers... If the objective function does not break up, like a neural net (which is the natural extension for ALS), then we need parameter server type ideas for the solver... was (Author: debasish83): There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
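The splitting being referred to is the standard ADMM decomposition of min f(x) + g(z) subject to x = z, which needs exactly the two pieces the comment mentions: a smooth solve for x and a proximal step for z. A schematic sketch with both pieces left abstract (xUpdate and prox are placeholders, not Breeze API):

    // One ADMM loop for min f(x) + g(z) s.t. x = z, with scaled dual u.
    // xUpdate solves argmin_x f(x) + (rho/2)||x - (z - u)||^2 (e.g. a quadratic solve);
    // prox is the proximal operator of g with parameter 1/rho.
    def admm(xUpdate: Array[Double] => Array[Double],
             prox: Array[Double] => Array[Double],
             n: Int, iters: Int): Array[Double] = {
      var x = Array.fill(n)(0.0)
      var z = Array.fill(n)(0.0)
      var u = Array.fill(n)(0.0)
      for (_ <- 0 until iters) {
        x = xUpdate(z.zip(u).map { case (zi, ui) => zi - ui })          // x-minimization
        z = prox(x.zip(u).map { case (xi, ui) => xi + ui })             // z-minimization via prox
        u = u.zip(x.zip(z)).map { case (ui, (xi, zi)) => ui + xi - zi } // dual update
      }
      x
    }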
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361005#comment-14361005 ] Debasish Das commented on SPARK-6323: - There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse.
This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems:
[jira] [Created] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
Debasish Das created SPARK-6323: --- Summary: Large rank matrix factorization with Nonlinear loss and constraints Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization).
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse.
This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems:
[jira] [Resolved] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das resolved SPARK-4231. - Resolution: Duplicate Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for the movielens dataset, but after the addition of RankingMetrics and enhancements to ALS, it is critical to look not only at the RMSE but also at measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag indicating whether user/product recommendations are being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
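For the record, prec@k and MAP are exactly what mllib's RankingMetrics computes. A minimal sketch of the wiring such a validation flag would enable (the two input RDDs are hypothetical stand-ins for what examples.MovieLensALS would produce):

    import org.apache.spark.mllib.evaluation.RankingMetrics
    import org.apache.spark.rdd.RDD

    // predicted: per-user ranked product ids; actual: per-user relevant product ids.
    def rankingScores(predicted: RDD[(Int, Array[Int])],
                      actual: RDD[(Int, Array[Int])]): (Double, Double) = {
      val paired: RDD[(Array[Int], Array[Int])] =
        predicted.join(actual).values            // (ranked predictions, ground truth)
      val metrics = new RankingMetrics(paired)
      (metrics.precisionAt(10), metrics.meanAveragePrecision)  // prec@10 and MAP
    }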
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
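A minimal Breeze sketch of the smooth piece f(x) = 0.5x'Hx + c'x as a DiffFunction, solved unconstrained with LBFGS; the names quadratic, h and c are illustrative, and the ALM work described above would instead hand an objective like this to breeze.optimize.proximal.NonlinearMinimizer so that g(z) can be enforced:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, LBFGS}

// f(x) = 0.5 x'Hx + c'x with gradient Hx + c (H symmetric).
def quadratic(h: DenseMatrix[Double], c: DenseVector[Double]) =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val hx = h * x
      (0.5 * (x dot hx) + (c dot x), hx + c)
    }
  }

// Unconstrained solve with LBFGS; ALM would wrap the same objective with a
// proximal operator for g(z) instead of calling plain LBFGS.
val h = DenseMatrix.eye[Double](5)
val c = DenseVector.fill(5)(-1.0)
val xOpt = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
  .minimize(quadratic(h, c), DenseVector.zeros[Double](5))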
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359892#comment-14359892 ] Debasish Das commented on SPARK-3066: - We use the non-level-3 BLAS code in our internal flows with ~60M x 3M datasets...runtime is decent...I am moving to level-3 BLAS for SPARK-4823 and I think the speed will improve further
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
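A sketch of the three-step batch pattern listed in the issue description, under assumed factor layouts (RDD[(Int, Array[Double])]); recommendAll, rank and k are illustrative names, and the per-user product below is a gemv, so a true level-3 version would additionally block users into matrices:

import breeze.linalg.{argtopk, DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

def recommendAll(
    userFeatures: RDD[(Int, Array[Double])],
    productFeatures: RDD[(Int, Array[Double])],
    rank: Int,
    k: Int): RDD[(Int, Array[(Int, Double)])] = {
  // 1) collect one side (products) and broadcast it as a rank x numProducts matrix
  val products = productFeatures.collect()
  val ids = products.map(_._1)
  val mat = new DenseMatrix(rank, products.length, products.flatMap(_._2))
  val bc = userFeatures.sparkContext.broadcast((ids, mat))
  userFeatures.map { case (user, f) =>
    val (pids, pMat) = bc.value
    // 2) inner products of this user against every product column
    val scores = pMat.t * new DenseVector(f)
    // 3) keep only the top-k, as Utils.takeOrdered would
    (user, argtopk(scores, k).map(i => (pids(i), scores(i))).toArray)
  }
}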
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351839#comment-14351839 ] Debasish Das commented on SPARK-2426: - [~mengxr] NNLS and QuadraticMinimizer are both merged to Breeze...I will migrate ml.recommendation.ALS accordingly...
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351945#comment-14351945 ] Debasish Das commented on SPARK-3066: - [~josephkb] do you mean kNN? For recommendation, until you do the dot products I am not sure how you can find the top-k...level-3 BLAS will definitely give a big boost since it's all blocked dense-with-dense multiplication...For https://issues.apache.org/jira/browse/SPARK-4823 I am looking into dense-dense BLAS and dense-sparse BLAS...ideally there we can add a kNN based optimization followed by the row similarity calculation
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das commented on SPARK-4823: - [~mengxr] I need level-3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support a BLAS-3 based dense matrix x CSCMatrix product? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib-java BLAS from mllib and update the results on the JIRA...
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
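For reference, the dense x dense level-3 call being discussed, issued through netlib-java (the library mllib's BLAS wrapper also goes through); the 2x3 / 3x2 shapes are a toy example and arrays are column-major as in Fortran. Dense x sparse CSC has no such BLAS-3 kernel, which is why the question arises:

import com.github.fommil.netlib.BLAS

val m = 2; val n = 2; val k = 3
val a = Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0) // 2x3, column-major
val b = Array(1.0, 0.0, 1.0, 0.0, 1.0, 1.0) // 3x2, column-major
val c = new Array[Double](m * n)
// C := 1.0 * A * B + 0.0 * C
BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)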
[jira] [Comment Edited] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das edited comment on SPARK-4823 at 3/8/15 6:42 AM: - [~mengxr] I need level 3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675. Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib java BLAS from mllib and update the results on the JIRA... was (Author: debasish83): [~mengxr] I need level 3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib java BLAS from mllib and update the results on the JIRA... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This is JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows ( 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:41 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go ~ 10K range) and so I am keen to compare the exact formulation in graphx based LDA flow... Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... I am assuming here that LDA architecture is a bipartite graph with nodes as docs/words and there are counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from it's edges.. was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go to the ~10K range, sparse) and so I am keen to compare the exact formulation in the graphx based LDA flow... The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least square and MAP losses... I am assuming here that the LDA architecture is a bipartite graph with nodes as docs/words and counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from its edges... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go to the ~10K range) and so I am keen to compare the exact formulation in the graphx based LDA flow... The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least square and MAP losses... I am assuming here that the LDA architecture is a bipartite graph with nodes as docs/words and counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from its edges...
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta).
For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
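In standard notation, the two alternating subproblems referenced throughout this thread are:

\begin{aligned}
&\min_{w}\; f(w, h^{*}) \quad \text{s.t.}\; \mathbf{1}^{\top} w = 1,\; w \ge 0 \\
&\min_{h}\; f(w^{*}, h) \quad \text{s.t.}\; 0 \le h \le 1, \text{ followed by normalizing each column of } h
\end{aligned}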
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das commented on SPARK-5564: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in the graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses...
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:20 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:19 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I will keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342312#comment-14342312 ] Debasish Das commented on SPARK-5564: - By the way, the following step is an approximation to the real constraint, but if we get good results over Gibbs sampling based approaches, there are ways to solve the real problem as well... minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
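The row constraint 1'w = 1, w >= 0 is a probability-simplex constraint; a self-contained sketch of the Euclidean projection onto it (the sort-based algorithm of Duchi et al. 2008, independent of the Breeze PR above) that an M-step projection could use:

def projectSimplex(v: Array[Double]): Array[Double] = {
  // Sort descending, then find the threshold theta so that
  // w_i = max(v_i - theta, 0) sums to 1.
  val u = v.sorted(Ordering[Double].reverse)
  var cum = 0.0
  var theta = 0.0
  for (j <- u.indices) {
    cum += u(j)
    val t = (cum - 1.0) / (j + 1)
    if (u(j) - t > 0) theta = t
  }
  v.map(x => math.max(x - theta, 0.0))
}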
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302932#comment-14302932 ] Debasish Das commented on SPARK-2426: - [~mengxr] [~coderxiang] David is out in Feb and I am not sure if we can cut a breeze release with the code. I refactored NNLS to breeze.optimize.linear due to its similarity to the CG core. Proximal algorithms and QuadraticMinimizer are refactored to breeze.optimize.proximal. It will be great if you could also review the PR https://github.com/scalanlp/breeze/pull/321. With this solver added to Breeze I am ready to add the ALS modifications to Spark. The test-cases for default ALS and NNLS run fine with my Spark PR. I need to add appropriate test-cases for sparse coding and least square loss with LSA constraints as explained above. Should I add them to ml.als or mllib.als, since we now have two codebases? My current PR will merge fine with mllib.als but not with ml.als. I see there is a CholeskySolver, but all those features are supported in breeze.optimize.proximal.QuadraticMinimizer.
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026 ] Debasish Das commented on SPARK-4675: - Is there a metric like MAP / AUC kind of measure that can help us validate similarUsers and similarProducts ? Right now if I run column similarities with sparse vector on matrix factorization datasets for product similarities, it will assume all unvisited entries (which should be ?) as 0 and compute column similarities for...If the sparse vector has ? in place of 0 then basically all similarity calculation is incorrect...so in that sense it makes more sense to compute the similarities on the matrix factors... But then we are back to map-reduce calculation of rowSimilarities. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048 ] Debasish Das commented on SPARK-4823: - [~srowen] did you implement map-reduce row similarities for user factors ? What's the algorithm that you used ? Any pointers will be really helpful... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This is JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows ( 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149 ] Debasish Das commented on SPARK-2426: - [~mengxr] as per our discussion, QuadraticMinimizer and NNLS are both added to breeze and updated with breeze DenseMatrix and DenseVector...Inside breeze I did some interesting comparisons and that motivated me to port NNLS to breeze as well...I added all the test cases for QuadraticMinimizer and NNLS based on my experiments with the MovieLens dataset... Here is the PR: https://github.com/scalanlp/breeze/pull/321 To run the quadratic programming variants in breeze: runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99 regParam = 0.1; beta = 0.99 is the elastic net parameter. It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to the PDCO QP generator (please look into my Matlab examples): 0.5x'Hx + c'x s.t Ax = B, lb <= x <= ub
1. Unconstrained minimization: breeze luSolve, cg and qp (dposv added to breeze through this PR). Minimize 0.5x'Hx + c'x
||qp - lu|| norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10
||cg - lu|| norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7
dim 100: lu 86.007 qp 41.56 cg 102.627
||qp - lu|| norm 4.267891623199082E-8 max-norm 6.681460718027665E-9
||cg - lu|| norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8
dim 500: lu 169.993 qp 78.001 cg 443.044
qp is faster than cg for smaller dimensions, as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it.
2. Elastic net formulation: 0.5 x'Hx + c'x + (1-beta)*L2(x) + beta*regParam*L1(x)
beta = 0.99 (strong L1), regParam = 0.1: ||owlqn - sparseqp|| norm 0.1653200701235298 inf-norm 0.051855911945906996; sparseQp 61.948 ms iters 227; owlqn 928.11 ms
beta = 0.5 (average L1), regParam = 0.1: ||owlqn - sparseqp|| norm 0.15823773098501168 inf-norm 0.035153837685728107; sparseQp 69.934 ms iters 353; owlqn 882.104 ms
beta = 0.01 (mostly BFGS), regParam = 0.1: ||owlqn - sparseqp|| norm 0.17950035092790165 inf-norm 0.04718697692014828; sparseQp 80.411 ms iters 580; owlqn 988.313 ms
The ADMM based proximal formulation is faster for smaller dimensions. Even as I scale the dimension, I notice similar behavior: owlqn takes longer to converge and the results are not the same. Look for example at the dim = 500 case: ||owlqn - sparseqp|| norm 10.946326189397649 inf-norm 1.412726586317294; sparseQp 830.593 ms iters 2417; owlqn 19848.932 ms I validated ADMM through Matlab scripts, so there is something funky going on in OWLQN.
3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0 Here I compared the ADMM based proximal formulation with the CG based projected gradient in NNLS. NNLS converges much more nicely, but the convergence criteria do not look the same as breeze CG although they should be. For now I ported it to breeze, and we can call NNLS for x >= 0 and QuadraticMinimizer for the other formulations.
dim = 100: posQp 16.367 ms iters 284; nnls 8.854 ms iters 107
dim = 500: posQp 303.184 ms iters 950; nnls 183.543 ms iters 517
NNLS on average looks 2X faster!
4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub Validated through the Matlab scripts above. Here are the runtime numbers:
dim = 100: boundsQp 15.654 ms iters 284 converged true
dim = 500: boundsQp 311.613 ms iters 950 converged true
5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >= 0 Validated through the Matlab scripts above.
Here are the runtime numbers:
dim = 100: Qp Equality 13.64 ms iters 184 converged true
dim = 500: Qp Equality 278.525 ms iters 890 converged true
With this change all copyrights are moved to breeze. Once it merges, I will update the Spark PR. With this change we can move the ALS code to Breeze DenseMatrix and DenseVector as well. My focus next will be to get a truncated Newton method running for convex cost, since convex cost is required for PLSA, SVM and Neural Net formulations... I am still puzzled why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking more into it.
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime
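For the elastic net runs above, the OWLQN side can be reproduced in Breeze by folding the L2 term into the smooth objective and letting OWLQN own the L1 term; a sketch, assuming the OWLQN(maxIter, m, l1reg) convenience constructor and with h and c as illustrative stand-ins for the gram matrix and linear term:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, OWLQN}

val regParam = 0.1; val beta = 0.99
val n = 100
val h = DenseMatrix.eye[Double](n)  // stand-in for the gram matrix H
val c = DenseVector.fill(n)(-1.0)   // stand-in for the linear term c
val l2 = (1 - beta) * regParam

// Smooth part: 0.5 x'Hx + c'x + 0.5 * l2 * ||x||^2; OWLQN adds beta*regParam*L1.
val smooth = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val hx = h * x
    (0.5 * (x dot hx) + (c dot x) + 0.5 * l2 * (x dot x), hx + c + x * l2)
  }
}
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 7, beta * regParam)
val xOpt = owlqn.minimize(smooth, DenseVector.zeros[Double](n))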
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207 ] Debasish Das commented on SPARK-4823: - Even for matrix factorization, userFactors are user x rank...with a modest rank of 50 and users at 10M, I don't think it is possible to transpose the matrix and run column similarities...doing it on the fly is still O(n*n) complexity-wise, right...
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456 ] Debasish Das commented on SPARK-2426: - [~akopich] I got good MAP results on recommendation datasets with the approximated PLSA formulation. I did not get time to compare that formulation with Gibbs sampling based LDA PR: https://issues.apache.org/jira/browse/SPARK-1405 yet. Did you compare them ? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241535#comment-14241535 ] Debasish Das commented on SPARK-4675: - There are a few issues: 1. Batch API for topK similar users and topK similar products 2. Comparison of product x product similarities generated with columnSimilarities against topK similar products I added batch APIs for topK product recommendation for each user and topK user recommendation for each product in SPARK-4231...a similar batch API will be very helpful for topK similar users and topK similar products... I agree with Cosine Similarity...you should be able to re-use the column similarity calculations...I think a better idea is to add rowMatrix.similarRows and re-use that code to generate product similarities and user similarities... But my question is more on validation. We can compute product similarities on raw features and we can compute product similarities on the matrix product factors...which one is better?
Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242031#comment-14242031 ] Debasish Das commented on SPARK-4823: - I am considering coming up with a baseline version that's very close to brute force, but we cut the computation with a topK number...for each user come up with topK users, where K is defined by the client...this will take care of the matrix factorization use-case... Basically on the master we collect a set of user factors, broadcast it to every node and do a reduceByKey to generate topK users for each user from this user block...We send a kernel function (cosine / polynomial / rbf) into this calculation... But this idea does not work for raw features, right...If we map features to a lower dimension using factorization then this approach should run fine...but I am not sure if we can ask users to map their data into a lower dimension... Is it possible to bring in ideas from fastfood and kitchen sinks to do this?
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
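A sketch of that baseline: broadcast the collected factors, score every pair under a pluggable kernel, and keep topK per user. The brute-force O(n^2) scan and the kernel choices are as described in the comment; the types and the similarUsers name are illustrative:

import org.apache.spark.rdd.RDD

type Kernel = (Array[Double], Array[Double]) => Double

val cosine: Kernel = (a, b) => {
  var dot = 0.0; var na = 0.0; var nb = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1 }
  dot / math.sqrt(na * nb)
}

def similarUsers(
    factors: RDD[(Int, Array[Double])],
    kernel: Kernel,
    topK: Int): RDD[(Int, Array[(Int, Double)])] = {
  // Brute-force baseline: every factor is shipped to every node.
  val all = factors.sparkContext.broadcast(factors.collect())
  factors.map { case (u, f) =>
    val scored = all.value.iterator
      .filter(_._1 != u)
      .map { case (v, g) => (v, kernel(f, g)) }
      .toArray
    (u, scored.sortBy(-_._2).take(topK))
  }
}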
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242034#comment-14242034 ] Debasish Das commented on SPARK-4675: - [~josephkb] how do we validate that low dimension space is giving more meaningful similarities than the feature space (which is sparse) ? Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024 ] Debasish Das commented on SPARK-1405: - We need a larger dataset as well, where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is a broadcast of factors at each step...NIPS dataset is small...would you guys be willing to test a wikipedia dataset for example? If there is a pre-processed version from either mahout or scikit-learn, can we use that?
parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM: --- We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each stepNIPS dataset is small...Let's start with that...But we should test a large dataset like wikipedia as well..If there is a pre-processed version from either mahout or scikit-learn we can use that ? was (Author: debasish83): We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each stepNIPS dataset is small...you guy's will be willing to test a wikipedia dataset for example ? If there is a pre-processed version from either mahout or scikit-learn we can use that ? parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222027#comment-14222027 ] Debasish Das commented on SPARK-1405: - [~pedrorodriguez] did you write the metric in your repo as well ? That way I don't have to code it up again.. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222089#comment-14222089 ] Debasish Das commented on SPARK-1405: - NIPS dataset is common for PLSA and additive regularization based matrix factorization formulations as well since the experiments in this paper focused on the NIPS dataset as well... http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf I will be using NIPS dataset for quality experiments but for scaling experiments, wiki data is good...wiki data was demo-ed by Databricks in last spark summit...it will be great if we can get it from that demo parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108 ] Debasish Das commented on SPARK-1405: - @sparks that will be awesome...I should be fine running experiments on EC2... parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM: --- [~sparks] that will be awesome...I should be fine running experiments on EC2... was (Author: debasish83): @sparks that will be awesome...I should be fine running experiments on EC2... parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221379#comment-14221379 ] Debasish Das commented on SPARK-3066: - I did experiments on the MovieLens dataset with varying rank on my localhost spark (4 GB RAM, 4 cores) to see how much MAP improvement we get as the rank is scaled. Every run got 1000209 ratings from 6040 users on 3706 movies; the train/test split varies slightly per run:
rank | training | test   | Test RMSE          | test users | MAP                  | runtime
10   | 799747   | 200462 | 0.8528377625407709 | 6036       | 0.03851426277536059  | 30s
25   | 800417   | 199792 | 0.8518001349769724 | 6037       | 0.04508057348514959  | 30s
50   | 800823   | 199386 | 0.8487416471685229 | 6038       | 0.05145126538369158  | 42s
100  | 800720   | 199489 | 0.8508095863317275 | 6033       | 0.0561225429735388   | 1.5m
150  | 800257   | 199952 | 0.8435902056186158 | 6035       | 0.05855252471878818  | 3.6m
200  | 800356   | 199853 | 0.8452385688272382 | 6037       | 0.059176892052172934 | 7.4m
(rank=10 is the default.) I will run through the MovieLens10m and Netflix datasets and generate the numbers with varying ranks as well. I need to run them on a cluster.
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221505#comment-14221505 ] Debasish Das edited comment on SPARK-1405 at 11/21/14 10:28 PM: [~witgo] where can I access your dataset? I got the NIPS dataset from Pedro, but the runtimes reported here are on a different dataset...also, should we use the same accuracy measure that Pedro is using?
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
>                  Key: SPARK-1405
>                  URL: https://issues.apache.org/jira/browse/SPARK-1405
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xusen Yin
>             Assignee: Guoqiang Li
>             Priority: Critical
>               Labels: features
>          Attachments: performance_comparison.png
>    Original Estimate: 336h
>   Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219799#comment-14219799 ] Debasish Das commented on SPARK-4231: - [~srowen] I added batch predict APIs for user and product recommendation in the PR (both batch and mini-batch APIs that take a per-user topK for products). For the MAP calculation code, I have still kept it in examples.MovieLensALS in this PR, but I feel it should be part of MatrixFactorizationModel as well, where a client can send an RDD of userIDs and the list of products they don't want recommended in the topK...I will add this API to MatrixFactorizationModel if it makes sense...
> Add RankingMetrics to examples.MovieLensALS
> ---
>
>                  Key: SPARK-4231
>                  URL: https://issues.apache.org/jira/browse/SPARK-4231
>              Project: Spark
>           Issue Type: Improvement
>           Components: Examples
>     Affects Versions: 1.2.0
>             Reporter: Debasish Das
>              Fix For: 1.2.0
>    Original Estimate: 24h
>   Remaining Estimate: 24h
>
> examples.MovieLensALS computes RMSE for the MovieLens dataset, but after the addition of RankingMetrics and enhancements to ALS, it is critical to look not only at RMSE but also at measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation to examples.MovieLensALS and also added a flag that indicates whether user/product recommendation is being validated.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
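For reference, the MAP and prec@k measures discussed here map directly onto mllib's RankingMetrics; a minimal sketch, assuming per-user predicted rankings and held-out relevant items are already available as RDDs (the function and parameter names are mine):

import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// predictions: per-user product ids ranked by predicted score (e.g. topK recommendations);
// groundTruth: per-user relevant product ids from the held-out test set.
def meanAveragePrecision(predictions: RDD[(Int, Array[Int])],
                         groundTruth: RDD[(Int, Array[Int])]): Double = {
  // Pair each user's ranked predictions with that user's relevant set.
  val predictionAndLabels = predictions.join(groundTruth).values
  val metrics = new RankingMetrics(predictionAndLabels)
  metrics.meanAveragePrecision // metrics.precisionAt(k) gives prec@k
}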
[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667 ] Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM: [~mengxr] as per our discussions, I added APIs for batch user and product recommendation and MAP computation for recommending topK products per user. Note that I don't use reservoir sampling; instead I used your idea of filtering out the test set users for which no model was built...I thought reservoir sampling should be part of a separate PR.

APIs added:
recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every user
recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product

I used mllib BLAS for all the computation and cleaned the DoubleMatrix code out of MatrixFactorizationModel. I have not used level-3 BLAS yet; I can add that as well if the rest of the flow makes sense.

On examples.MovieLensALS we can activate the user MAP calculation using the --validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 --validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. Training: 799617, test: 200592. Test RMSE = 0.8495476608536306. Test users 6032, MAP 0.03798337814233403

I will run the Netflix dataset and report the MAP measures for that. On our internal datasets, I have tested 1M users and 10K products on 120 cores with 240 GB, computing topK users for each product; that takes around 5 mins, and on average I generate a ranked list of 6000 users per product. Internally we are using the batch API recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product.
> Support recommendAll in matrix factorization model
> --
>
>                  Key: SPARK-3066
>                  URL: https://issues.apache.org/jira/browse/SPARK-3066
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimizations: 1) collect one side (either user or product) and broadcast it as a matrix, 2) use level-3 BLAS to compute inner products, 3) use Utils.takeOrdered to find the top-k.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667 ] Debasish Das commented on SPARK-3066: - @mengxr as per our discussions, I added APIs for batch user and product recommendation and MAP computation for recommending topK products per user. Note that I don't use reservoir sampling; instead I used your idea of filtering out the test set users for which no model was built...I thought reservoir sampling should be part of a separate PR.

APIs added:
recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every user
recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product

I used mllib BLAS for all the computation and cleaned the DoubleMatrix code out of MatrixFactorizationModel. I have not used level-3 BLAS yet; I can add that as well if the rest of the flow makes sense.

On examples.MovieLensALS we can activate the user MAP calculation using the --validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 --validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. Training: 799617, test: 200592. Test RMSE = 0.8495476608536306. Test users 6032, MAP 0.03798337814233403

I will run the Netflix dataset and report the MAP measures for that. On our internal datasets, I have tested 1M users and 10K products on 120 cores with 240 GB, computing topK users for each product; that takes around 5 mins, and on average I generate a ranked list of 6000 users per product. Internally we are using the batch API recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product.
> Support recommendAll in matrix factorization model
> --
>
>                  Key: SPARK-3066
>                  URL: https://issues.apache.org/jira/browse/SPARK-3066
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimizations: 1) collect one side (either user or product) and broadcast it as a matrix, 2) use level-3 BLAS to compute inner products, 3) use Utils.takeOrdered to find the top-k.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
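A sketch of how the batch APIs above might be driven end to end, assuming an active SparkContext sc (e.g. in spark-shell), the MovieLens ratings.dat layout, and the RDD-based signatures exactly as proposed in this PR (they may differ from what was eventually merged; the per-product topK policy is purely illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def batchRecommend(sc: SparkContext): Unit = {
  // Parse MovieLens-style "user::product::rating" lines.
  val ratings: RDD[Rating] = sc.textFile("hdfs://localhost:8020/sandbox/movielens/ratings.dat")
    .map(_.split("::"))
    .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))
  val model = ALS.train(ratings, 100 /* rank */, 10 /* iterations */, 0.065 /* lambda */)

  // Fixed topK for all users (signature as proposed in the PR).
  val top10PerUser = model.recommendProductsForUsers(10)

  // Variable topK per product, e.g. capped by each product's rating count
  // (an illustrative policy, not one from the PR).
  val productTopK: RDD[(Int, Int)] =
    ratings.map(r => (r.product, 1)).reduceByKey(_ + _).mapValues(math.min(_, 6000))
  val usersPerProduct = model.recommendUsersForProducts(productTopK)

  println(s"users covered: ${top10PerUser.count()}, products covered: ${usersPerProduct.count()}")
}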
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218845#comment-14218845 ] Debasish Das commented on SPARK-1405: - I would like to compare the LSA formulations (sparse coding and PLSA with least square loss) from https://issues.apache.org/jira/browse/SPARK-2426 with LDA. I added a MAP metric for examples.MovieLensALS in https://issues.apache.org/jira/browse/SPARK-4231, but I am not sure how MAP can be used for topic modeling datasets...we need some perplexity measure... Could you guys please point me to the dataset and the quality measures that are being benchmarked on the LDA PR, so that I can also test the LSA formulations in parallel?
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
>                  Key: SPARK-1405
>                  URL: https://issues.apache.org/jira/browse/SPARK-1405
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xusen Yin
>             Assignee: Guoqiang Li
>             Priority: Critical
>               Labels: features
>          Attachments: performance_comparison.png
>    Original Estimate: 336h
>   Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
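Since MAP doesn't transfer to topic models, the perplexity measure asked for above would look roughly like the following; a minimal sketch assuming per-token held-out log-likelihoods are computed elsewhere by the model (the function name is mine):

import org.apache.spark.rdd.RDD

// Held-out perplexity = exp( -(sum of per-token log-likelihoods) / (token count) ).
// tokenLogLikelihood: log p(w | model) for each held-out token.
def perplexity(tokenLogLikelihood: RDD[Double]): Double = {
  val (sumLL, numTokens) = tokenLogLikelihood
    .map(ll => (ll, 1L))
    .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  math.exp(-sumLL / numTokens)
}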
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218891#comment-14218891 ] Debasish Das commented on SPARK-2426: - With the MAP measures added to examples.MovieLensALS through https://issues.apache.org/jira/browse/SPARK-4231, I compared the quality and runtime of the matrix completion formulations on the MovieLens 1M dataset:

Default: userConstraint L2, productConstraint L2, lambdaUser=lambdaProduct=0.065, rank=100, iterations=10
Test RMSE = 0.8436480113821955. Test users 6038, MAP 0.05860164548002782
Solver: Cholesky decomposition followed by forward-backward solves

Per-iteration runtime for the baseline (solveTime in ms):
14/11/19 17:37:06 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 362.813 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 314.527 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 265.75 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 271.513 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 370.177 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 467.994 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 511.894 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 481.189 Iters 0

NMF: userConstraint POSITIVE, productConstraint POSITIVE, userLambda=productLambda=0.065 L2 regularization
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization, userConstraint POSITIVE, productConstraint POSITIVE
Test RMSE = 0.8435335132641906. Test users 6038, MAP 0.056361816590625446

ALS iteration 1 runtime, QuadraticMinimizer convergence profile:
14/11/19 17:46:46 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1936.281 Iters 73132
14/11/19 17:46:46 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1871.364 Iters 75219
14/11/19 17:46:46 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 2067.735 Iters 73180
14/11/19 17:46:46 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 2127.161 Iters 75546
14/11/19 17:46:53 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 3813.923 Iters 193207
14/11/19 17:46:54 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 3894.068 Iters 196882
14/11/19 17:46:54 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 3875.915 Iters 193987
14/11/19 17:46:54 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 3939.765 Iters 192471

NNLS convergence profile:
14/11/19 17:46:46 INFO ALS: NNLS solveTime 252.909 iters 7381
14/11/19 17:46:46 INFO ALS: NNLS solveTime 256.803 iters 7740
14/11/19 17:46:46 INFO ALS: NNLS solveTime 274.352 iters 7491
14/11/19 17:46:46 INFO ALS: NNLS solveTime 272.971 iters 7664
14/11/19 17:46:53 INFO ALS: NNLS solveTime 1487.262 iters 60338
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1472.742 iters 61321
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1489.863 iters 62228
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1494.192 iters 60489

ALS iteration 10, QuadraticMinimizer convergence profile:
14/11/19 17:48:17 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 1082.056 Iters 53724
14/11/19 17:48:17 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 1180.601 Iters 50593
14/11/19 17:48:17 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1106.131 Iters 53069
14/11/19 17:48:17 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1108.478 Iters 50895
14/11/19 17:48:23 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 2262.193 Iters 116818
14/11/19 17:48:23 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 2293.64 Iters 116026
14/11/19 17:48:23 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 2241.491 Iters 116293
14/11/19 17:48:23 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 2372.957 Iters 118391

NNLS convergence profile:
14/11/19 17:48:17 INFO ALS: NNLS solveTime 623.031 iters 21611
14/11/19 17:48:17 INFO ALS: NNLS solveTime 553.493 iters 21732
14/11/19 17:48:17 INFO ALS: NNLS solveTime 559.9 iters 22511
14/11/19 17:48:17 INFO ALS: NNLS solveTime 556.654 iters 21330
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1672.582 iters 86006
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1703.221 iters 85824
14/11/19 17:48:23 INFO ALS: NNLS
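For comparison, the NNLS baseline profiled above corresponds to what stock MLlib ALS does when nonnegativity is enabled; a minimal sketch with the settings from this comment, other settings left at their defaults (the function name is mine, and the claim that setNonnegative routes through NNLS is my reading of the mllib code, not something stated in this thread):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// NMF-style run: nonnegativity on both user and product factors.
def trainNonnegativeAls(ratings: RDD[Rating]) =
  new ALS()
    .setRank(100)         // rank used in the experiments above
    .setIterations(10)
    .setLambda(0.065)     // matches lambdaUser = lambdaProduct = 0.065
    .setNonnegative(true) // POSITIVE constraint on both factor matrices
    .run(ratings)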