[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728020#comment-16728020 ] Debasish Das commented on SPARK-24374: -- Hi [~mengxr], with barrier mode available, is it not possible to use the native TF parameter server in place of MPI? Although we are offloading compute from Spark to TF workers/ps, if an exception comes out, tracking it with the native TF API might be easier than an MPI exception...great work by the way...I was looking for a cloud-ml alternative using Spark over AWS/Azure/GCP, and it looks like barrier should help a lot, although I am still not clear on the limitations of the TensorFlowOnSpark project from Yahoo [https://github.com/yahoo/TensorFlowOnSpark], which tried to put in barrier-like syntax; I am not sure whether it re-runs the full job or only the failed partitions when a few partitions fail on tfrecord read / communication exceptions...I guess the exceptions from the failed partitions can be thrown back to the Spark driver, and the driver can take the action to re-run...when multiple TF training jobs get scheduled on the same Spark cluster, I suspect TFoS might have issues as well... > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
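For concreteness, the barrier API this SPIP produced (Spark 2.4: RDD.barrier() and BarrierTaskContext) lets a stage host a gang-scheduled training job roughly as in the sketch below; startTraining is a hypothetical hook for the user's TF/MPI worker code, not a Spark API.

{code}
import org.apache.spark.{BarrierTaskContext, SparkContext}

def startTraining(rank: Int, peers: Array[String]): Unit = {
  // hypothetical: launch the TF worker / parameter server for this rank here
}

def runBarrierTraining(sc: SparkContext, numWorkers: Int): Unit = {
  sc.parallelize(0 until numWorkers, numWorkers)
    .barrier()                                      // all tasks launch together or not at all
    .mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      val peers = ctx.getTaskInfos().map(_.address) // addresses of every task in the stage
      ctx.barrier()                                 // global sync point before training starts
      startTraining(ctx.partitionId(), peers)
      iter
    }
    .count()  // if any task fails, Spark aborts and retries the whole stage
}
{code}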
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809876#comment-15809876 ] Debasish Das commented on SPARK-10078: -- I looked into the code and I see we are replicating the Breeze BFGS and OWLQN core logic in this PR: https://github.com/yanboliang/spark-vlbfgs/tree/master/src/main/scala/org/apache/spark/ml/optim. We can provide a DiffFunction interface that works on a feature partition and add the VL-BFGS paper logic as a refactoring of the current Breeze BFGS code... A DiffFunction can then run with either a DistributedVector or a Vector. That helps because even with features < 100M we can run multi-core VL-BFGS by using multiple partitions, and an if-else switch is not necessary. I can provide breeze interfaces based on your PR if you agree with the idea. BFGS and OWLQN are a few variants, but Breeze has several constrained solvers that use the BFGS code... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
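To make the proposed refactoring concrete, here is a hedged sketch (not the PR's code) of a breeze DiffFunction whose vector type is generic, so the same LBFGS call could in principle be instantiated with a local DenseVector or a DistributedVector, given the implicit vector-space operations breeze requires; the quadratic loss is a toy.

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy loss: 0.5 * ||x - target||^2; calculate returns (objective, gradient).
val target = DenseVector(1.0, 2.0)
val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val r = x - target
    (0.5 * (r dot r), r)
  }
}

// A VL-BFGS variant would instantiate LBFGS[DistributedVector] instead,
// provided the distributed type supplies the algebra breeze needs.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
val xOpt = lbfgs.minimize(loss, DenseVector.zeros[Double](2))
{code}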
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793770#comment-15793770 ] Debasish Das commented on SPARK-10078: -- [~mengxr] [~dlwh] is it possible to implement VL-BFGS as part of breeze, so that OWLQN, LBFGS, LBFGS-B and proximal.NonlinearMinimizer all benefit from it? We can bring it in the way we bring in LBFGS/OWLQN right now...If it makes sense, I can look at the design doc and propose a breeze interface to abstract the RDD details... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793760#comment-15793760 ] Debasish Das edited comment on SPARK-10078 at 1/3/17 12:26 AM: --- Ideally feature partitioning should be automatically tuned...at 100M features, the master-only processing we do today with Breeze LBFGS / OWLQN will also benefit from VL-BFGS...Ideally it should be part of breeze, and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML...There are bounded BFGS variants in breeze as well...all of them will benefit from this change. The solver can then be used in other frameworks as well and need not be constrained to RDDs if possible... was (Author: debasish83): Ideally feature partitioning should be automatically tuned...at 100M features master only processing what we do with Breeze LBFGS / OWLQN will also get benefitted by VL-BFGS...Ideally it should be part of breeze and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793760#comment-15793760 ] Debasish Das commented on SPARK-10078: -- Ideally feature partitioning should be automatically tuned...at 100M features, the master-only processing we do today with Breeze LBFGS / OWLQN will also benefit from VL-BFGS...Ideally it should be part of breeze and a proper interface should be defined so that the Breeze VL-BFGS solver can be called in Spark ML... > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15777650#comment-15777650 ] Debasish Das edited comment on SPARK-13857 at 12/26/16 5:57 AM: item->item and user->user were done in an old PR I had...if there is interest I can resend it...nice to see how it compares with the approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 was (Author: debasish83): item->item and user->user was done in an old PR I had...if there is interested I can resend it...nice to see how it compares with approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15777650#comment-15777650 ] Debasish Das commented on SPARK-13857: -- item->item and user->user were done in an old PR I had...if there is interest I can resend it...nice to see how it compares with the approximate nearest neighbor work from uber: https://github.com/apache/spark/pull/6213 > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
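As an illustration of the kind of computation the old PR covered (not its actual code), a brute-force item->item cosine similarity over MatrixFactorizationModel factors can be sketched as below; it broadcasts the factors, so it only suits modest item counts.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Top-k most cosine-similar items for every item, from the learned factors.
def itemSimilarities(model: MatrixFactorizationModel, topK: Int)
  : RDD[(Int, Array[(Int, Double)])] = {
  def norm(v: Array[Double]) = math.sqrt(v.map(x => x * x).sum)
  val items = model.productFeatures.mapValues(f => (f, norm(f)))
  val bcItems = items.sparkContext.broadcast(items.collect())
  items.map { case (i, (fi, ni)) =>
    val sims = bcItems.value.iterator
      .filter(_._1 != i)
      .map { case (j, (fj, nj)) =>
        val dot = fi.zip(fj).map { case (a, b) => a * b }.sum
        (j, dot / (ni * nj))                 // cosine similarity between factor vectors
      }
    (i, sims.toArray.sortBy(-_._2).take(topK))
  }
}
{code}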
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581366#comment-15581366 ] Debasish Das commented on SPARK-5992: - Also, do you have a hash function for euclidean distance? We use cosine, jaccard and euclidean with SPARK-4823...for knn comparison we can use an overlap metric...pick a k and then compare the overlap between LSH-based approximate knn and brute-force knn...let me know if you need help in running the benchmarks... > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
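For euclidean distance, the standard choice is the p-stable hash family of Datar et al.: h(v) = floor((a.v + b) / w) with a drawn from N(0,1)^d and b uniform in [0, w). A minimal sketch (parameters d and w are the caller's choice, names are illustrative):

{code}
import scala.util.Random

class EuclideanHash(d: Int, w: Double, seed: Long) {
  private val rng = new Random(seed)
  private val a = Array.fill(d)(rng.nextGaussian())  // random projection direction
  private val b = rng.nextDouble() * w               // random offset in [0, w)

  def hash(v: Array[Double]): Int = {
    val proj = a.zip(v).map { case (ai, vi) => ai * vi }.sum
    math.floor((proj + b) / w).toInt                 // nearby points tend to share a bucket
  }
}
{code}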
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581361#comment-15581361 ] Debasish Das commented on SPARK-5992: - Did you compare with brute-force knn? Normally LSH does not work well for nn queries, and that's why hybrid spill trees and other ideas came along...I can run some comparisons using SPARK-4823 > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581359#comment-15581359 ] Debasish Das commented on SPARK-4823: - We use it in multiple use cases internally but did not get time to refactor the PR into 3 smaller PRs...I will update the PR to 2.0 > rowSimilarities > --- > > Key: SPARK-4823 > URL: https://issues.apache.org/jira/browse/SPARK-4823 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Reza Zadeh > Attachments: MovieLensSimilarity Comparisons.pdf, > SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf > > > RowMatrix has a columnSimilarities method to find cosine similarities between > columns. > A rowSimilarities method would be useful to find similarities between rows. > This JIRA is to investigate which algorithms are suitable for such a > method, better than brute-forcing it. Note that when there are many rows (> > 10^6), it is unlikely that brute-force will be feasible, since the output > will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411023#comment-15411023 ] Debasish Das commented on SPARK-6932: - [~rxin] [~sowen] Do we have any other active parameter server effort going on other than the glint project from Rolf? I have started to look into glint to scale Spark-as-a-Service to process queries (the idea is to keep the Spark master as a coordinator where zero compute happens other than coordination through messages; in our impl right now compute happens on the master, which is a major con). More details will be covered in the talk https://spark-summit.org/eu-2016/events/fusing-apache-spark-and-lucene-for-near-realtime-predictive-model-building/ but I believe a parameter server (or something similar) will be needed to scale query-processing further, to a Cassandra ring architecture for example...We will provide our implementation of the spark-lucene integration as part of our open source framework (Trapezium). > A Prototype of Parameter Server > --- > > Key: SPARK-6932 > URL: https://issues.apache.org/jira/browse/SPARK-6932 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib, Spark Core >Reporter: Qiping Li > > h2. Introduction > As specified in > [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be > very helpful to integrate parameter server into Spark for machine learning > algorithms, especially for those with ultra-high-dimensional features. > After carefully studying the design doc of [Parameter > Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing], and > the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], > we proposed a prototype of Parameter Server on Spark (PS-on-Spark), with > several key design concerns: > * *User friendly interface* > Careful investigation is done of most existing Parameter Server > systems (including: [petuum|http://petuum.github.io], [parameter > server|http://parameterserver.org], > [paracel|https://github.com/douban/paracel]) and a user friendly interface is > designed by absorbing the essence of all these systems. > * *Prototype of distributed array* > IndexRDD (see > [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem > to be a good option for a distributed array, because in most cases, the #key > updates/second is not very high. > So we implement a distributed HashMap to store the parameters, which can > be easily extended to get better performance. > > * *Minimal code change* > Quite a lot of effort is done to avoid code changes to Spark core. Tasks > which need the parameter server are still created and scheduled by Spark's > scheduler. Tasks communicate with the parameter server through a client object, > over *akka* or *netty*. > With all these concerns we propose the following architecture: > h2. Architecture > !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! > Data is stored in RDD and is partitioned across workers. During each > iteration, each worker gets parameters from the parameter server, then computes > new parameters based on the old parameters and the data in its partition. Finally > each worker pushes the updated parameters to the parameter server. A worker communicates with > the parameter server through a parameter server client, which is initialized in the > `TaskContext` of this worker. 
> The current implementation is based on YARN cluster mode, > but it should not be a problem to transplant it to other modes. > h3. Interface > We refer to existing parameter server systems (petuum, parameter server, > paracel) when designing the interface of the parameter server. > *`PSClient` provides the following interface for workers to use:* > {code} > // get parameter indexed by key from parameter server > def get[T](key: String): T > // get multiple parameters from parameter server > def multiGet[T](keys: Array[String]): Array[T] > // add parameter indexed by `key` by `delta`, > // if multiple `delta` to update on the same parameter, > // use `reduceFunc` to reduce these `delta`s first. > def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit > // update multiple parameters at the same time, use the same `reduceFunc`. > def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit > > // advance clock to indicate that current iteration is finished. > def clock(): Unit > > // block until all workers have reached this line of code. > def sync(): Unit > {code} > *`PSContext` provides the following functions to use on the driver:* > {code} > // load parameters from existing rdd. > def loadPSModel[T](model: RDD[String, T]) > // fetch parameters from parameter server to construct
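As a usage illustration of the PSClient interface quoted above (a sketch only: PSClient is the proposed interface, not an existing Spark API, and localGradient is a stub for the user's per-partition gradient code), a worker task's training loop could look like this:

{code}
// Placeholder for the real gradient computed on this partition's data.
def localGradient(w: Array[Double], data: Array[(Double, Array[Double])]): Array[Double] =
  new Array[Double](w.length)

def workerLoop(ps: PSClient, data: Array[(Double, Array[Double])], iterations: Int): Unit = {
  for (_ <- 0 until iterations) {
    val w = ps.get[Array[Double]]("weights")               // pull current parameters
    val delta = localGradient(w, data)                     // compute local update
    ps.update[Array[Double]]("weights", delta,
      (a, b) => a.zip(b).map { case (x, y) => x + y })     // deltas summed server-side
    ps.clock()                                             // this iteration is finished
    ps.sync()                                              // wait for all workers
  }
}
{code}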
[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315935#comment-15315935 ] Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM: - Do you have runtime comparisons showing that when features <= 4096, OLS using Normal Equations is faster than BFGS? I am extending OLS to sparse features, and it would be great if you could point to the runtime experiments you have done... was (Author: debasish83): Do you have runtime comparisons that when features <= 4096, OLS using Normal Equations is faster than BFGS ? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315935#comment-15315935 ] Debasish Das commented on SPARK-9834: - Do you have runtime comparisons showing that when features <= 4096, OLS using Normal Equations is faster than BFGS? > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
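A sketch of the one-pass approach the description outlines (not Spark's actual implementation): aggregate A^T A and A^T b across partitions, then solve the d x d system locally on the driver; breeze's \ performs the dense factorization.

{code}
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

def olsNormalEquations(data: RDD[(Double, Array[Double])], d: Int): DenseVector[Double] = {
  val (ata, atb) = data.treeAggregate(
    (DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
    seqOp = { case ((m, v), (y, x)) =>
      val xv = DenseVector(x)
      (m += xv * xv.t, v += xv * y)          // rank-1 update of A^T A, accumulate A^T b
    },
    combOp = { case ((m1, v1), (m2, v2)) => (m1 += m2, v1 += v2) }
  )
  ata \ atb                                   // solve (A^T A) w = A^T b on the driver
}
{code}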
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735706#comment-14735706 ] Debasish Das commented on SPARK-10408: -- [~avulanov] In MLP, can we change BFGS to OWLQN and get L1 regularization? That way I can get sparse weights and clean up the network to avoid overfitting...For the autoencoder, did you experiment with a graphx based design? I would like to work on it. Basically the idea is to come up with an N-layer deep autoencoder that can support similar prediction APIs to matrix factorization. > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1], real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers: > References: > 1-3. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
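The optimizer swap the comment asks about is mechanical in breeze, since OWLQN shares LBFGS's interface and adds per-coordinate L1 strength; a toy sketch (the loss here is illustrative, not MLP's actual objective):

{code}
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, OWLQN}

val loss = new DiffFunction[DenseVector[Double]] {
  def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val r = w - DenseVector(3.0, 0.05)       // stand-in for the data-fit gradient
    (0.5 * (r dot r), r)
  }
}
// L1 strength 0.1 per coordinate; small weights are driven to exactly 0,
// which is the sparsity the comment wants for pruning the network.
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 10, (_: Int) => 0.1, 1e-6)
val w = owlqn.minimize(loss, DenseVector.zeros[Double](2))
{code}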
[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734170#comment-14734170 ] Debasish Das edited comment on SPARK-9834 at 9/8/15 3:18 PM: - [~mengxr] If you are open to using breeze.proximal.QuadraticMinimizer, we can support elastic net in this variant as well...The flow will be very similar to the QuadraticMinimizer integration into ALS...I have done runtime benchmarks compared to OWLQN, and if we can afford to do a dense cholesky, QuadraticMinimizer converges faster than OWLQN. There are two new QuadraticMinimizer features I am working on which will further improve the solver: 1. Sparse LDL through Tim Davis's LGPL code, using breeze sparse matrices for sparse gram and conic formulations. The plan is to add it in breeze-native under LGPL, similar to the netlib-java integration. 2. ADMM acceleration using Nesterov's method. ADMM can be run at the same complexity as FISTA (implemented in TFOCS). Reference: http://www.optimization-online.org/DB_FILE/2009/12/2502.pdf Although in practice I found even the ADMM implemented right now in QuadraticMinimizer converges faster than OWLQN. Tom in his paper demonstrated faster ADMM convergence compared to FISTA for quadratic problems: ftp://ftp.math.ucla.edu/pub/camreport/cam12-35.pdf. Since X^TX is available in these problems (ALS and linear regression), I also compute the min and max eigenvalues using power iteration (breeze.optimize.linear.PowerMethod) in the code, which gives the Lipschitz estimate L, so there is no line-search overhead. This trick did not work for the nonlinear variant, as the hessian estimates are not close to the gram matrix ! QuadraticMinimizer is optimized to run at par with blas dposv when there are no constraints, while BFGS/OWLQN both still have a lot of overhead from iterators etc. That might also be the reason that I see QuadraticMinimizer being faster than BFGS/OWLQN. It might be the right time to do the micro-benchmark for QuadraticMinimizer that you asked for as well. Let me know what you think. I can finish up the micro-benchmark, bring the runtime of QuadraticMinimizer to the ALS NormalEquationSolver, and then start the L1 experiments. was (Author: debasish83): If you are open to use breeze.proximal.QuadraticMinimizer we can support elastic net in this variant as well...I can add it on top of your PR...it will be very similar to quadraticminimizer integration to ALS...I have done runtime benchmarks compared to OWLQN and if we can afford to do dense cholesky QuadraticMinimizer converges faster than OWLQN...there are two new features I am working on...sparse ldl through tim davis lgpl code and using breeze sparse matrix for sparse gram and conic formulations and admm acceleration using nesterov method...admm can also be run in the same complexity as FISTA...David Goldfarb proved it. > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
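The power-iteration trick mentioned above can be sketched in a few lines: with the gram matrix H = X^T X in hand, its largest eigenvalue is the Lipschitz constant L of the quadratic's gradient, so accelerated (Nesterov/FISTA-style) steps of size 1/L need no line search. A plain sketch, not breeze's PowerMethod itself:

{code}
import breeze.linalg.{DenseMatrix, DenseVector, norm}

// Estimate the largest eigenvalue of a symmetric PSD matrix H by power iteration.
def maxEigen(h: DenseMatrix[Double], iters: Int = 100): Double = {
  var v = DenseVector.rand(h.cols)
  var lambda = 0.0
  for (_ <- 0 until iters) {
    val hv = h * v
    lambda = norm(hv)          // converges to lambda_max as v aligns with the top eigenvector
    v = hv / lambda
  }
  lambda                       // use 1.0 / lambda as the gradient step size
}
{code}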
[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares
[ https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734170#comment-14734170 ] Debasish Das commented on SPARK-9834: - If you are open to using breeze.proximal.QuadraticMinimizer we can support elastic net in this variant as well...I can add it on top of your PR...it will be very similar to the QuadraticMinimizer integration into ALS...I have done runtime benchmarks compared to OWLQN, and if we can afford to do a dense cholesky, QuadraticMinimizer converges faster than OWLQN...there are two new features I am working on...sparse LDL through Tim Davis's LGPL code, using breeze sparse matrix for sparse gram and conic formulations, and ADMM acceleration using Nesterov's method...ADMM can also be run at the same complexity as FISTA...David Goldfarb proved it. > Normal equation solver for ordinary least squares > - > > Key: SPARK-9834 > URL: https://issues.apache.org/jira/browse/SPARK-9834 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Add normal equation solver for ordinary least squares with not many features. > The approach requires one pass to collect AtA and Atb, then solve the problem > on driver. It works well when the problem is not very ill-conditioned and > does not have many columns. It also provides R-like summary statistics. > We can hide this implementation under LinearRegression. It is triggered when > there are no more than, e.g., 4096 features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734130#comment-14734130 ] Debasish Das commented on SPARK-10078: -- [~mengxr] will it be a Breeze LBFGS modification or part of mllib.optimization? Is someone looking into it? > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: SparkMeetup2015-Experiments2.pdf SparkMeetup2015-Experiments1.pdf rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648340#comment-14648340 ] Debasish Das commented on SPARK-4823: - We did a more detailed experiment for the July 2015 Spark Meetup to understand the shuffle effects on runtime. I attached the data from the experiments to the JIRA. I will update the PR as discussed with Reza. I am targeting 1 PR for Spark 1.5. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583886#comment-14583886 ] Debasish Das commented on SPARK-2336: - Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree? We have a PR for rowSimilarities which we will use to compare the QoR of your PR as soon as you open up a stable version. Approximate k-NN Models for MLLib - Key: SPARK-2336 URL: https://issues.apache.org/jira/browse/SPARK-2336 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: clustering, features After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583886#comment-14583886 ] Debasish Das edited comment on SPARK-2336 at 6/12/15 6:51 PM: -- Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree? We have a PR for rowSimilarities https://github.com/apache/spark/pull/6213 for brute force KNN generation which we will use to compare the QoR of your PR as soon as you open up a stable version. was (Author: debasish83): Very cool idea Sen. Did you also look into FLANN for randomized KDTree and KMeansTree. We have a PR for rowSimilarities which we will use to compare the QoR of your PR as soon as you open up a stable version. Approximate k-NN Models for MLLib - Key: SPARK-2336 URL: https://issues.apache.org/jira/browse/SPARK-2336 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Brian Gawalt Priority: Minor Labels: clustering, features After tackling the general k-Nearest Neighbor model as per https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to also offer approximate k-Nearest Neighbor. A promising approach would involve building a kd-tree variant within each partition, a la http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 This could offer a simple non-linear ML model that can label new data with much lower latency than the plain-vanilla kNN versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Affects Version/s: (was: 1.4.0) Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Debasish Das Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from the Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now ! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are the same as above, except that we don't support affine + bounds, Aeq x = beq, lb <= x <= ub, yet. Most likely we don't need that for ML applications 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using a graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
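To illustrate the g(z) piece of the min f(x) + g(z) formulation above: for the L1 constraint, the proximal operator is elementwise soft-thresholding, prox_{lambda||.||_1}(v)_i = sign(v_i) * max(|v_i| - lambda, 0). A minimal standalone sketch (not the breeze Proximal implementation itself):

{code}
import breeze.linalg.DenseVector

// Soft-thresholding: shrinks each coordinate toward zero by lambda and
// zeroes out anything smaller, which is what induces sparse factors.
def softThreshold(v: DenseVector[Double], lambda: Double): DenseVector[Double] =
  v.map(x => math.signum(x) * math.max(math.abs(x) - lambda, 0.0))
{code}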
[jira] [Updated] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4823: Attachment: MovieLensSimilarity Comparisons.pdf The attached file shows the runtime comparison of the row and column based flows on all items from the MovieLens dataset on my local Macbook with 8 cores, 1 GB driver and 4 GB executor memory. 1e-2 is the threshold that's set for both the row based kernel flow and the column based dimsum flow. Stages 24 - 35 are the row similarity flow. Total runtime ~ 20 s. Stage 64 is the col similarity mapPartitions. Total runtime ~ 4.6 mins. This shows the power of blocking in Spark, and I have not yet gone to gemv, which will decrease the runtime further. I updated the driver code in examples.mllib.MovieLensSimilarity rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Attachments: MovieLensSimilarity Comparisons.pdf RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14557416#comment-14557416 ] Debasish Das commented on SPARK-2426: - [~mengxr] Should I add the PR to spark packages and close the JIRA ? The main contribution was to add sparsity constraints (L1 and probability simplex) to user and product factors in implicit and explicit feedback factorization and interested users can use the features from spark packages if they need...Later if there is community interest, we can pull it in to master ALS ? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551294#comment-14551294 ] Debasish Das commented on SPARK-6323: - The Petuum paper that got released today mentions going to larger topic sizes (~10-100K): http://arxiv.org/pdf/1412.1576v1.pdf Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from the Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use the ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now ! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are the same as above, except that we don't support affine + bounds, Aeq x = beq, lb <= x <= ub, yet. Most likely we don't need that for ML applications 3. For the solver we will use breeze.optimize.proximal.NonlinearMinimizer, which in turn uses a projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using a graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318 ] Debasish Das commented on SPARK-4823: - I opened up a PR that worked well for our datasets. It is still brute-force computation, although we use blocked cartesian and user defined kernels to cut computation and shuffle...There are trivial ideas to go from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to mllib BLAS, although I don't think that will give us the runtime boost we are looking for... We are looking into the approximate KNN family of algorithms to improve the runtime further...KDTree is good for dense vectors with few features, but for sparse vectors in higher dimensions researchers did not find it useful.. LSH seems to be most commonly used and that's the direction we are looking into. I looked into papers, but the one that showed good recall values in its experiments compared to brute force KNN is Google Correlate, and that's the validation strategy we will focus on: https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other references that you deem fit. There are twitter papers as well using LSH, and the implementation is available in algebird. We will start with algebird LSH, but ideally we don't want to have a distance metric hardcoded in LSH. If we get good recall using the LSH based method compared to the rowSimilarities code from the PR, we will use the LSH based method to approximately compute similarities between dense/sparse rows using the cosine kernel, dense userFactor/productFactor from factorization using the product kernel, and dense user/product factor similarities using the cosine kernel. The kernel abstraction is part of the current PR, and right now we support Cosine, Product, Euclidean and RBF. Pearson is something that's of interest but it's not added yet. For approximate row similarity I will open up a separate JIRA. rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
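The kernel abstraction described in that comment could look roughly like the sketch below (names are illustrative, not the PR's actual API): similarity is computed through a pluggable kernel, so the brute-force flow and a future LSH flow can share the same metrics.

{code}
trait Kernel extends Serializable {
  def compute(x: Array[Double], y: Array[Double]): Double
}

object CosineKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    var dot = 0.0; var nx = 0.0; var ny = 0.0; var i = 0
    while (i < x.length) { dot += x(i) * y(i); nx += x(i) * x(i); ny += y(i) * y(i); i += 1 }
    dot / math.sqrt(nx * ny)
  }
}

class RBFKernel(gamma: Double) extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    var d2 = 0.0; var i = 0
    while (i < x.length) { val d = x(i) - y(i); d2 += d * d; i += 1 }
    math.exp(-gamma * d2)                    // exp(-gamma * ||x - y||^2)
  }
}
{code}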
[jira] [Updated] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-4231: Affects Version/s: (was: 1.2.0) 1.4.0 Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.4.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das reopened SPARK-4231: - The code was not part of SPARK-3066 and so reopening... Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for movielens dataset but after addition of RankingMetrics and enhancements to ALS, it is critical to look at not only the RMSE but also measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag that takes an input whether user/product recommendation is being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
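A sketch of the proposed validation (not the reopened PR's code) using mllib's RankingMetrics: compare each user's top-k recommendations against the held-out items they actually interacted with, and report MAP.

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// heldOut: userId -> the item ids withheld for evaluation.
def meanAveragePrecision(model: MatrixFactorizationModel,
                         heldOut: RDD[(Int, Array[Int])], k: Int): Double = {
  val recs = model.recommendProductsForUsers(k)
    .mapValues(_.map(_.product))             // keep only the recommended item ids
  val predictionAndLabels = recs.join(heldOut).values
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
{code}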
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512843#comment-14512843 ] Debasish Das commented on SPARK-5992: - Did someone compare the algebird LSH with the spark minhash link above? Unless algebird is slow (which I found to be the case for the TopK monoid), should we use it the same way HLL is being used in Spark streaming? Is it ok to add algebird to mllib? Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
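For reference, the minhash scheme both libraries implement is small enough to sketch independently of algebird: signature(S)_i = min over x in S of h_i(x), with h_i(x) = (a_i * x + b_i) mod p; two sets collide on coordinate i with probability equal to their jaccard similarity.

{code}
import scala.util.Random

class MinHash(numHashes: Int, seed: Long) {
  private val p = 2147483647L                     // Mersenne prime 2^31 - 1
  private val rng = new Random(seed)
  private val ab = Array.fill(numHashes)(
    ((1 + rng.nextInt(Int.MaxValue - 1)).toLong,  // a_i != 0
     rng.nextInt(Int.MaxValue).toLong))           // b_i

  // One signature coordinate per hash function: the minimum hash over the set.
  def signature(set: Iterable[Int]): Array[Long] =
    ab.map { case (a, b) => set.map(x => (a * x + b) % p).min }
}
{code}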
[jira] [Commented] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484646#comment-14484646 ] Debasish Das commented on SPARK-3987: - @mengxr for this testcase it was fixed, but I remember there was someone on the user list who mentioned that he got an incorrect result compared to some other tool...maybe it's a good idea to ask for testcases... NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, ... (the remaining entries of the 20x20 gram matrix and the linear term are truncated in this digest)
[jira] [Comment Edited] (SPARK-3987) NNLS generates incorrect result
[ https://issues.apache.org/jira/browse/SPARK-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484646#comment-14484646 ] Debasish Das edited comment on SPARK-3987 at 4/8/15 3:31 AM: - [~mengxr] for this testcase it was fixed, but I remember someone on the user list mentioned that he got an incorrect result compared to some other tool...maybe it's a good idea to ask for testcases... was (Author: debasish83): @mengxr for this testcase it was fixed but I remember there was someone in user list who mentioned that he got incorrect result compared to some other tool...may be it's a good idea to ask for testcases... NNLS generates incorrect result --- Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Debasish Das Assignee: Shuo Xiang Fix For: 1.1.1, 1.2.0 Hi, Please see the example gram matrix and linear term: val P2 = new DoubleMatrix(20, 20, 333907.312770, -60814.043975, 207935.829941, -162881.367739, -43730.396770, 17511.428983, -243340.496449, -225245.957922, 104700.445881, 32430.845099, ...) [the remaining entries of the 20x20 gram matrix, and the linear term, are truncated in this digest]
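For anyone trying to reproduce this class of NNLS failure outside Spark, here is a minimal projected-gradient sketch (illustrative only; it is not the mllib.optimization.NNLS implementation, and P/q stand in for the gram matrix and linear term above):

  import breeze.linalg.{DenseMatrix, DenseVector, trace}

  // Solve min 0.5 x'Px + q'x subject to x >= 0 by projected gradient.
  // P is assumed symmetric positive semi-definite (a gram matrix A'A),
  // so trace(P) upper-bounds its largest eigenvalue and 1/trace(P) is a
  // safe step size.
  def nnlsProjectedGradient(P: DenseMatrix[Double], q: DenseVector[Double],
                            iters: Int = 10000): DenseVector[Double] = {
    val step = 1.0 / trace(P)
    var x = DenseVector.zeros[Double](q.length)
    var i = 0
    while (i < iters) {
      val grad = P * x + q                         // gradient of the quadratic
      x = (x - grad * step).map(math.max(_, 0.0))  // step, then project onto x >= 0
      i += 1
    }
    x
  }

Running a reference solver like this next to mllib's NNLS on user-submitted gram matrices is one way to collect the testcases asked for above.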
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388128#comment-14388128 ] Debasish Das commented on SPARK-5564: - [~sparks] we are trying to access the EC2 dataset but it is giving an error: [ec2-user@ip-172-31-38-56 ~]$ aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/ A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied Could you please take a look at whether it is still available for use? Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das edited comment on SPARK-3066 at 4/1/15 4:28 AM: - Also unless the raw flow runs there is no way to validate how well an LSH based flow is doing...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in the coming weeks...we already have a JIRA for rowSimilarities... was (Author: debasish83): Also unless the raw flow runs there is no way to validate how good a LSH based flow is doing since users...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in coming weeks...we already have a JIRA for rowSimilarities... Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389973#comment-14389973 ] Debasish Das commented on SPARK-3066: - Also unless the raw flow runs there is no way to validate how well an LSH based flow is doing...I updated the PR today with [~mengxr] reviews...I am working on level 3 BLAS routines for item-item similarity calculation from matrix factors and the same optimization can be applied here...I will open up the PR for that in the coming weeks...we already have a JIRA for rowSimilarities... Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
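The description's recipe (broadcast one side, score with level-3 BLAS, take top-k) can be sketched roughly as follows; this is a hedged illustration with assumed shapes and names, not the eventual recommendAll implementation, and a production version would stack a block of users into a matrix so the multiply becomes a true gemm:

  import breeze.linalg.{DenseMatrix, DenseVector}
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Broadcast the (smaller) item side as a rank x numItems matrix and
  // score each user against all items with one matrix-vector multiply.
  def recommendAll(sc: SparkContext,
                   userFactors: RDD[(Int, Array[Double])],
                   itemFactors: Array[(Int, Array[Double])],
                   rank: Int, k: Int): RDD[(Int, Array[(Int, Double)])] = {
    val itemIds = itemFactors.map(_._1)
    val itemMat = new DenseMatrix(rank, itemFactors.length, itemFactors.flatMap(_._2))
    val bc = sc.broadcast((itemIds, itemMat))
    userFactors.mapPartitions { iter =>
      val (ids, items) = bc.value
      iter.map { case (userId, f) =>
        val scores = items.t * DenseVector(f)   // inner products with every item
        val topK = scores.toArray.zipWithIndex
          .map { case (s, j) => (ids(j), s) }
          .sortBy(-_._2).take(k)                // Utils.takeOrdered stands in for this sort
        (userId, topK)
      }
    }
  }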
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das commented on SPARK-5564: - Cool...I will run my experiments on the same dataset as well and report results... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387180#comment-14387180 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 6:52 PM: -- Cool...I will run my experiments on the same dataset as well and report results...By the way my plan is to run 1000 sparse topics here...K will be 1000 but sparse, so we never shuffle more than 100-sparse vectors...For the sparsity experiments did you also add something specific? was (Author: debasish83): Cool...I will run my experiments on the same dataset as well and report results... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das commented on SPARK-5564: - [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:31 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking LDA and how they scale as we start scaling the topics? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386049#comment-14386049 ] Debasish Das edited comment on SPARK-5564 at 3/30/15 12:30 AM: --- [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases based on document and word matrix...For recommendation, I know how to construct the testcases with loglikelihood loss was (Author: debasish83): [~josephkb] could you please point me to the datasets that are used for benchmarking? I have started testing loglikelihood loss for recommendation and since I already added the constraints, this is the right time to test it on LDA benchmarks as well...I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it... I am looking into LDA test-cases but since I am optimizing log-likelihood directly, I am looking to add more testcases from your LDA JIRA...For recommendation, I know how to construct the testcases... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Affects Version/s: (was: 1.3.0) 1.4.0 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
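To make problem 2 above (ALS with L1 regularization) concrete, here is a rough ADMM sketch for min 0.5x'Hx + c'x + lambda*||x||_1; it is a toy illustration of the splitting idea, not the QuadraticMinimizer/Proximal API proposed for mllib.optimization:

  import breeze.linalg.{DenseMatrix, DenseVector, inv}

  // ADMM on the split min f(x) + g(z), x = z, with f the quadratic and
  // g(z) = lambda*||z||_1 handled by its proximal operator (soft threshold).
  def admmL1Qp(H: DenseMatrix[Double], c: DenseVector[Double],
               lambda: Double, rho: Double = 1.0,
               iters: Int = 200): DenseVector[Double] = {
    val n = c.length
    val hInv = inv(H + DenseMatrix.eye[Double](n) * rho)  // cache across solves
    def soft(v: DenseVector[Double], t: Double) =
      v.map(a => math.signum(a) * math.max(math.abs(a) - t, 0.0))
    var z = DenseVector.zeros[Double](n)
    var u = DenseVector.zeros[Double](n)
    for (_ <- 0 until iters) {
      val x = hInv * ((z - u) * rho - c)  // exact minimizer of the x-subproblem
      z = soft(x + u, lambda / rho)       // prox of lambda*||.||_1
      u = u + x - z                       // scaled dual update
    }
    z
  }

Swapping the soft threshold for a box projection gives problem 1 (ALS with bounds) under the same splitting.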
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 Isn't there a log term that multiplies with R to make it a KL divergence loss? Maybe the log term can be removed under non-negativity and normalization constraints? @mengxr any ideas here? If we can do that we can target the KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (does not need regression style datasets which is difficult to get in practical setup where people normally don't give any rating and satisfaction should be infered from viewing time etc) Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
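For reference, the loss in question from Lee and Seung's paper (my transcription; the "log term that multiplies with R" discussed above is the first term) is the generalized KL divergence between R and the reconstruction WH:

  D(R \| WH) = \sum_{ij} \left( R_{ij} \log \frac{R_{ij}}{(WH)_{ij}} - R_{ij} + (WH)_{ij} \right)

with the corresponding multiplicative update H_{aj} \leftarrow H_{aj} \frac{\sum_i W_{ia} R_{ij} / (WH)_{ij}}{\sum_i W_{ia}} keeping H nonnegative.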
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 3:23 PM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 Isn't there a log term that multiplies with R to make it a KL divergence loss? Maybe the log term can be removed under non-negativity and normalization constraints? @mengxr any ideas here? If we can do that we can target the KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221Is not there is log term that multiplies with R to make it a KL divergence loss ? May be the log term can removed under non-negative and normalization constraints ? @mengxr any ideas here ? If we can do that we can target KL divergence loss from Lee's paper: http://hebb.mit.edu/people/seung/papers/ls-lponm-99.pdf For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (does not need regression style datasets which is difficult to get in practical setup where people normally don't give any rating and satisfaction should be infered from viewing time etc) Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378062#comment-14378062 ] Debasish Das commented on SPARK-6323: - I did some more reading and realized that even for small ranks, the current least square approach won't be able to handle either the KL divergence loss (unless there is an approximation that I am missing, more discussion on https://issues.apache.org/jira/browse/SPARK-2426) or the PLSA formulation (KL divergence loss with additional constraints)...Even for collaborative filtering with small ranks, this code will be useful... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
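The "10K ranks but 100-sparse factors" point in the description boils down to truncating each factor to its largest-magnitude entries before shuffling; a hedged sketch (the helper name and the use of mllib's SparseVector are my assumptions):

  import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

  // Keep only the s largest-magnitude entries of a dense factor
  // (s = 100 in the plan above) so shuffle size stays in check.
  def sparsify(factor: Array[Double], s: Int): SparseVector = {
    val topIdx = factor.zipWithIndex
      .sortBy { case (v, _) => -math.abs(v) }
      .take(s).map(_._2).sorted            // SparseVector wants ordered indices
    val values = topIdx.map(factor(_))
    Vectors.sparse(factor.length, topIdx, values).asInstanceOf[SparseVector]
  }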
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das commented on SPARK-2426: - [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ||.|| stands for Frobenius norm (or l1).", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
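Spelling out the quoted objective once in clean notation (the squared Frobenius norm and the row-wise normalization are my reading of the comment, not something stated explicitly):

  \min_{W, H \ge 0} \; \|R - WH\|_F^2 + \lambda \left( \|W\|_F^2 + \|H\|_F^2 \right)
  \quad \text{s.t.} \quad \textstyle\sum_k W_{ik} = 1 \;\; \forall i

i.e. l2 regularized least squares under non-negativity, with the factors normalized so they stay stochastic.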
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. || . || stands for Frobenius norm (or l1).", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ||.|| stands for Frobenius norm (or l1)., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:11 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. || . || stands for Frobenius norm (or l1)., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377357#comment-14377357 ] Debasish Das edited comment on SPARK-2426 at 3/24/15 6:13 AM: -- [~acopich] From your earlier comment "Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints.", could you please point me to a good reference with application to collaborative filtering/topic modeling? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323. I am very curious how much slower we get compared to stochastic matrix decomposition using ALS. MAP loss looks like a strong contender to LDA and can natively handle counts (it does not need regression-style datasets, which are difficult to get in a practical setup where people normally don't give any rating and satisfaction should be inferred from viewing time etc) was (Author: debasish83): [~acopich] From your comment before Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ., could you please point me to a good reference with application to collaborative filtering/topic modeling ? Stochastic matrix decomposition is what we can do in this PR now https://github.com/apache/spark/pull/3221 For MAP loss, I will open up a PR in a week through JIRA https://issues.apache.org/jira/browse/SPARK-6323...I am very curious how much slower we get compared to stochastic matrix decomposition using ALS Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046 ] Debasish Das commented on SPARK-3735: - We might want to consider doing some of these things through an indexed RDD exposed through an API...right now ALS is completely join based...can we do something nicer if we have access to an efficient read-only cache from ALS mapPartitions...The idea here is to treat zeros explicitly rather than adding the implicit heuristic, which is generally hard to tune... Sending the factor directly or AtA based on the cost in ALS --- Key: SPARK-3735 URL: https://issues.apache.org/jira/browse/SPARK-3735 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is common to have some super popular products in the dataset. In this case, sending many user factors to the target product block could be more expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_ij` to the product block. The cost of sending a single factor is `k`, while the cost of sending a normal equation is much more expensive, `k * (k + 3) / 2`. However, if we use normal equation for all products associated with a user, we don't need to send this user factor. Determining the optimal assignment is hard. But we could use a simple heuristic. Inside any rating block, 1) order the product ids by the number of user ids associated with them in desc order 2) starting from the most popular product, mark popular products as "use normal eq" and calculate the cost Remember the best assignment that comes with the lowest cost and use it for computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
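A toy version of the cost comparison in the description (deliberately simplified: it scores each product independently by its user count and ignores the shared-factor savings that make the real assignment problem hard):

  // Shipping one user factor costs k doubles; shipping a normal equation
  // costs k * (k + 3) / 2 doubles (upper-triangular k x k gram plus a
  // k-vector), as stated above. Popular products should use the normal eq.
  def chooseNormalEq(usersPerProduct: Map[Int, Int], k: Int): Set[Int] = {
    val normalEqCost = k * (k + 3) / 2
    usersPerProduct.collect {
      case (product, numUsers) if numUsers.toLong * k > normalEqCost => product
    }.toSet
  }

For k = 10 the normal equation costs 65 doubles, so any product rated by more than 6 users in the block would flip to the normal-equation path under this crude rule.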
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375325#comment-14375325 ] Debasish Das commented on SPARK-2426: - [~acopich] From your comment "There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up.", could you help by adding some testcases and a driver for the PLSA approximation? The PR https://github.com/apache/spark/pull/3221 now has the LSA constraints and least square loss... The idea here is to enforce the probability simplex on the user side, bounds on the item side and normalization of the item columns at each ALS iteration...The MAP loss is tracked through https://issues.apache.org/jira/browse/SPARK-6323 but the solve idea will be very similar, as I mentioned before, so we can re-use the flow test-cases...We can discuss more on the PR...It will be great if you can help add examples.mllib.PLSA as well that will drive both PLSA through ALS and ALM (alternating MAP loss optimization)... Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
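The "probability simplex on the user side" step maps to the standard sort-based Euclidean projection onto { x : x >= 0, 1'x = 1 } (Duchi et al.); a minimal self-contained sketch:

  // Project v onto the probability simplex: find the threshold theta such
  // that x_i = max(v_i - theta, 0) sums to 1, via one descending sort.
  def projectSimplex(v: Array[Double]): Array[Double] = {
    val u = v.sorted(Ordering[Double].reverse)
    var cumSum = 0.0
    var theta = 0.0
    for (j <- u.indices) {
      cumSum += u(j)
      val t = (cumSum - 1.0) / (j + 1)
      if (u(j) - t > 0) theta = t  // last index where the support stays positive
    }
    v.map(x => math.max(x - theta, 0.0))
  }

The bounds on the item side are a plain clip to [lb, ub] and the column normalization is a rescale, so this projection is the only nontrivial proximal step in the iteration described above.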
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss and LoglikelihoodLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library:
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/16/15 6:30 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equation (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics now, which can easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet as the data moves from sparse to dense. With the factorization flow I will start to see results next week. These are the algorithm steps: 1. minimize f(w,h*) s.t. 1'w = 1, w >= 0 (row constraints) 2. minimize f(w*,h) s.t. 0 <= h <= 1 3. Normalize each column in h. Note that 2 and 3 are an approximation to the original matrix formulation but the column normalization makes the factor probabilistically well defined for the next PLSA iteration. f(w,h*) is the loglikelihood loss from the PLSA paper. I will start to look into the graphx based flow after that because in general that flow makes more sense for distributed nets where the objective is no longer separable like ALS-WR / PLSA. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure for papers on global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization Equation (2) and (3) 2. The original PLSA paper from Hoffman et al. 3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also come from sparse topics now which can easily range in ~ 10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in LDA framework but I don't think we know which flow will scale better yet as the data moves from sparse from dense. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into graphx based flow after that. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f (
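Steps 2 and 3 of the scheme above are simple enough to sketch in isolation (a hedged illustration on a Breeze matrix; solving the f(w*,h) subproblem itself is elided):

  import breeze.linalg.{DenseMatrix, sum}

  // Apply step 2's box constraint 0 <= h <= 1 and then step 3's column
  // normalization, so each column of h stays a probability distribution
  // for the next PLSA iteration.
  def boxAndNormalize(h: DenseMatrix[Double]): DenseMatrix[Double] = {
    val clipped = h.map(v => math.min(math.max(v, 0.0), 1.0))
    for (j <- 0 until clipped.cols) {
      val colSum = sum(clipped(::, j))
      if (colSum > 0.0) clipped(::, j) :/= colSum
    }
    clipped
  }

Step 1's simplex constraint on w is the sort-based projection shown earlier in this thread.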
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:29 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equation (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics now, which can easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked [~josephkb] if he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet as the data moves from sparse to dense. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into the graphx based flow after that. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS(positive) and waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or admm + proximal)... Sure for papers on global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization Equation (2) and (3) 2. The original PLSA paper from Hoffman et al. 3. Collaborative filtering using PLSA from Hoffman et al. Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also come from sparse topics now which can easily range in ~ 10K... The idea can be implemented in the Sparse LDA JIRA as well https://issues.apache.org/jira/browse/SPARK-5564 and I asked Joseph if he thinks we should do it in LDA framework but I don't think we know which flow will scale better yet. With the factorization flow I will start to see results next week as the flow is designed to handle these ideas. I will start to look into graphx based flow after that.
Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z). g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr] recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f ( x ) + g ( z ) 1. Loss function f ( x ) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g ( z ) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360956#comment-14360956 ] Debasish Das edited comment on SPARK-6323 at 3/15/15 4:26 PM: -- g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like the following and am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. 3. Collaborative filtering using PLSA from Hofmann et al., Latent Semantic Models for Collaborative Filtering 4. Industry specific application: http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818 For large rank matrix factorization the requirement also comes from sparse topics, which can now easily range to ~10K... The idea can be implemented in the Sparse LDA JIRA as well (https://issues.apache.org/jira/browse/SPARK-5564) and I asked Joseph whether he thinks we should do it in the LDA framework, but I don't think we know which flow will scale better yet. With the factorization flow I will start to see results next week, as the flow is designed to handle these ideas. I will start to look into the graphx based flow after that. was (Author: debasish83): g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. For large rank matrix factorization I think the requirements come from sparse topics, which can now easily range to ~10K... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1.
Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)...
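The 100-sparse factors amount to keeping only the k largest-magnitude entries of each rank-10K factor before it is shuffled. A small sketch of that truncation (topKSparse is a hypothetical helper; Vectors.sparse is the real mllib factory):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Keep only the k largest-magnitude entries of a dense factor so the
    // shuffled representation stays k-sparse (e.g. rank 10K, k = 100).
    def topKSparse(factor: Array[Double], k: Int): Vector = {
      val topIdx: Array[Int] = factor.indices
        .sortBy(i => -math.abs(factor(i)))    // indices by |value|, descending
        .take(k)
        .sorted                               // Vectors.sparse wants increasing indices
        .toArray
      val values = topIdx.map(factor(_))
      Vectors.sparse(factor.length, topIdx, values)
    }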
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361981#comment-14361981 ] Debasish Das commented on SPARK-6323: - By the way, I can close the JIRA if it is not of broader interest to the community... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360956#comment-14360956 ] Debasish Das commented on SPARK-6323: - g(z) is not regularization...we support constraints like z >= 0; lb <= z <= ub; 1'z = s, z >= 0; L1(z) for now...These are the same constraints I supported through QuadraticMinimizer for 2426. I already migrated ALS to use QuadraticMinimizer (default) and NNLS (positive) and am waiting for the next breeze release. I call it z since we are using splitting algorithms here for the solve (projection based or ADMM + proximal)... Sure, for papers on the global objective refer to any PLSA paper with matrix factorization. I personally like these 2 and I am focused on them: 1. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization, Equations (2) and (3) 2. The original PLSA paper from Hofmann et al. For large rank matrix factorization I think the requirements come from sparse topics, which can now easily range to ~10K... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
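For reference, the proximal operators behind two of the constraints listed above reduce to one-liners; a sketch of the math they implement (plain Scala, not the Breeze Proximal classes themselves):

    // prox of lambda*||x||_1: soft-thresholding, shrinks each coordinate toward 0
    def proxL1(x: Array[Double], lambda: Double): Array[Double] =
      x.map(v => math.signum(v) * math.max(math.abs(v) - lambda, 0.0))

    // prox of the indicator of {x : x >= 0}: projection onto the nonnegative orthant
    def proxPositive(x: Array[Double]): Array[Double] =
      x.map(v => math.max(v, 0.0))

The box and simplex constraints are handled the same way, with the corresponding projections substituted for these maps.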
[jira] [Comment Edited] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361005#comment-14361005 ] Debasish Das edited comment on SPARK-6323 at 3/13/15 7:48 PM: -- There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed. But yes, in all these formulations things break up like ALS, and that's why we can distribute the solves to spark workers... If the objective function does not break up, like a neural net (which is the natural extension for ALS), then we need parameter server type ideas for the solver... was (Author: debasish83): There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
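The splitting being referred to is the standard ADMM decomposition of min f(x) + g(z) subject to x = z, which needs exactly the two pieces the comment mentions: a smooth solve for x and a proximal step for z. A schematic sketch with both pieces left abstract (xUpdate and prox are placeholders, not Breeze API):

    // One ADMM loop for min f(x) + g(z) s.t. x = z, with scaled dual u.
    // xUpdate solves argmin_x f(x) + (rho/2)||x - (z - u)||^2 (e.g. a quadratic solve);
    // prox is the proximal operator of g with parameter 1/rho.
    def admm(xUpdate: Array[Double] => Array[Double],
             prox: Array[Double] => Array[Double],
             n: Int, iters: Int): Array[Double] = {
      var x = Array.fill(n)(0.0)
      var z = Array.fill(n)(0.0)
      var u = Array.fill(n)(0.0)
      for (_ <- 0 until iters) {
        x = xUpdate(z.zip(u).map { case (zi, ui) => zi - ui })          // x-minimization
        z = prox(x.zip(u).map { case (xi, ui) => xi + ui })             // z-minimization via prox
        u = u.zip(x.zip(z)).map { case (ui, (xi, zi)) => ui + xi - zi } // dual update
      }
      x
    }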
[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361005#comment-14361005 ] Debasish Das commented on SPARK-6323: - There are some other interesting cases for large rank non-convex functions, but we will come to them once PLSA via factorization is fixed... Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse.
This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems:
[jira] [Created] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
Debasish Das created SPARK-6323: --- Summary: Large rank matrix factorization with Nonlinear loss and constraints Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization).
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine constraint (Aeq x = beq, lb <= x <= ub) yet. But most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse.
This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand the underlying architecture. This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems:
[jira] [Resolved] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das resolved SPARK-4231. - Resolution: Duplicate Add RankingMetrics to examples.MovieLensALS --- Key: SPARK-4231 URL: https://issues.apache.org/jira/browse/SPARK-4231 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Debasish Das Original Estimate: 24h Remaining Estimate: 24h examples.MovieLensALS computes RMSE for the movielens dataset, but after the addition of RankingMetrics and enhancements to ALS, it is critical to look not only at the RMSE but also at measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation for examples.MovieLensALS and also added a flag indicating whether user/product recommendations are being validated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
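For the record, prec@k and MAP are exactly what mllib's RankingMetrics computes. A minimal sketch of the wiring such a validation flag would enable (the two input RDDs are hypothetical stand-ins for what examples.MovieLensALS would produce):

    import org.apache.spark.mllib.evaluation.RankingMetrics
    import org.apache.spark.rdd.RDD

    // predicted: per-user ranked product ids; actual: per-user relevant product ids.
    def rankingScores(predicted: RDD[(Int, Array[Int])],
                      actual: RDD[(Int, Array[Int])]): (Double, Double) = {
      val paired: RDD[(Array[Int], Array[Int])] =
        predicted.join(actual).values            // (ranked predictions, ground truth)
      val metrics = new RankingMetrics(paired)
      (metrics.precisionAt(10), metrics.meanAveragePrecision)  // prec@10 and MAP
    }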
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of Spark's collaborative filtering toolkit to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which only scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
[jira] [Updated] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-6323: Description: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check. For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. was: Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/Proximal.scala In this PR we will re-use ml.recommendation.ALS design and come up with ml.recommendation.ALM (Alternating Minimization). Thanks to [~mengxr]'s recent changes, it's straightforward to do it now! ALM will be capable of solving the following problems: min f(x) + g(z) 1. Loss function f(x) can be LeastSquareLoss, LoglikelihoodLoss and HingeLoss. Most likely we will re-use the Gradient interfaces already defined and implement LoglikelihoodLoss 2. Constraints g(z) supported are same as above except that we don't support affine + bounds (Aeq x = beq, lb <= x <= ub) yet. Most likely we don't need that for ML applications 3. For solver we will use breeze.optimize.proximal.NonlinearMinimizer which in turn uses projection based solver (SPG) or proximal solvers (ADMM) based on convergence speed. https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala 4. The factors will be SparseVector so that we keep shuffle size in check.
For example we will run with 10K ranks but we will force factors to be 100-sparse. This is closely related to Sparse LDA https://issues.apache.org/jira/browse/SPARK-5564 with the difference that we are not using graph representation here. As we do scaling experiments, we will understand which flow is more suited as ratings get denser (my understanding is that since we already scaled ALS to 2 billion ratings and since we will keep sparsity in check, the same 2 billion flow will scale to 10K ranks as well)... This JIRA is intended to extend the capabilities of ml recommendation to generalized loss function. Large rank matrix factorization with Nonlinear loss and constraints --- Key: SPARK-6323 URL: https://issues.apache.org/jira/browse/SPARK-6323 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Debasish Das Fix For: 1.4.0 Original Estimate: 672h Remaining Estimate: 672h Currently ml.recommendation.ALS is optimized for gram matrix generation which scales to modest ranks. The problems that we can solve are in the normal equation/quadratic form: 0.5x'Hx + c'x + g(z) g(z) can be one of the constraints from Breeze proximal library:
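A minimal Breeze sketch of the smooth piece f(x) = 0.5x'Hx + c'x as a DiffFunction, solved unconstrained with LBFGS; the names quadratic, h and c are illustrative, and the ALM work described above would instead hand an objective like this to breeze.optimize.proximal.NonlinearMinimizer so that g(z) can be enforced:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, LBFGS}

// f(x) = 0.5 x'Hx + c'x with gradient Hx + c (H symmetric).
def quadratic(h: DenseMatrix[Double], c: DenseVector[Double]) =
  new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val hx = h * x
      (0.5 * (x dot hx) + (c dot x), hx + c)
    }
  }

// Unconstrained solve with LBFGS; ALM would wrap the same objective with a
// proximal operator for g(z) instead of calling plain LBFGS.
val h = DenseMatrix.eye[Double](5)
val c = DenseVector.fill(5)(-1.0)
val xOpt = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
  .minimize(quadratic(h, c), DenseVector.zeros[Double](5))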
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359892#comment-14359892 ] Debasish Das commented on SPARK-3066: - We use the non-level-3 BLAS code in our internal flows with ~60M x 3M datasets...runtime is decent...I am moving to level-3 BLAS for SPARK-4823 and I think the speed will improve further
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
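A sketch of the three-step batch pattern listed in the issue description, under assumed factor layouts (RDD[(Int, Array[Double])]); recommendAll, rank and k are illustrative names, and the per-user product below is a gemv, so a true level-3 version would additionally block users into matrices:

import breeze.linalg.{argtopk, DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

def recommendAll(
    userFeatures: RDD[(Int, Array[Double])],
    productFeatures: RDD[(Int, Array[Double])],
    rank: Int,
    k: Int): RDD[(Int, Array[(Int, Double)])] = {
  // 1) collect one side (products) and broadcast it as a rank x numProducts matrix
  val products = productFeatures.collect()
  val ids = products.map(_._1)
  val mat = new DenseMatrix(rank, products.length, products.flatMap(_._2))
  val bc = userFeatures.sparkContext.broadcast((ids, mat))
  userFeatures.map { case (user, f) =>
    val (pids, pMat) = bc.value
    // 2) inner products of this user against every product column
    val scores = pMat.t * new DenseVector(f)
    // 3) keep only the top-k, as Utils.takeOrdered would
    (user, argtopk(scores, k).map(i => (pids(i), scores(i))).toArray)
  }
}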
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351839#comment-14351839 ] Debasish Das commented on SPARK-2426: - [~mengxr] NNLS and QuadraticMinimizer are both merged to Breeze...I will migrate ml.recommendation.ALS accordingly...
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351945#comment-14351945 ] Debasish Das commented on SPARK-3066: - [~josephkb] do you mean kNN? For recommendation, until you do the dot products I am not sure how you can find the top-k...level-3 BLAS will definitely give a big boost since it's all blocked dense-with-dense multiplication...For https://issues.apache.org/jira/browse/SPARK-4823 I am looking into dense-dense BLAS and dense-sparse BLAS...ideally there we can add a kNN based optimization followed by the row similarity calculation
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Debasish Das ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das commented on SPARK-4823: - [~mengxr] I need level-3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support a BLAS-3 based dense matrix x CSCMatrix product? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib-java BLAS from mllib and update the results on the JIRA...
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
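For reference, the dense x dense level-3 call being discussed, issued through netlib-java (the library mllib's BLAS wrapper also goes through); the 2x3 / 3x2 shapes are a toy example and arrays are column-major as in Fortran. Dense x sparse CSC has no such BLAS-3 kernel, which is why the question arises:

import com.github.fommil.netlib.BLAS

val m = 2; val n = 2; val k = 3
val a = Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0) // 2x3, column-major
val b = Array(1.0, 0.0, 1.0, 0.0, 1.0, 1.0) // 3x2, column-major
val c = new Array[Double](m * n)
// C := 1.0 * A * B + 0.0 * C
BLAS.getInstance().dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)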
[jira] [Comment Edited] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14351948#comment-14351948 ] Debasish Das edited comment on SPARK-4823 at 3/8/15 6:42 AM: - [~mengxr] I need level 3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675. Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib java BLAS from mllib and update the results on the JIRA... was (Author: debasish83): [~mengxr] I need level 3 BLAS for this JIRA as well as https://issues.apache.org/jira/browse/SPARK-4675...Specifically I am looking for dense matrix x dense matrix and dense matrix x sparse matrix...Does breeze CSCMatrix support BLAS 3 based dense matrix x CSCMatrix product ? I had some code with breeze dot and it was extremely slow...I will migrate the code to netlib java BLAS from mllib and update the results on the JIRA... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This is JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows ( 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:41 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go ~ 10K range) and so I am keen to compare the exact formulation in graphx based LDA flow... Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... I am assuming here that LDA architecture is a bipartite graph with nodes as docs/words and there are counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from it's edges.. was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:51 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go to the ~10K range, sparse) and so I am keen to compare the exact formulation in the graphx based LDA flow... The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least square and MAP losses... I am assuming here that the LDA architecture is a bipartite graph with nodes as docs/words and counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from its edges... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...https://github.com/scalanlp/breeze/pull/364 I am not sure if the current ALS will scale to large ranks (we want to go to the ~10K range) and so I am keen to compare the exact formulation in the graphx based LDA flow... The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 for low ranks and got good improvement in MAP statistics for recommendation workloads...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of the LDA flow and add both the least square and MAP losses... I am assuming here that the LDA architecture is a bipartite graph with nodes as docs/words and counts on each edge...The solver will be run once on every node of each partition after it collects the ratings and factors from its edges...
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta).
For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
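In standard notation, the two alternating subproblems referenced throughout this thread are:

\begin{aligned}
&\min_{w}\; f(w, h^{*}) \quad \text{s.t.}\; \mathbf{1}^{\top} w = 1,\; w \ge 0 \\
&\min_{h}\; f(w^{*}, h) \quad \text{s.t.}\; 0 \le h \le 1, \text{ followed by normalizing each column of } h
\end{aligned}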
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das commented on SPARK-5564: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in the graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w >= 0 (row constraints) minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h Here I want f(w,h) to be the MAP loss, but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare the join based factorization flow (ml.recommendation.ALS) with the graphx based LDA flow... In fact if you think that for large ranks the LDA based flow will be more efficient than the join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses...
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:20 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA flow and add both the least square and MAP losses... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342311#comment-14342311 ] Debasish Das edited comment on SPARK-5564 at 3/1/15 4:19 PM: - I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I am keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... was (Author: debasish83): I am right now using the following PR to do large rank matrix factorization with various constraints...I am not sure if the current ALS will scale to large ranks but I will keen to compare the exact formulation in graphx based LDA flow... https://github.com/scalanlp/breeze/pull/364 Idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko: minimize f(w,h*) s.t 1'w = 1, w =0 (row constraints) minimize f(w*,h) s.t 0 = h = 1, Normalize each column in h Here I want f(w,h) to be MAP loss but I already solved the least square variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good improvement in MAP statistics...Here also I expect Perplexity will improve... If no one else is looking into it I would like to compare join based factorization based flow (ml.recommendation.ALS) with the graphx based LDA flow... Infact if you think for large ranks, LDA based flow will be more efficient than join based factorization flow, I can implement stochastic matrix factorization directly on top of LDA and add both the least square and MAP losses... Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be 1.0. It should support values 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342312#comment-14342312 ] Debasish Das commented on SPARK-5564: - By the way, the following step is an approximation to the real constraint, but if we get good results over Gibbs sampling based approaches, there are ways to solve the real problem as well... minimize f(w*,h) s.t 0 <= h <= 1, normalize each column in h
Support sparse LDA solutions Key: SPARK-5564 URL: https://issues.apache.org/jira/browse/SPARK-5564 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be >= 1.0. It should support values >= 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta). For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. 2014. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
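The row constraint 1'w = 1, w >= 0 is a probability-simplex constraint; a self-contained sketch of the Euclidean projection onto it (the sort-based algorithm of Duchi et al. 2008, independent of the Breeze PR above) that an M-step projection could use:

def projectSimplex(v: Array[Double]): Array[Double] = {
  // Sort descending, then find the threshold theta so that
  // w_i = max(v_i - theta, 0) sums to 1.
  val u = v.sorted(Ordering[Double].reverse)
  var cum = 0.0
  var theta = 0.0
  for (j <- u.indices) {
    cum += u(j)
    val t = (cum - 1.0) / (j + 1)
    if (u(j) - t > 0) theta = t
  }
  v.map(x => math.max(x - theta, 0.0))
}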
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302932#comment-14302932 ] Debasish Das commented on SPARK-2426: - [~mengxr] [~coderxiang] David is out in Feb and I am not sure if we can cut a breeze release with the code. I refactored NNLS to breeze.optimize.linear due to its similarity to the CG core. Proximal algorithms and QuadraticMinimizer are refactored to breeze.optimize.proximal. It will be great if you could also review the PR https://github.com/scalanlp/breeze/pull/321. With this solver added to Breeze I am ready to add the ALS modifications to Spark. The test-cases for default ALS and NNLS run fine with my Spark PR. I need to add appropriate test-cases for sparse coding and least square loss with LSA constraints as explained above. Should I add them to ml.als or mllib.als, since we now have two codebases? My current PR will merge fine with mllib.als but not with ml.als. I see there is a CholeskySolver, but all those features are supported in breeze.optimize.proximal.QuadraticMinimizer.
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026 ] Debasish Das commented on SPARK-4675: - Is there a metric like MAP / AUC kind of measure that can help us validate similarUsers and similarProducts ? Right now if I run column similarities with sparse vector on matrix factorization datasets for product similarities, it will assume all unvisited entries (which should be ?) as 0 and compute column similarities for...If the sparse vector has ? in place of 0 then basically all similarity calculation is incorrect...so in that sense it makes more sense to compute the similarities on the matrix factors... But then we are back to map-reduce calculation of rowSimilarities. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048 ] Debasish Das commented on SPARK-4823: - [~srowen] did you implement map-reduce row similarities for user factors ? What's the algorithm that you used ? Any pointers will be really helpful... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This is JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows ( 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149 ] Debasish Das commented on SPARK-2426: - [~mengxr] as per our discussion, QuadraticMinimizer and NNLS are both added to breeze and updated with breeze DenseMatrix and DenseVector...Inside breeze I did some interesting comparisons and that motivated me to port NNLS to breeze as well...I added all the test cases for QuadraticMinimizer and NNLS based on my experiments with the MovieLens dataset... Here is the PR: https://github.com/scalanlp/breeze/pull/321 To run the quadratic programming variants in breeze: runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99 regParam = 0.1; beta = 0.99 is the elastic net parameter. It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to the PDCO QP generator (please look into my Matlab examples): 0.5x'Hx + c'x s.t Ax = B, lb <= x <= ub
1. Unconstrained minimization: breeze luSolve, cg and qp (dposv added to breeze through this PR). Minimize 0.5x'Hx + c'x
||qp - lu|| norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10
||cg - lu|| norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7
dim 100: lu 86.007 qp 41.56 cg 102.627
||qp - lu|| norm 4.267891623199082E-8 max-norm 6.681460718027665E-9
||cg - lu|| norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8
dim 500: lu 169.993 qp 78.001 cg 443.044
qp is faster than cg for smaller dimensions, as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it.
2. Elastic net formulation: 0.5 x'Hx + c'x + (1-beta)*L2(x) + beta*regParam*L1(x)
beta = 0.99 (strong L1), regParam = 0.1: ||owlqn - sparseqp|| norm 0.1653200701235298 inf-norm 0.051855911945906996; sparseQp 61.948 ms iters 227; owlqn 928.11 ms
beta = 0.5 (average L1), regParam = 0.1: ||owlqn - sparseqp|| norm 0.15823773098501168 inf-norm 0.035153837685728107; sparseQp 69.934 ms iters 353; owlqn 882.104 ms
beta = 0.01 (mostly BFGS), regParam = 0.1: ||owlqn - sparseqp|| norm 0.17950035092790165 inf-norm 0.04718697692014828; sparseQp 80.411 ms iters 580; owlqn 988.313 ms
The ADMM based proximal formulation is faster for smaller dimensions. Even as I scale the dimension, I notice similar behavior: owlqn takes longer to converge and the results are not the same. Look for example at the dim = 500 case: ||owlqn - sparseqp|| norm 10.946326189397649 inf-norm 1.412726586317294; sparseQp 830.593 ms iters 2417; owlqn 19848.932 ms I validated ADMM through Matlab scripts, so there is something funky going on in OWLQN.
3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0 Here I compared the ADMM based proximal formulation with the CG based projected gradient in NNLS. NNLS converges much more nicely, but the convergence criteria do not look the same as breeze CG although they should be. For now I ported it to breeze, and we can call NNLS for x >= 0 and QuadraticMinimizer for the other formulations.
dim = 100: posQp 16.367 ms iters 284; nnls 8.854 ms iters 107
dim = 500: posQp 303.184 ms iters 950; nnls 183.543 ms iters 517
NNLS on average looks 2X faster!
4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub Validated through the Matlab scripts above. Here are the runtime numbers:
dim = 100: boundsQp 15.654 ms iters 284 converged true
dim = 500: boundsQp 311.613 ms iters 950 converged true
5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >= 0 Validated through the Matlab scripts above.
Here are the runtime numbers:
dim = 100: Qp Equality 13.64 ms iters 184 converged true
dim = 500: Qp Equality 278.525 ms iters 890 converged true
With this change all copyrights are moved to breeze. Once it merges, I will update the Spark PR. With this change we can move the ALS code to Breeze DenseMatrix and DenseVector as well. My focus next will be to get a truncated Newton method running for convex cost, since convex cost is required for PLSA, SVM and Neural Net formulations... I am still puzzled why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking more into it.
Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime
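For the elastic net runs above, the OWLQN side can be reproduced in Breeze by folding the L2 term into the smooth objective and letting OWLQN own the L1 term; a sketch, assuming the OWLQN(maxIter, m, l1reg) convenience constructor and with h and c as illustrative stand-ins for the gram matrix and linear term:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.optimize.{DiffFunction, OWLQN}

val regParam = 0.1; val beta = 0.99
val n = 100
val h = DenseMatrix.eye[Double](n)  // stand-in for the gram matrix H
val c = DenseVector.fill(n)(-1.0)   // stand-in for the linear term c
val l2 = (1 - beta) * regParam

// Smooth part: 0.5 x'Hx + c'x + 0.5 * l2 * ||x||^2; OWLQN adds beta*regParam*L1.
val smooth = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val hx = h * x
    (0.5 * (x dot hx) + (c dot x) + 0.5 * l2 * (x dot x), hx + c + x * l2)
  }
}
val owlqn = new OWLQN[Int, DenseVector[Double]](100, 7, beta * regParam)
val xOpt = owlqn.minimize(smooth, DenseVector.zeros[Double](n))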
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207 ] Debasish Das commented on SPARK-4823: - Even for matrix factorization, userFactors are user x rank...with a modest rank of 50 and users at 10M, I don't think it is possible to transpose the matrix and run column similarities...doing it on the fly is still O(n*n) complexity-wise, right...
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456 ] Debasish Das commented on SPARK-2426: - [~akopich] I got good MAP results on recommendation datasets with the approximated PLSA formulation. I did not get time to compare that formulation with Gibbs sampling based LDA PR: https://issues.apache.org/jira/browse/SPARK-1405 yet. Did you compare them ? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241535#comment-14241535 ] Debasish Das commented on SPARK-4675: - There are a few issues: 1. Batch API for topK similar users and topK similar products 2. Comparison of product x product similarities generated with columnSimilarities against topK similar products I added batch APIs for topK product recommendation for each user and topK user recommendation for each product in SPARK-4231...a similar batch API will be very helpful for topK similar users and topK similar products... I agree with Cosine Similarity...you should be able to re-use the column similarity calculations...I think a better idea is to add rowMatrix.similarRows and re-use that code to generate product similarities and user similarities... But my question is more on validation. We can compute product similarities on raw features and we can compute product similarities on the matrix product factors...which one is better?
Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242031#comment-14242031 ] Debasish Das commented on SPARK-4823: - I am considering coming up with a baseline version that's very close to brute force, but we cut the computation with a topK number...for each user come up with topK users, where K is defined by the client...this will take care of the matrix factorization use-case... Basically on the master we collect a set of user factors, broadcast it to every node and do a reduceByKey to generate topK users for each user from this user block...We send a kernel function (cosine / polynomial / rbf) into this calculation... But this idea does not work for raw features, right...If we map features to a lower dimension using factorization then this approach should run fine...but I am not sure if we can ask users to map their data into a lower dimension... Is it possible to bring in ideas from fastfood and kitchen sinks to do this?
rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
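A sketch of that baseline: broadcast the collected factors, score every pair under a pluggable kernel, and keep topK per user. The brute-force O(n^2) scan and the kernel choices are as described in the comment; the types and the similarUsers name are illustrative:

import org.apache.spark.rdd.RDD

type Kernel = (Array[Double], Array[Double]) => Double

val cosine: Kernel = (a, b) => {
  var dot = 0.0; var na = 0.0; var nb = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1 }
  dot / math.sqrt(na * nb)
}

def similarUsers(
    factors: RDD[(Int, Array[Double])],
    kernel: Kernel,
    topK: Int): RDD[(Int, Array[(Int, Double)])] = {
  // Brute-force baseline: every factor is shipped to every node.
  val all = factors.sparkContext.broadcast(factors.collect())
  factors.map { case (u, f) =>
    val scored = all.value.iterator
      .filter(_._1 != u)
      .map { case (v, g) => (v, kernel(f, g)) }
      .toArray
    (u, scored.sortBy(-_._2).take(topK))
  }
}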
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242034#comment-14242034 ] Debasish Das commented on SPARK-4675: - [~josephkb] how do we validate that low dimension space is giving more meaningful similarities than the feature space (which is sparse) ? Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024 ] Debasish Das commented on SPARK-1405: - We need a larger dataset as well, where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is a broadcast of factors at each step...NIPS dataset is small...would you guys be willing to test a wikipedia dataset for example? If there is a pre-processed version from either mahout or scikit-learn, can we use that?
parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222024#comment-14222024 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM: --- We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each stepNIPS dataset is small...Let's start with that...But we should test a large dataset like wikipedia as well..If there is a pre-processed version from either mahout or scikit-learn we can use that ? was (Author: debasish83): We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each stepNIPS dataset is small...you guy's will be willing to test a wikipedia dataset for example ? If there is a pre-processed version from either mahout or scikit-learn we can use that ? parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222027#comment-14222027 ] Debasish Das commented on SPARK-1405: - [~pedrorodriguez] did you write the metric in your repo as well ? That way I don't have to code it up again.. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222089#comment-14222089 ] Debasish Das commented on SPARK-1405: - NIPS dataset is common for PLSA and additive regularization based matrix factorization formulations as well since the experiments in this paper focused on the NIPS dataset as well... http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf I will be using NIPS dataset for quality experiments but for scaling experiments, wiki data is good...wiki data was demo-ed by Databricks in last spark summit...it will be great if we can get it from that demo parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108 ] Debasish Das commented on SPARK-1405: - @sparks that will be awesome...I should be fine running experiments on EC2... parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222108#comment-14222108 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM: --- [~sparks] that will be awesome...I should be fine running experiments on EC2... was (Author: debasish83): @sparks that will be awesome...I should be fine running experiments on EC2... parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221379#comment-14221379 ] Debasish Das commented on SPARK-3066: - I did experiments on the MovieLens dataset with varying rank on my localhost spark (4 GB RAM, 4 cores) to see how much MAP improvement we get as the rank is scaled. Every run got 1000209 ratings from 6040 users on 3706 movies; the train/test split varies slightly per run:
rank | training | test   | Test RMSE          | test users | MAP                  | runtime
10   | 799747   | 200462 | 0.8528377625407709 | 6036       | 0.03851426277536059  | 30s
25   | 800417   | 199792 | 0.8518001349769724 | 6037       | 0.04508057348514959  | 30s
50   | 800823   | 199386 | 0.8487416471685229 | 6038       | 0.05145126538369158  | 42s
100  | 800720   | 199489 | 0.8508095863317275 | 6033       | 0.0561225429735388   | 1.5m
150  | 800257   | 199952 | 0.8435902056186158 | 6035       | 0.05855252471878818  | 3.6m
200  | 800356   | 199853 | 0.8452385688272382 | 6037       | 0.059176892052172934 | 7.4m
(rank=10 is the default.) I will run through the MovieLens10m and Netflix datasets and generate the numbers with varying ranks as well. I need to run them on a cluster.
Support recommendAll in matrix factorization model -- Key: SPARK-3066 URL: https://issues.apache.org/jira/browse/SPARK-3066 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimization like 1) collect one side (either user or product) and broadcast it as a matrix 2) use level-3 BLAS to compute inner products 3) use Utils.takeOrdered to find top-k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221505#comment-14221505 ] Debasish Das edited comment on SPARK-1405 at 11/21/14 10:28 PM: [~witgo] where can I access your dataset? I got the NIPS dataset from Pedro, but the runtimes reported here are on a different dataset...also, should we use the same accuracy measure that Pedro is using?
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
>                  Key: SPARK-1405
>                  URL: https://issues.apache.org/jira/browse/SPARK-1405
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xusen Yin
>             Assignee: Guoqiang Li
>             Priority: Critical
>               Labels: features
>          Attachments: performance_comparison.png
>    Original Estimate: 336h
>   Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4231) Add RankingMetrics to examples.MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219799#comment-14219799 ] Debasish Das commented on SPARK-4231: - [~srowen] I added batch predict APIs for user and product recommendation in the PR (both batch and mini-batch APIs that take a per-user topK for products). For the MAP calculation code, I have still kept it in examples.MovieLensALS in this PR, but I feel it should be part of MatrixFactorizationModel as well, where a client can send an RDD of userIDs and the list of products they don't want recommended in the topK...I will add this API to MatrixFactorizationModel if it makes sense...
> Add RankingMetrics to examples.MovieLensALS
> ---
>
>                  Key: SPARK-4231
>                  URL: https://issues.apache.org/jira/browse/SPARK-4231
>              Project: Spark
>           Issue Type: Improvement
>           Components: Examples
>     Affects Versions: 1.2.0
>             Reporter: Debasish Das
>              Fix For: 1.2.0
>    Original Estimate: 24h
>   Remaining Estimate: 24h
>
> examples.MovieLensALS computes RMSE for the MovieLens dataset, but after the addition of RankingMetrics and enhancements to ALS, it is critical to look not only at RMSE but also at measures like prec@k and MAP. In this JIRA we added RMSE and MAP computation to examples.MovieLensALS and also added a flag that indicates whether user/product recommendation is being validated.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
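For reference, the MAP and prec@k measures discussed here map directly onto mllib's RankingMetrics; a minimal sketch, assuming per-user predicted rankings and held-out relevant items are already available as RDDs (the function and parameter names are mine):

import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// predictions: per-user product ids ranked by predicted score (e.g. topK recommendations);
// groundTruth: per-user relevant product ids from the held-out test set.
def meanAveragePrecision(predictions: RDD[(Int, Array[Int])],
                         groundTruth: RDD[(Int, Array[Int])]): Double = {
  // Pair each user's ranked predictions with that user's relevant set.
  val predictionAndLabels = predictions.join(groundTruth).values
  val metrics = new RankingMetrics(predictionAndLabels)
  metrics.meanAveragePrecision // metrics.precisionAt(k) gives prec@k
}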
[jira] [Comment Edited] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667 ] Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM: [~mengxr] as per our discussions, I added APIs for batch user and product recommendation and MAP computation for recommending topK products per user. Note that I don't use reservoir sampling; instead I used your idea of filtering out the test set users for which no model was built...I thought reservoir sampling should be part of a separate PR.

APIs added:
recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every user
recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product

I used mllib BLAS for all the computation and cleaned the DoubleMatrix code out of MatrixFactorizationModel. I have not used level-3 BLAS yet; I can add that as well if the rest of the flow makes sense.

On examples.MovieLensALS we can activate the user MAP calculation using the --validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 --validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. Training: 799617, test: 200592. Test RMSE = 0.8495476608536306. Test users 6032, MAP 0.03798337814233403

I will run the Netflix dataset and report the MAP measures for that. On our internal datasets, I have tested 1M users and 10K products on 120 cores with 240 GB, computing topK users for each product; that takes around 5 mins, and on average I generate a ranked list of 6000 users per product. Internally we are using the batch API recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product.
> Support recommendAll in matrix factorization model
> --
>
>                  Key: SPARK-3066
>                  URL: https://issues.apache.org/jira/browse/SPARK-3066
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimizations: 1) collect one side (either user or product) and broadcast it as a matrix, 2) use level-3 BLAS to compute inner products, 3) use Utils.takeOrdered to find the top-k.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218667#comment-14218667 ] Debasish Das commented on SPARK-3066: - @mengxr as per our discussions, I added APIs for batch user and product recommendation and MAP computation for recommending topK products per user. Note that I don't use reservoir sampling; instead I used your idea of filtering out the test set users for which no model was built...I thought reservoir sampling should be part of a separate PR.

APIs added:
recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every user
recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product

I used mllib BLAS for all the computation and cleaned the DoubleMatrix code out of MatrixFactorizationModel. I have not used level-3 BLAS yet; I can add that as well if the rest of the flow makes sense.

On examples.MovieLensALS we can activate the user MAP calculation using the --validateRecommendation flag:

./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class org.apache.spark.examples.mllib.MovieLensALS ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065 --validateRecommendation hdfs://localhost:8020/sandbox/movielens/

Got 1000209 ratings from 6040 users on 3706 movies. Training: 799617, test: 200592. Test RMSE = 0.8495476608536306. Test users 6032, MAP 0.03798337814233403

I will run the Netflix dataset and report the MAP measures for that. On our internal datasets, I have tested 1M users and 10K products on 120 cores with 240 GB, computing topK users for each product; that takes around 5 mins, and on average I generate a ranked list of 6000 users per product. Internally we are using the batch API recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product.
> Support recommendAll in matrix factorization model
> --
>
>                  Key: SPARK-3066
>                  URL: https://issues.apache.org/jira/browse/SPARK-3066
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings for individual queries as well as small batches. In practice, users may want to compute top-k recommendations offline for all users. It is very expensive but a common problem. We can do some optimizations: 1) collect one side (either user or product) and broadcast it as a matrix, 2) use level-3 BLAS to compute inner products, 3) use Utils.takeOrdered to find the top-k.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
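A sketch of how the batch APIs above might be driven end to end, assuming an active SparkContext sc (e.g. in spark-shell), the MovieLens ratings.dat layout, and the RDD-based signatures exactly as proposed in this PR (they may differ from what was eventually merged; the per-product topK policy is purely illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def batchRecommend(sc: SparkContext): Unit = {
  // Parse MovieLens-style "user::product::rating" lines.
  val ratings: RDD[Rating] = sc.textFile("hdfs://localhost:8020/sandbox/movielens/ratings.dat")
    .map(_.split("::"))
    .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))
  val model = ALS.train(ratings, 100 /* rank */, 10 /* iterations */, 0.065 /* lambda */)

  // Fixed topK for all users (signature as proposed in the PR).
  val top10PerUser = model.recommendProductsForUsers(10)

  // Variable topK per product, e.g. capped by each product's rating count
  // (an illustrative policy, not one from the PR).
  val productTopK: RDD[(Int, Int)] =
    ratings.map(r => (r.product, 1)).reduceByKey(_ + _).mapValues(math.min(_, 6000))
  val usersPerProduct = model.recommendUsersForProducts(productTopK)

  println(s"users covered: ${top10PerUser.count()}, products covered: ${usersPerProduct.count()}")
}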
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218845#comment-14218845 ] Debasish Das commented on SPARK-1405: - I would like to compare the LSA formulations (sparse coding and PLSA with least square loss) from https://issues.apache.org/jira/browse/SPARK-2426 with LDA. I added a MAP metric for examples.MovieLensALS in https://issues.apache.org/jira/browse/SPARK-4231, but I am not sure how MAP can be used for topic modeling datasets...we need some perplexity measure... Could you guys please point me to the dataset and the quality measures that are being benchmarked on the LDA PR, so that I can also test the LSA formulations in parallel?
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
>                  Key: SPARK-1405
>                  URL: https://issues.apache.org/jira/browse/SPARK-1405
>              Project: Spark
>           Issue Type: New Feature
>           Components: MLlib
>             Reporter: Xusen Yin
>             Assignee: Guoqiang Li
>             Priority: Critical
>               Labels: features
>          Attachments: performance_comparison.png
>    Original Estimate: 336h
>   Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
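Since MAP doesn't transfer to topic models, the perplexity measure asked for above would look roughly like the following; a minimal sketch assuming per-token held-out log-likelihoods are computed elsewhere by the model (the function name is mine):

import org.apache.spark.rdd.RDD

// Held-out perplexity = exp( -(sum of per-token log-likelihoods) / (token count) ).
// tokenLogLikelihood: log p(w | model) for each held-out token.
def perplexity(tokenLogLikelihood: RDD[Double]): Double = {
  val (sumLL, numTokens) = tokenLogLikelihood
    .map(ll => (ll, 1L))
    .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  math.exp(-sumLL / numTokens)
}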
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218891#comment-14218891 ] Debasish Das commented on SPARK-2426: - With the MAP measures added to examples.MovieLensALS through https://issues.apache.org/jira/browse/SPARK-4231, I compared the quality and runtime of the matrix completion formulations on the MovieLens 1M dataset:

Default: userConstraint L2, productConstraint L2, lambdaUser=lambdaProduct=0.065, rank=100, iterations=10
Test RMSE = 0.8436480113821955. Test users 6038, MAP 0.05860164548002782
Solver: Cholesky decomposition followed by forward-backward solves

Per-iteration runtime for the baseline (solveTime in ms):
14/11/19 17:37:06 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 362.813 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 314.527 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 265.75 Iters 0
14/11/19 17:37:06 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 271.513 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 370.177 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 467.994 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 511.894 Iters 0
14/11/19 17:37:09 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 481.189 Iters 0

NMF: userConstraint POSITIVE, productConstraint POSITIVE, userLambda=productLambda=0.065 L2 regularization
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization, userConstraint POSITIVE, productConstraint POSITIVE
Test RMSE = 0.8435335132641906. Test users 6038, MAP 0.056361816590625446

ALS iteration 1 runtime, QuadraticMinimizer convergence profile:
14/11/19 17:46:46 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1936.281 Iters 73132
14/11/19 17:46:46 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1871.364 Iters 75219
14/11/19 17:46:46 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 2067.735 Iters 73180
14/11/19 17:46:46 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 2127.161 Iters 75546
14/11/19 17:46:53 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 3813.923 Iters 193207
14/11/19 17:46:54 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 3894.068 Iters 196882
14/11/19 17:46:54 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 3875.915 Iters 193987
14/11/19 17:46:54 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 3939.765 Iters 192471

NNLS convergence profile:
14/11/19 17:46:46 INFO ALS: NNLS solveTime 252.909 iters 7381
14/11/19 17:46:46 INFO ALS: NNLS solveTime 256.803 iters 7740
14/11/19 17:46:46 INFO ALS: NNLS solveTime 274.352 iters 7491
14/11/19 17:46:46 INFO ALS: NNLS solveTime 272.971 iters 7664
14/11/19 17:46:53 INFO ALS: NNLS solveTime 1487.262 iters 60338
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1472.742 iters 61321
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1489.863 iters 62228
14/11/19 17:46:54 INFO ALS: NNLS solveTime 1494.192 iters 60489

ALS iteration 10, QuadraticMinimizer convergence profile:
14/11/19 17:48:17 INFO ALS: usersOrProducts 924 slowConvergence 0 QuadraticMinimizer solveTime 1082.056 Iters 53724
14/11/19 17:48:17 INFO ALS: usersOrProducts 910 slowConvergence 0 QuadraticMinimizer solveTime 1180.601 Iters 50593
14/11/19 17:48:17 INFO ALS: usersOrProducts 927 slowConvergence 0 QuadraticMinimizer solveTime 1106.131 Iters 53069
14/11/19 17:48:17 INFO ALS: usersOrProducts 918 slowConvergence 0 QuadraticMinimizer solveTime 1108.478 Iters 50895
14/11/19 17:48:23 INFO ALS: usersOrProducts 1510 slowConvergence 0 QuadraticMinimizer solveTime 2262.193 Iters 116818
14/11/19 17:48:23 INFO ALS: usersOrProducts 1512 slowConvergence 0 QuadraticMinimizer solveTime 2293.64 Iters 116026
14/11/19 17:48:23 INFO ALS: usersOrProducts 1507 slowConvergence 0 QuadraticMinimizer solveTime 2241.491 Iters 116293
14/11/19 17:48:23 INFO ALS: usersOrProducts 1511 slowConvergence 0 QuadraticMinimizer solveTime 2372.957 Iters 118391

NNLS convergence profile:
14/11/19 17:48:17 INFO ALS: NNLS solveTime 623.031 iters 21611
14/11/19 17:48:17 INFO ALS: NNLS solveTime 553.493 iters 21732
14/11/19 17:48:17 INFO ALS: NNLS solveTime 559.9 iters 22511
14/11/19 17:48:17 INFO ALS: NNLS solveTime 556.654 iters 21330
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1672.582 iters 86006
14/11/19 17:48:23 INFO ALS: NNLS solveTime 1703.221 iters 85824
14/11/19 17:48:23 INFO ALS: NNLS
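For comparison, the NNLS baseline profiled above corresponds to what stock MLlib ALS does when nonnegativity is enabled; a minimal sketch with the settings from this comment, other settings left at their defaults (the function name is mine, and the claim that setNonnegative routes through NNLS is my reading of the mllib code, not something stated in this thread):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// NMF-style run: nonnegativity on both user and product factors.
def trainNonnegativeAls(ratings: RDD[Rating]) =
  new ALS()
    .setRank(100)         // rank used in the experiments above
    .setIterations(10)
    .setLambda(0.065)     // matches lambdaUser = lambdaProduct = 0.065
    .setNonnegative(true) // POSITIVE constraint on both factor matrices
    .run(ratings)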