[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395014#comment-16395014 ] Valeriy Avanesov commented on SPARK-23437: -- So, the basic implementation is ready. Please feel free to try it out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383980#comment-16383980 ] Valeriy Avanesov commented on SPARK-23437: -- I've created a repo. https://github.com/akopich/spark-gp
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381831#comment-16381831 ] Valeriy Avanesov commented on SPARK-23437: -- What does the assignment to Apache Spark mean?
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368198#comment-16368198 ] Valeriy Avanesov commented on SPARK-23437: -- [~sethah], thanks for your input. I believe GPflow implements linear-time GP. However, it is not distributed. Regarding investigation of user demand: can't we just hold a vote among the users?
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366857#comment-16366857 ] Valeriy Avanesov commented on SPARK-23437: -- [~mlnick], is that really supposed to happen to a textbook algorithm filling a vacuum? There are currently no non-parametric regression techniques in MLlib that infer a smooth function. Regarding the guidelines, the requirements for the algorithm are
# Be widely known
# Be used and accepted (academic citations and concrete use cases can help justify this)
# Be highly scalable
and I think all of them hold (see the original post).
[jira] [Updated] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valeriy Avanesov updated SPARK-23437: - Summary: [ML] Distributed Gaussian Process Regression for MLlib (was: Distributed Gaussian Process Regression for MLlib)
[jira] [Created] (SPARK-23437) Distributed Gaussian Process Regression for MLlib
Valeriy Avanesov created SPARK-23437: Summary: Distributed Gaussian Process Regression for MLlib Key: SPARK-23437 URL: https://issues.apache.org/jira/browse/SPARK-23437 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 2.2.1 Reporter: Valeriy Avanesov

Gaussian Process Regression (GP) is a well-known black-box non-linear regression approach [1]. For years the approach remained inapplicable to large samples due to its cubic computational complexity; more recent techniques (Sparse GP), however, reduced this to linear complexity. The field continues to attract the interest of researchers – several papers devoted to GP were presented at NIPS 2017.

Unfortunately, the non-parametric regression techniques shipped with MLlib are restricted to tree-based approaches. I propose to create and include an implementation (which I am going to work on) of the so-called robust Bayesian Committee Machine proposed and investigated in [2].

[1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)_. The MIT Press.
[2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. In _Proceedings of the 32nd International Conference on Machine Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. JMLR.org, 1481-1490.
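The robust Bayesian Committee Machine of [2] trains independent GP experts on disjoint partitions of the data and combines their predictive distributions per test point. A minimal NumPy sketch of the aggregation rule (function and variable names are illustrative, not part of any proposed MLlib API):

```python
import numpy as np

def rbcm_combine(means, variances, prior_var):
    """Combine GP expert predictions at one test point with the robust
    BCM rule of Deisenroth & Ng (2015). Expert k gets weight
    beta_k = 0.5 * (log prior_var - log var_k), so an expert whose
    predictive variance equals the prior variance gets zero weight."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    beta = 0.5 * (np.log(prior_var) - np.log(variances))
    # Combined precision: weighted expert precisions plus a prior
    # correction term (1 - sum beta) / prior_var that keeps the
    # aggregated distribution calibrated.
    precision = np.sum(beta / variances) + (1.0 - np.sum(beta)) / prior_var
    var = 1.0 / precision
    # Combined mean: precision-weighted average of the expert means.
    mean = var * np.sum(beta * means / variances)
    return mean, var
```

Note the built-in robustness that makes the rule attractive for data-parallel training: an expert whose partition contains no relevant data falls back to the prior (var_k ≈ prior_var), so its weight beta_k ≈ 0 and it is effectively ignored.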
[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123803#comment-16123803 ] Valeriy Avanesov edited comment on SPARK-5564 at 8/12/17 10:29 AM: --- I am considering working on this issue. The question is whether there should be another EMLDAOptimizerVorontsov or whether the existing EMLDAOptimizer should be rewritten. [~josephkb], what are your thoughts?

> Support sparse LDA solutions
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ concentration parameters be > 1.0. It should support values > 0.0, which should encourage sparser topics (phi) and document-topic distributions (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization." 2014.
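The projection mentioned in the issue description can be sketched concretely: with a concentration parameter below 1.0 the MAP M-step estimate n_wt + alpha - 1 can go negative, so it is clipped at zero and renormalized, which produces exact zeros, i.e. sparse topics. A schematic NumPy illustration (my rendering of the Vorontsov–Potapenko M-step, not Spark code):

```python
import numpy as np

def projected_m_step(counts, alpha):
    """M-step for the word-topic matrix phi under a Dirichlet(alpha) prior.
    counts[w, t] are expected word-topic counts from the E-step.
    For alpha < 1 the MAP estimate counts + alpha - 1 can be negative;
    projecting (clip at zero, renormalize columns) yields exact zeros."""
    phi = np.maximum(counts + alpha - 1.0, 0.0)
    col_sums = phi.sum(axis=0)
    # Guard against all-zero columns before normalizing.
    col_sums[col_sums == 0.0] = 1.0
    return phi / col_sums
```

With alpha = 1.0 this reduces to the usual maximum-likelihood M-step; the smaller alpha gets, the more entries are driven to exactly zero.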
[jira] [Commented] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver
[ https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124531#comment-16124531 ] Valeriy Avanesov commented on SPARK-14371: -- Hi, I've opened a PR regarding this Jira yesterday: https://github.com/apache/spark/pull/18924 However, something seems to be wrong -- the Jira is still not "In Progress" and the PR is not linked to it. Could anyone please check what's wrong?

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
>
> See this line: https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each document in the mini-batch. Those are collected to the driver in this line: https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver. Rather, we should do the necessary maps and aggregations in a distributed manner. This will involve modifying the Dirichlet expectation implementation. (This JIRA should be done by someone knowledgeable about online LDA and Spark.)
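The fix sketched in the description — sum per-document sufficient statistics on the executors instead of collecting one vector per document to the driver — follows the standard seqOp/combOp aggregation pattern. A local Python stand-in for the RDD aggregation (the names and shapes are illustrative, not the actual patch):

```python
import numpy as np
from functools import reduce

def aggregate_stats(partitions):
    """Each partition sums its documents' statistics vectors locally
    (seq_op, which would run on the executor); the driver then only
    merges one small vector per partition (comb_op), instead of
    receiving one vector per document."""
    def seq_op(acc, doc_stats):
        return acc + doc_stats        # executor-side accumulation
    def comb_op(acc1, acc2):
        return acc1 + acc2            # driver-side merge of partials
    partials = [reduce(seq_op, part, np.zeros_like(part[0]))
                for part in partitions]
    return reduce(comb_op, partials)
```

In Spark this corresponds to replacing the `collect()` of per-document vectors with a `treeAggregate` over the mini-batch RDD, so driver memory no longer scales with the number of documents in the batch.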
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123803#comment-16123803 ] Valeriy Avanesov commented on SPARK-5564: - I am considering working on this issue. The question is whether there should be another EMLDAOptimizerVorontsov or shall the existing EMLDAOptimizer be re-written.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273551#comment-14273551 ] Valeriy Avanesov commented on SPARK-1405: - [~josephkb], I've read your proposal and I suggest considering Stochastic Gradient Langevin Dynamics [1]. It was shown to be ~100 times faster than Gibbs sampling [2]. I'm not sure, though, whether it's implementable in terms of RDDs.
[1] http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf
[2] http://www.ics.uci.edu/~sungjia/icml2014_dist_v0.2.pdf

parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h

Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
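The SGLD update suggested in the comment above is a stochastic gradient step on the log-posterior plus Gaussian noise whose variance matches the step size. A toy NumPy sketch, sampling the mean of a 1-D Gaussian model rather than an LDA posterior (the step size, batch size, and prior are illustrative choices):

```python
import numpy as np

def sgld_sample(data, n_steps=5000, eps=1e-3, seed=0):
    """SGLD for the mean theta of a N(theta, 1) likelihood with a
    N(0, 10) prior, using minibatches of the data. Each step:
    theta += eps/2 * (grad log prior + (N/n) * sum grad log lik) + noise,
    with noise ~ N(0, eps), so the chain targets the posterior."""
    rng = np.random.default_rng(seed)
    N, batch = len(data), 10
    theta, samples = 0.0, []
    for _ in range(n_steps):
        mb = rng.choice(data, size=batch)           # minibatch
        grad_prior = -theta / 10.0                  # d/dtheta log N(0, 10)
        grad_lik = (N / batch) * np.sum(mb - theta) # rescaled likelihood grad
        theta += 0.5 * eps * (grad_prior + grad_lik) \
                 + rng.normal(0.0, np.sqrt(eps))    # injected Langevin noise
        samples.append(theta)
    return np.array(samples)
```

The appeal for LDA is that each step touches only a minibatch, unlike a full Gibbs sweep; the open question raised in the comment is whether the per-step state updates map onto RDD transformations.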
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243398#comment-14243398 ] Valeriy Avanesov commented on SPARK-2426: -
> what's the normalization constraint? Each row of W should sum up to 1 and each column of H should sum up to 1 with positivity?
Yes.
> That is similar to PLSA right except that PLSA will have a bi-concave loss...
There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up.

Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h

Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds. Initial runtime comparisons are presented at Spark Summit: http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS
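The constrained objective discussed in this thread — a Frobenius loss with l2 regularization, nonnegativity, and rows of W / columns of H summing to one — can be written down concretely. A small NumPy sketch of the loss and the projection onto the constraint set (illustrative only, not the proposed solver):

```python
import numpy as np

def regularized_loss(R, W, H, lam):
    """||R - W H||_F^2 + lambda * (||W||_F^2 + ||H||_F^2)."""
    resid = R - W @ H
    return (np.sum(resid ** 2)
            + lam * (np.sum(W ** 2) + np.sum(H ** 2)))

def normalize_stochastic(W, H):
    """Project onto the constraint set: W and H nonnegative, each row
    of W and each column of H summing to 1 (row-/column-stochastic)."""
    W = np.maximum(W, 0.0)
    H = np.maximum(H, 0.0)
    W = W / W.sum(axis=1, keepdims=True)
    H = H / H.sum(axis=0, keepdims=True)
    return W, H
```

The actual solvers in the ticket (ADMM/IPM quadratic minimization) alternate minimizing this loss in W with H fixed and vice versa, applying such a projection to enforce the constraints.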
[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231375#comment-14231375 ] Valeriy Avanesov edited comment on SPARK-2426 at 12/2/14 11:47 AM: --- I'm not sure I understand your question... As far as I can see, w_i stands for a row of the matrix W and h_j stands for a column of the matrix H.
\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing an abs or a square: \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like l2-regularized stochastic matrix decomposition with respect to the Frobenius (or l1) norm. But I don't understand why you consider k optimization problems (do you? What does k \in {1 ... 25} stand for?). Anyway, the l2-regularized stochastic matrix decomposition problem is defined as follows:
Minimize w.r.t. W and H: ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativity and normalization constraints. ||.|| stands for the Frobenius norm (or l1).
By the way: is the matrix of ranks R stochastic? Stochastic matrix decomposition doesn't seem reasonable if it's not.
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231375#comment-14231375 ] Valeriy Avanesov commented on SPARK-2426: - I'm not sure if I understand your question... As far as I can see, w_i stands for a row of the matrix w and h_j stands for a column of the matrix h. \sum_i \sum_j ( r_ij - w_i*h_j) -- is not a matrix norm. Probably, you either miss abs or square -- \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j ( r_ij - w_i*h_j)^2 It looks like l2 regularized stochastic matrix decomposition with respect to Frobenius (or l1) norm. But I don't understand why do you consider k optimization problems (do you? What does k \in {1 ... 25} stand for?). Anyway, l2 regularized stochastic matrix decomposition problem is defined as follows Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||) under non-negativeness and normalization constraints. ||..|| stands for Frobenius norm (or l1). By the way: is the matrix of ranks r stochastic? Stochastic matrix decomposition doesn't seem reasonable if it's not.
[jira] [Commented] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037659#comment-14037659 ] Valeriy Avanesov commented on SPARK-2199: - Here is the implementation we currently have: https://github.com/akopich/dplsa Robust and non-robust PLSA are implemented, but no regularizers are currently supported.

Distributed probabilistic latent semantic analysis in MLlib --- Key: SPARK-2199 URL: https://issues.apache.org/jira/browse/SPARK-2199 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Denis Turdakov Labels: features

Probabilistic latent semantic analysis (PLSA) is a topic model which extracts topics from a text corpus. PLSA was historically a predecessor of LDA. However, recent research shows that modifications of PLSA sometimes perform better than LDA [1]. Furthermore, the most recent paper by the same authors shows that there is a clear way to extend PLSA to LDA and beyond [2]. We should implement a distributed version of PLSA. In addition, it should be possible to easily add user-defined regularizers or combinations of them. We will implement regularizers that allow us to:
* extract sparse topics
* extract human-interpretable topics
* perform semi-supervised training
* sort out non-topic-specific terms.

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf
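For context, the core of (non-robust) PLSA is a simple EM iteration over the word-document count matrix; the regularizers of [2] then enter additively in the M-step. A compact NumPy sketch of one EM step (an illustration of the algorithm, independent of the linked repository):

```python
import numpy as np

def plsa_em_step(ndw, phi, theta):
    """One EM iteration of PLSA.
    ndw:   (D, W) word counts per document
    phi:   (W, T) p(w | t), columns sum to 1
    theta: (D, T) p(t | d), rows sum to 1
    E-step: p(t | d, w) is proportional to phi[w, t] * theta[d, t];
    M-step: re-estimate phi and theta from the expected counts."""
    pwd = theta @ phi.T                     # (D, W): p(w | d) under the model
    pwd = np.maximum(pwd, 1e-12)            # avoid division by zero
    ratio = ndw / pwd                       # (D, W): counts over model probs
    # Expected counts, accumulated per (word, topic) and (doc, topic).
    nwt = phi * (ratio.T @ theta)           # (W, T)
    ndt = theta * (ratio @ phi)             # (D, T)
    phi_new = nwt / nwt.sum(axis=0, keepdims=True)
    theta_new = ndt / ndt.sum(axis=1, keepdims=True)
    return phi_new, theta_new
```

Each iteration preserves the normalization of phi and theta and, being exact EM, never decreases the log-likelihood; an additive regularizer would simply be added to nwt and ndt before the final normalization.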