[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2017-08-11 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123803#comment-16123803 ]

Valeriy Avanesov commented on SPARK-5564:
-

I am considering working on this issue. The question is whether there should be 
another EMLDAOptimizerVorontsov, or whether the existing EMLDAOptimizer should be 
rewritten.



> Support sparse LDA solutions
> 
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
> concentration parameters be > 1.0.  It should support values > 0.0, which 
> should encourage sparser topics (phi) and document-topic distributions 
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov 
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization." 2014.
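For intuition, the projection described above can be sketched as follows. This is a minimal NumPy sketch, not the actual Spark implementation: the function name, the dense matrix representation, and the symmetric concentration parameter `beta` are all illustrative assumptions. It shows why concentrations below 1.0 encourage sparsity: the MAP update `count + beta - 1` can go negative, and clipping it at zero zeroes out weak entries.

```python
import numpy as np

def m_step_projection(counts, beta):
    # counts: expected topic-word counts N[k, w]; beta: Dirichlet concentration.
    # For beta < 1 the MAP numerator N + beta - 1 can be negative; clip at zero
    # (the projection step) and renormalize each topic row to a distribution.
    phi = np.maximum(counts + beta - 1.0, 0.0)
    norms = phi.sum(axis=1, keepdims=True)
    # Guard against a topic whose entries were all clipped to zero.
    norms[norms == 0.0] = 1.0
    return phi / norms
```

With `beta = 0.5`, small counts are clipped away entirely, yielding sparse rows; with `beta > 1` the update stays positive and rows remain dense.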



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-07-19 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632712#comment-14632712 ]

Apache Spark commented on SPARK-5564:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7507




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-07-19 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632715#comment-14632715 ]

Apache Spark commented on SPARK-5564:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7507




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-07-19 Thread Feynman Liang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632718#comment-14632718 ]

Feynman Liang commented on SPARK-5564:
--

Sorry, tagged the wrong JIRA. Ignore the above PR; I'm not currently working on 
this.




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-07-19 Thread Feynman Liang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632719#comment-14632719 ]

Feynman Liang commented on SPARK-5564:
--

[~josephkb] can you reset this status to open?




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-31 Thread Debasish Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388128#comment-14388128 ]

Debasish Das commented on SPARK-5564:
-

[~sparks] we are trying to access the EC2 dataset, but it is giving an error:

[ec2-user@ip-172-31-38-56 ~]$ aws s3 ls 
s3://files.sparks.requester.pays/enwiki_category_text/

A client error (AccessDenied) occurred when calling the ListObjects operation: 
Access Denied

Could you please check whether it is still available for use?
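A likely cause of the AccessDenied error, independent of whether the bucket still exists: it is a requester-pays bucket, and recent versions of the AWS CLI require the caller to explicitly agree to pay for the request. A hedged sketch, assuming the bucket path from the comment above is still valid:

```shell
# Requester-pays buckets reject plain requests with AccessDenied; the
# --request-payer flag tells S3 the caller accepts the data-transfer cost.
aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/ --request-payer requester
```

If this still fails, the bucket itself may have been removed, which is the question posed above.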

 Support sparse LDA solutions
 

 Key: SPARK-5564
 URL: https://issues.apache.org/jira/browse/SPARK-5564
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
 concentration parameters be  1.0.  It should support values  0.0, which 
 should encourage sparser topics (phi) and document-topic distributions 
 (theta).
 For EM, this will require adding a projection to the M-step, as in: Vorontsov 
 and Potapenko. Tutorial on Probabilistic Topic Modeling : Additive 
 Regularization for Stochastic Matrix Factorization. 2014.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387085#comment-14387085 ]

Joseph K. Bradley commented on SPARK-5564:
--

[~debasish83] I mainly used a Wikipedia dataset.  Here's an S3 bucket 
(requester pays) which [~sparks] created: 
[s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big 
Wikipedia dataset.  I'm not sure if it's the same one I used, but it should be 
qualitatively similar.  Mine had ~1.1 billion tokens, with about 1 million 
documents and 1 million terms (vocab size).

As for scaling, the EM code scaled linearly with the number of topics K.  
Communication was the bottleneck for sizable datasets, and it scales linearly 
with K.  The largest K I've run with on that dataset was K=100, using 
16 r3.2xlarge workers.
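A rough back-of-the-envelope for why communication scales linearly in K, using the numbers from the comment above (the bytes-per-entry figure and the assumption that a dense V x K term-topic count matrix is shuffled each iteration are illustrative, not taken from the Spark code):

```python
V = 1_000_000          # vocabulary size reported above
K = 100                # largest topic count reported above
bytes_per_entry = 8    # double-precision count, an assumption
# If EM aggregates a dense V x K term-topic count matrix each iteration,
# traffic per pass grows linearly with K.
matrix_bytes = V * K * bytes_per_entry
print(matrix_bytes / 1e9)  # → 0.8 (GB per full pass)
```

This is exactly the cost that a sparse representation (the point of this ticket) would reduce.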




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Debasish Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387180#comment-14387180 ]

Debasish Das commented on SPARK-5564:
-

Cool...I will run my experiments on the same dataset as well and report 
results...




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-30 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387232#comment-14387232 ]

Joseph K. Bradley commented on SPARK-5564:
--

No, I didn't use sparsity at all, so yours will be the first ones looking at 
reducing communication via sparsity (AFAIK).




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-29 Thread Debasish Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ]

Debasish Das commented on SPARK-5564:
-

[~josephkb] could you please point me to the datasets that are used for 
benchmarking? I have started testing log-likelihood loss for recommendation, and 
since I already added the constraints, this is the right time to test it on LDA 
benchmarks as well...I will open up the code as part of 
https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal team clears 
it...

I am looking into LDA test cases, but since I am optimizing the log-likelihood 
directly, I am looking to add more test cases from your LDA JIRA...For 
recommendation, I know how to construct the test cases...




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-02 Thread Joseph K. Bradley (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343476#comment-14343476 ]

Joseph K. Bradley commented on SPARK-5564:
--

It would be interesting to see comparisons between the two, but I don't have a 
good sense of which would be more efficient.

{quote} I am assuming here that LDA architecture is a bipartite graph with 
nodes as docs/words and there are counts on each edge {quote}
-- You're correct.
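The bipartite structure confirmed above can be sketched concretely. Plain Python, illustrative only: the real implementation stores this as a GraphX graph with document and term vertices and count-weighted edges.

```python
# Doc-term counts as edges of a bipartite graph: (doc_id, term_id, count).
edges = [(0, 7, 3), (0, 12, 1), (1, 7, 2)]

# The per-vertex aggregations EM needs are sums over incident edges.
doc_totals = {}
term_totals = {}
for doc, term, count in edges:
    doc_totals[doc] = doc_totals.get(doc, 0) + count
    term_totals[term] = term_totals.get(term, 0) + count
```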





[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342311#comment-14342311 ]

Debasish Das commented on SPARK-5564:
-

I am right now using the following PR to do large-rank matrix factorization 
with various constraints...I am not sure the current ALS will scale to large 
ranks, but I will be keen to compare against the exact formulation in the 
graphx-based LDA flow...

https://github.com/scalanlp/breeze/pull/364

The idea here is to solve the constrained factorization problem as explained in 
Vorontsov and Potapenko:

minimize f(w, h*)
s.t. 1'w = 1, w >= 0 (row constraints)

minimize f(w*, h)
s.t. 0 <= h <= 1, normalize each column in h

Here I want f(w, h) to be the MAP loss, but I already solved the least-squares 
variant in https://issues.apache.org/jira/browse/SPARK-2426 and got good 
improvement in MAP statistics...Here also I expect perplexity will improve...

If no one else is looking into it, I would like to compare the join-based 
factorization flow (ml.recommendation.ALS) with the graphx-based LDA flow...

In fact, if you think that for large ranks the LDA-based flow will be more 
efficient than the join-based factorization flow, I can implement stochastic 
matrix factorization directly on top of LDA and add both the least-squares and 
MAP losses...
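The row constraint 1'w = 1, w >= 0 is a Euclidean projection onto the probability simplex. A standard sort-based sketch of that projection is below; this is offered purely as an illustration of the constraint, not as the Breeze PR's actual solver:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(v)[::-1]                     # sort entries descending
    css = np.cumsum(u)
    # Largest index i (0-based) with (i+1) * u_i > css_i - 1.
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)   # shift that enforces sum = 1
    return np.maximum(v - theta, 0.0)        # clip enforces nonnegativity
```

Note that the projection naturally zeroes out small entries, which is exactly the sparsity this ticket is after.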




[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2015-03-01 Thread Debasish Das (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342312#comment-14342312 ]

Debasish Das commented on SPARK-5564:
-

By the way, the following step is an approximation to the real constraint, but if 
we get good results over Gibbs-sampling-based approaches, there are ways to 
solve the real problem as well...

minimize f(w*, h)
s.t. 0 <= h <= 1, normalize each column in h
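That approximate step, clip to the box [0, 1] and then rescale each column, can be sketched as follows (a NumPy illustration of the approximation described above, not the exact Euclidean projection onto the column simplices and not code from the PR):

```python
import numpy as np

def project_box_normalize(h):
    # Approximate projection: clip entries to [0, 1], then rescale each
    # column to sum to one. Cheaper than the exact simplex projection,
    # at the cost of not being a true Euclidean projection.
    h = np.clip(h, 0.0, 1.0)
    col = h.sum(axis=0, keepdims=True)
    col[col == 0.0] = 1.0   # avoid dividing a fully clipped column by zero
    return h / col
```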

