[jira] [Updated] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

Denis Turdakov (JIRA) Thu, 19 Jun 2014 06:58:19 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Denis Turdakov updated SPARK-2199:
----------------------------------

    Description: 
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA. However, 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that there 
is a clear way to extend PLSA to LDA and beyond [2].

We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to:
* extract sparse topics
* extract human interpretable topics 
* perform semi-supervised training 
* sort out non-topic specific terms. 
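
To make the proposal concrete, here is a minimal single-machine sketch of PLSA 
trained with EM, plus one pluggable regularizer. It is an illustration under 
stated assumptions, not the proposed MLlib implementation: the names plsa, 
normalize and the sparsity parameter are hypothetical, and the update 
max(n_wt - sparsity, 0) is a simplified sparsing rule in the spirit of the 
additive regularization described in [2]. The per-document loop is the part 
that a distributed version would presumably run per partition, summing the 
word-topic counts across workers.

```python
import random


def normalize(xs):
    """Scale non-negative values to sum to 1; fall back to uniform if all are zero."""
    s = sum(xs)
    return [x / s for x in xs] if s > 0 else [1.0 / len(xs)] * len(xs)


def plsa(docs, num_topics, vocab_size, iters=50, sparsity=0.0, seed=0):
    """Fit PLSA with EM.

    docs: list of {word_id: count} dicts, one per document.
    sparsity: hypothetical regularizer strength; 0.0 gives plain PLSA.
    Returns (phi, theta) where phi[t][w] ~ p(w|t) and theta[d][t] ~ p(t|d).
    """
    rng = random.Random(seed)
    phi = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
           for _ in range(num_topics)]
    theta = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]

    for _ in range(iters):
        n_wt = [[0.0] * vocab_size for _ in range(num_topics)]  # expected word-topic counts
        n_td = [[0.0] * num_topics for _ in docs]               # expected topic-document counts

        # E-step: each document is processed independently of the others.
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                post = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(post)
                if z == 0.0:
                    continue  # no topic can currently explain this word
                for t in range(num_topics):
                    r = count * post[t] / z  # count * p(t | d, w)
                    n_wt[t][w] += r
                    n_td[d][t] += r

        # M-step: subtract the sparsity term (assumed regularizer), clip at zero, renormalize.
        phi = [normalize([max(c - sparsity, 0.0) for c in row]) for row in n_wt]
        theta = [normalize(row) for row in n_td]

    return phi, theta


# Toy corpus: documents 0 and 2 use words 0-1, documents 1 and 3 use words 2-3.
docs = [{0: 5, 1: 3}, {2: 4, 3: 6}, {0: 2, 1: 2}, {2: 1, 3: 1}]
phi, theta = plsa(docs, num_topics=2, vocab_size=4, iters=30, sparsity=0.1)
```

Keeping the regularizer as a single adjustment in the M-step is what would let 
user-defined regularizers, or sums of several, be swapped in without touching 
the E-step.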

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 


  was:
Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
topics from a text corpus. PLSA was historically a predecessor of LDA. However, 
recent research shows that modifications of PLSA sometimes perform better than 
LDA [1]. Furthermore, the most recent paper by the same authors shows that there 
is a clear way to extend PLSA to LDA and beyond [2].
We should implement distributed versions of PLSA. In addition, it should be 
possible to easily add user-defined regularizers or combinations of them. We 
will implement regularizers that allow us to:
•       extract sparse topics
•       extract human interpretable topics 
•       perform semi-supervised training 
•       sort out non-topic specific terms. 

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
Regularization for Stochastic Matrix Factorization. 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



> Distributed probabilistic latent semantic analysis in MLlib
> -----------------------------------------------------------
>
>                 Key: SPARK-2199
>                 URL: https://issues.apache.org/jira/browse/SPARK-2199
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Denis Turdakov
>              Labels: features
>
> Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
> topics from a text corpus. PLSA was historically a predecessor of LDA. However, 
> recent research shows that modifications of PLSA sometimes perform better 
> than LDA [1]. Furthermore, the most recent paper by the same authors shows that 
> there is a clear way to extend PLSA to LDA and beyond [2].
>
> We should implement distributed versions of PLSA. In addition, it should be 
> possible to easily add user-defined regularizers or combinations of them. We 
> will implement regularizers that allow us to:
> * extract sparse topics
> * extract human interpretable topics 
> * perform semi-supervised training 
> * sort out non-topic specific terms. 
> [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
> Proceedings of ECIR'13.
> [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization. 
> http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
