[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Turdakov updated SPARK-2199:
----------------------------------
    Description:
Probabilistic latent semantic analysis (PLSA) is a topic model that extracts topics from a text corpus. PLSA was historically a predecessor of LDA. However, recent research shows that modifications of PLSA sometimes perform better than LDA [1]. Furthermore, the most recent paper by the same authors shows that there is a clear way to extend PLSA to LDA and beyond [2].

We should implement distributed versions of PLSA. In addition, it should be easy to add user-defined regularizers or combinations of them. We will implement regularizers that make it possible to:
* extract sparse topics
* extract human-interpretable topics
* perform semi-supervised training
* sort out non-topic-specific terms.

[1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In Proceedings of ECIR'13.
[2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization. http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

> Distributed probabilistic latent semantic analysis in MLlib
> -----------------------------------------------------------
>
>                 Key: SPARK-2199
>                 URL: https://issues.apache.org/jira/browse/SPARK-2199
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Denis Turdakov
>              Labels: features

--
This message was sent by Atlassian JIRA
(v6.2#6252)
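To make the "user-defined regularizers" requirement more concrete, here is a rough single-machine sketch of PLSA trained with EM, where the M-step accepts a list of pluggable regularizers. It is not distributed and is not an MLlib API: the names `fit_plsa`, `sparsing` and `normalize` are hypothetical, and the M-step form (expected counts plus a per-regularizer correction, clipped at zero and renormalized) is my reading of the additive regularization approach in [2], so treat it as an assumption rather than the proposed implementation.

```python
import random


def normalize(values):
    """Clip negatives to zero, then rescale so the entries sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0.0:
        # Assumption of this sketch: fall back to uniform for an empty row.
        return [1.0 / len(values)] * len(values)
    return [v / total for v in clipped]


def sparsing(tau):
    """Hypothetical regularizer: a constant correction of -tau per (topic, word)."""
    def correction(counts, phi):
        return [[-tau] * len(row) for row in counts]
    return correction


def fit_plsa(docs, num_words, num_topics, regularizers=(), iters=30, seed=0):
    """docs is a list of {word_id: count} dicts.

    Returns (phi, theta) with phi[t][w] = p(w | t) and theta[d][t] = p(t | d).
    Each regularizer is a callable (counts, phi) -> correction matrix that is
    added to the expected topic-word counts before normalization.
    """
    rng = random.Random(seed)
    phi = [normalize([rng.random() for _ in range(num_words)])
           for _ in range(num_topics)]
    theta = [normalize([rng.random() for _ in range(num_topics)])
             for _ in docs]
    for _ in range(iters):
        n_wt = [[0.0] * num_words for _ in range(num_topics)]
        n_td = [[0.0] * num_topics for _ in docs]
        # E-step: split each observed count across topics by p(t | d, w).
        for d, doc in enumerate(docs):
            for w, n_dw in doc.items():
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue
                for t in range(num_topics):
                    share = n_dw * joint[t] / z
                    n_wt[t][w] += share
                    n_td[d][t] += share
        # M-step: add all corrections (computed from the same unmodified
        # counts), then clip at zero and renormalize.
        adjusted = [row[:] for row in n_wt]
        for reg in regularizers:
            corr = reg(n_wt, phi)
            for t in range(num_topics):
                for w in range(num_words):
                    adjusted[t][w] += corr[t][w]
        phi = [normalize(row) for row in adjusted]
        theta = [normalize(row) for row in n_td]
    return phi, theta


# Toy corpus: two documents over words 0-1 and two over words 2-3.
docs = [{0: 5, 1: 5}, {0: 4, 1: 6}, {2: 5, 3: 5}, {2: 6, 3: 4}]
phi, theta = fit_plsa(docs, num_words=4, num_topics=2,
                      regularizers=[sparsing(0.5)])
```

Passing an empty `regularizers` list gives plain PLSA, and passing several callables combines them by summing their corrections. In a distributed version the E-step accumulation over documents is presumably the part that would be parallelized, with the regularized M-step applied to the aggregated counts.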