Hi, I am trying to do topic modeling in Spark using its LDA implementation (Spark 2.0.2, pyspark API).
I ran the following:

```python
from pyspark.ml.clustering import LDA

lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")
ldaModel = lda.fit(tf_df)
lda_df = ldaModel.transform(tf_df)
```

I went through the docs to understand the output Spark generates for LDA. I understand the `ldaModel.describeTopics()` method: it gives each topic as a list of terms with their weights.

But I am not sure I understand `ldaModel.topicsMatrix()`. The docs say it is the distribution of words for each topic (in my case, 1184 words as rows and 10 topics as columns, with the values in those cells). But those values are not probabilities, which is what one would expect for a topic-word distribution: many of them are larger than 1 (132.76, 3.00, and so on). Any idea what these values are? Thanks.
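To illustrate what I would expect, here is a minimal sketch in plain NumPy (not Spark; the numbers are made up, not my actual output) of how I assumed a topic-word matrix like this could be turned into per-topic probabilities by normalizing each topic column:

```python
import numpy as np

# Stand-in for topicsMatrix(): a vocabSize x k matrix of
# unnormalized topic-word weights (4 words, 2 topics here).
topics = np.array([[132.76, 3.00],
                   [10.00,  1.00],
                   [5.00,   4.00],
                   [2.24,   2.00]])

# Dividing each column by its sum would give a proper
# topic-word distribution (each topic's column sums to 1).
topic_word_probs = topics / topics.sum(axis=0)

print(topic_word_probs.sum(axis=0))  # each column sums to 1.0
```

Is that the intended interpretation of the raw values, i.e. are they expected counts that need to be normalized like this?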