Hi! I already have a StackOverflow question on this (see here <https://stackoverflow.com/questions/33072449/extract-document-topic-matrix-from-pyspark-lda-model> ), but haven't received any responses, so I thought I'd try here!
Long story short: I'm working in PySpark and have successfully trained an LDA topic model, but I can't figure out how (or whether) I can extract the topic distributions for each document from the model. I understand the LDA functionality is still in development, but getting per-document topic distributions is arguably the principal use case here, and as far as I can tell it isn't implemented in the Python API.

I can easily get the *word*-topic distribution by calling model.topicsMatrix(), but that isn't what I need, and there don't seem to be any other useful methods in the Python LDA model class. The only glimmer of hope came from finding the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (though I'm not 100% sure the LDAModel in PySpark is in fact a DistributedLDAModel under the hood...).

In any case, I am able to call this method indirectly, without any overt failures:

    In [127]: model.call('topicDistributions')
    Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are strings telling me that each element is actually a Scala tuple (I think):

    In [128]: model.call('topicDistributions').take(5)
    Out[128]:
    [{u'__class__': u'scala.Tuple2'},
     {u'__class__': u'scala.Tuple2'},
     {u'__class__': u'scala.Tuple2'},
     {u'__class__': u'scala.Tuple2'},
     {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get at the actual results? Thanks in advance for any guidance you can offer!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Get-document-topic-distribution-from-PySpark-LDA-model-tp25063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.