I have been testing on the 20 Newsgroups dataset, which the Spark docs themselves reference. I can confirm that log perplexity increases and log likelihood decreases as the number of topics increases, and I am similarly confused by these results.
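
In case it helps with interpreting the two numbers: as far as I can tell from the Spark source, logPerplexity() is just the negated logLikelihood() bound divided by the number of tokens in the corpus being scored, so the two curves should mirror each other by construction, and lower perplexity is better. Below is a minimal plain-Python sketch of that relationship. It is not Spark code: the helper names (log_perplexity, best_k) are hypothetical, and the numbers are made-up placeholders, not results from 20 Newsgroups.

```python
def log_perplexity(log_likelihood, token_count):
    """Per-token bound: -logLikelihood / token count.

    Assumption: this mirrors how Spark derives logPerplexity() from
    logLikelihood(); check the Spark source for your version.
    """
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return -log_likelihood / token_count


def best_k(log_perplexity_by_k):
    """Return the topic count with the lowest log perplexity."""
    return min(log_perplexity_by_k, key=log_perplexity_by_k.get)


# Made-up placeholder values, only to exercise the helpers; NOT real results.
token_count = 1000
log_likelihood_by_k = {5: -7200.0, 10: -7000.0, 20: -7100.0}

# Convert each corpus-level log likelihood into a per-token log perplexity.
scores = {
    k: log_perplexity(ll, token_count)
    for k, ll in log_likelihood_by_k.items()
}
chosen = best_k(scores)
print(scores, chosen)
```

If that reading is right, "perplexity increases" and "likelihood decreases" are a single observation rather than two independent ones, which still leaves open why the bound gets worse as the number of topics grows.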

2017-09-28 10:50 GMT-07:00 Cody Buntain <cbunt...@cs.umd.edu>:
> Hi, all!
>
> Is there an example somewhere on using LDA’s logPerplexity()/logLikelihood()
> functions to evaluate topic counts? The existing MLLib LDA examples show
> calling them, but I can’t find any documentation about how to interpret the
> outputs. Graphing the outputs for logs of perplexity and likelihood aren’t
> consistent with what I expected (perplexity increases and likelihood
> decreases as topics increase, which seem odd to me).
>
> An example of what I’m doing is here:
> http://www.cs.umd.edu/~cbuntain/FindTopicK-pyspark-regex.html
>
> Thanks very much in advance! If I can figure this out, I can post example
> code online, so others can see how this process is done.
>
> -Best regards,
> Cody
> _________________
> Cody Buntain, PhD
> Postdoc, @UMD_CS
> Intelligence Community Postdoctoral Fellow
> cbunt...@cs.umd.edu
> www.cs.umd.edu/~cbuntain