[ https://issues.apache.org/jira/browse/SPARK-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576371#comment-14576371 ]
Joseph K. Bradley commented on SPARK-5567:
------------------------------------------

Keeping the shared code in an object method sounds reasonable to me.

Inference should be significantly easier with the topics fixed. But it is unclear how we "should" do inference and what type of prediction we should return. Does this sound reasonable:
* With the topics fixed and the docConcentration parameter(s) fixed, it should be straightforward to compute the MAP prediction for topicDistributions (since we can sum out the token-topic assignments, as is done in collapsed Gibbs sampling).
* People probably just want the MAP predictions, right? I'm assuming they would not want more details about the distribution beyond the mode.
* With this setup, we could share the prediction code between all LDA models.

> Add prediction methods to LDA
> -----------------------------
>
>                 Key: SPARK-5567
>                 URL: https://issues.apache.org/jira/browse/SPARK-5567
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> LDA currently supports prediction on the training set. E.g., you can call
> logLikelihood and topicDistributions to get that info for the training data.
> However, it should support the same functionality for new (test) documents.
> This will require inference but should be able to use the same code, with a
> few modifications to keep the inferred topics fixed.
> Note: The API for these methods is already in the code but is commented out.
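
To illustrate the MAP prediction proposed in the comment above, here is a minimal sketch in plain Python. It is not Spark's implementation. It assumes the fixed topics are given as a list of per-topic word distributions and that docConcentration is a list `alpha` with one entry per topic. The function name `map_topic_distribution` and its parameters are hypothetical. With the token-topic assignments summed out, the posterior over a document's topic mixture is a Dirichlet prior times a product of per-token mixtures, and a simple EM fixed-point iteration climbs to a mode of it. The interior mode is only well defined when every `alpha` entry is at least 1; for smaller values the sketch clamps negative weights to zero, which is a heuristic rather than an exact MAP solution.

```python
def map_topic_distribution(word_counts, topics, alpha, iters=100, tol=1e-9):
    """Estimate the MAP topic mixture of one document with the topics held fixed.

    word_counts: dict mapping word id -> count in the document
    topics:      list of K lists; topics[k][w] = p(word w | topic k)
    alpha:       list of K Dirichlet concentrations (assumed docConcentration)
    """
    k = len(topics)
    theta = [1.0 / k] * k  # start from the uniform mixture
    for _ in range(iters):
        # E-step: expected number of tokens assigned to each topic,
        # i.e. the token-topic assignments are summed out.
        expected = [0.0] * k
        for w, count in word_counts.items():
            weights = [theta[j] * topics[j][w] for j in range(k)]
            total = sum(weights)
            if total == 0.0:
                continue  # word has zero probability under the current mixture
            for j in range(k):
                expected[j] += count * weights[j] / total
        # M-step: mode of the Dirichlet with parameters expected + alpha.
        # Clamping at zero is a heuristic for alpha entries below 1.
        unnorm = [max(expected[j] + alpha[j] - 1.0, 0.0) for j in range(k)]
        norm = sum(unnorm)
        if norm == 0.0:
            break  # no usable mass; keep the previous estimate
        new_theta = [u / norm for u in unnorm]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta
```

Because the routine needs only the topic matrix and `alpha`, the same code could in principle serve any LDA model that exposes those two quantities, which matches the sharing idea in the last bullet of the comment. For an empty document the result reduces to the mode of the prior.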