Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70537930 I've had a good chance to look at PR while making changes to my own code. I really liked the Graph initialization code (especially the partition strategy), I was able to copy that and get a 2x boost almost across the board in computation phases. # Questions in PR: ## Naming (LDA vs LatentDirichletAllocation) I think that Linear Discriminant Analysis and Latent Dirichlet Allocation are different enough that it doesn't warrant making the class names significantly longer, so LDA is probably good. ## Package Name I think a separate topicmodeling package is fitting, it makes it very clear that LDA is for topic modeling. The second motivation for a topicmodeling package is that it looks like there will be this EM implementation and soon the Gibbs version I am using, so both could fit in this namespace well. Along those same lines, perhaps keep LDAModel for the abstract class and for implementations use some way to denote whether it is Gibbs ( LDAGibbs, GibbsLDA, LDAWithGibbs) or EM based (LDAEM, EMLDA, LDAWithEM). Naming is hard... ## Unit Test On tests, I am for including a small training example, similar to here: https://github.com/EntilZha/spark/blob/LDA/graphx/src/test/scala/org/apache/spark/graphx/lib/LDASuite.scala#L52-L76 It has been very helpful to have a sanity check test as I have been working on potentially breaking changes. In that example, the data set is 9 lines, so it would probably be better to refactor and place it in an Array[String] defined in code instead of an external file. Lastly, it looks like the abstract classes/traits haven't made it in yet to have multiple implementations satisfy a common LDA API. I have taken a careful look at the design doc and your code, and am fairly confident that I have a way to do this with minimal code changes. I will post something once I refactor what I have to satisfy it, and post the proposed changes that would need to be made. The general sketch is to make LDA a trait, LearningState a trait, and have implementations have a signature something like object LDAGibbs extends LDA and provide a LearningState implementation. More soon.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org