[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

EntilZha Mon, 19 Jan 2015 10:20:58 -0800

Github user EntilZha commented on the pull request:

    https://github.com/apache/spark/pull/4047#issuecomment-70537930
  
    I've had a good chance to look at PR while making changes to my own code. I 
really liked the Graph initialization code (especially the partition strategy), 
I was able to copy that and get a 2x boost almost across the board in 
computation phases.
    
    # Questions in PR:
    ## Naming (LDA vs LatentDirichletAllocation)
    I think that Linear Discriminant Analysis and Latent Dirichlet Allocation 
are different enough that it doesn't warrant making the class names 
significantly longer, so LDA is probably good. 
    
    ## Package Name
    I think a separate topicmodeling package is fitting, it makes it very clear 
that LDA is for topic modeling. The second motivation for a topicmodeling 
package is that it looks like there will be this EM implementation and soon the 
Gibbs version I am using, so both could fit in this namespace well. Along those 
same lines, perhaps keep LDAModel for the abstract class and for 
implementations use some way to denote whether it is Gibbs ( LDAGibbs, 
GibbsLDA, LDAWithGibbs) or EM based (LDAEM, EMLDA, LDAWithEM). Naming is hard...
    
    ## Unit Test
    On tests, I am for including a small training example, similar to here:
    
https://github.com/EntilZha/spark/blob/LDA/graphx/src/test/scala/org/apache/spark/graphx/lib/LDASuite.scala#L52-L76
    It has been very helpful to have a sanity check test as I have been working 
on potentially breaking changes. In that example, the data set is 9 lines, so 
it would probably be better to refactor and place it in an Array[String] 
defined in code instead of an external file.
    
    Lastly, it looks like the abstract classes/traits haven't made it in yet to 
have multiple implementations satisfy a common LDA API. I have taken a careful 
look at the design doc and your code, and am fairly confident that I have a way 
to do this with minimal code changes. I will post something once I refactor 
what I have to satisfy it, and post the proposed changes that would need to be 
made. The general sketch is to make LDA a trait, LearningState a trait, and 
have implementations have a signature something like object LDAGibbs extends 
LDA and provide a LearningState implementation. More soon.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

Reply via email to