[GitHub] spark pull request: [SPARK-5563][mllib] online lda initial checkin

jkbradley Mon, 23 Feb 2015 11:57:54 -0800

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-75618893
  
    @hhbyyh Thanks for the initial PR!  Here are some high-level comments:
    
    * RDD.sliding(): This may not take much advantage of parallelism.  It
      slides across the RDD partition by partition, meaning that only 1 (or a
      few) workers will be active on each iteration.  For the batch (RDD)
      setting, I wonder if it would be better to sample a mini-batch on each
      iteration.  That would mean stochastic gradient descent, and it would
      hopefully be faster, given how expensive computing the gradient is.  We
      would need to test on an actual cluster to know for sure.
    
    * local vs. distributed models: The EM implementation supports very large
      vocabularies, where the topic distributions have to be distributed (the
      "term" vertices in the Graph).  It would be nice if the online LDA could
      support that too.  (I have heard of many use cases involving k and
      vocabSize large enough that the model would take many GB to store.)
      However, I realize that storing the model (topics) locally is helpful
      for efficiency if the model is small enough.  Could you please sketch
      out how we might maintain a distributed model and the costs of doing
      that?
    
    * Returning DistributedLDAModel vs. LDAModel: It's true that online LDA
      should not return the current DistributedLDAModel since
      DistributedLDAModel has methods for returning info about the full
      training dataset.  That makes me wonder if we should have a different
      algorithm API for online LDA (OnlineLDA alongside LDA).  Does that sound
      reasonable?
    
    * code readability (though I know this is a WIP PR right now)
      * More comments and clearer organization in the core optimization part
        of the code would help reviewers understand it.
      * Relatedly, it would help to separate out the optimization steps
        (computing the gradient, computing the regularization, making the
        update, etc.).  MLlib's optimization framework is probably not
        suitable for you to use yet, but hopefully it will be in the future
        (after this PR).  Separating the parts now will help with those
        future changes.
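
To illustrate the parallelism concern in the RDD.sliding() comment above, here is a small standalone Python sketch (not Spark code and not the PR's implementation).  It assumes documents are laid out contiguously across partitions, and all sizes and function names are hypothetical.  It counts how many partitions each mini-batch touches when the batch is a sliding window versus a uniform random sample; in Spark, the sampling variant would presumably be built on something like RDD.sample().

```python
import random

def active_partitions(batch, docs_per_partition):
    """Return the set of partition ids that hold at least one doc in the batch."""
    return {doc_index // docs_per_partition for doc_index in batch}

def sliding_batches(num_docs, batch_size):
    """Yield consecutive, non-overlapping windows of document indices."""
    for start in range(0, num_docs, batch_size):
        yield range(start, min(start + batch_size, num_docs))

def sampled_batch(num_docs, batch_size, rng):
    """Draw a uniform random mini-batch of document indices (no replacement)."""
    return rng.sample(range(num_docs), batch_size)

# Hypothetical sizes, chosen only for illustration.
num_docs = 10_000
num_partitions = 20
docs_per_partition = num_docs // num_partitions  # 500 docs per partition
batch_size = 250

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Number of partitions touched by each sliding window.
sliding_counts = [
    len(active_partitions(batch, docs_per_partition))
    for batch in sliding_batches(num_docs, batch_size)
]

# Number of partitions touched by each of 40 random mini-batches.
sampled_counts = [
    len(active_partitions(sampled_batch(num_docs, batch_size, rng),
                          docs_per_partition))
    for _ in range(40)
]

print("sliding: max partitions per batch =", max(sliding_counts))
print("sampled: min partitions per batch =", min(sampled_counts))
```

With these sizes every sliding window falls inside a single partition, so only one worker would have data to process per iteration, while a random mini-batch is spread over many partitions.  Whether that translates into a real speedup would still need testing on an actual cluster.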


