GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4419#issuecomment-75618893
@hhbyyh Thanks for the initial PR! Here are some high-level comments:
* RDD.sliding(): This may not take much advantage of parallelism. It
slides across the RDD partition by partition, meaning that only 1 (or a few)
workers will be active on each iteration. For the batch (RDD) setting, I
wonder if it would be better to sample instead. That would mean stochastic
gradient descent, and it would hopefully be faster, given how expensive
computing the gradient is. We would need some testing on an actual cluster
to know for sure.
* local vs. distributed models: The EM implementation supports very large
vocabularies, where the topic distributions have to be distributed (the "term"
vertices in the Graph). It would be nice if the online LDA could support that
too. (I have heard of many use cases involving k and vocabSize large enough
that the model would take many GB to store.) However, I realize that storing
the model (topics) locally is helpful for efficiency if the model is small
enough. Could you please sketch out how we might maintain a distributed model
and the costs of doing that?
* Returning DistributedLDAModel vs. LDAModel: It's true that online LDA
should not return the current DistributedLDAModel since DistributedLDAModel has
methods for returning info about the full training dataset. That makes me
wonder if we should have a different algorithm API for online LDA (OnlineLDA
alongside LDA). Does that sound reasonable?
* Code readability (though I know this is a WIP PR right now):
  * It will be helpful to have more comments and organization in the core
    optimization part of the code so that reviewers can understand it.
  * Relatedly, it will be helpful to separate out the optimization steps
    (computing the gradient, computing the regularization, making the update,
    etc.). The optimization framework in MLlib is probably not suitable for
    you to use yet, but hopefully it will be in the future (after this PR).
    Separating the parts will help with those future changes.
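
To make the RDD.sliding() concern above concrete, here is a minimal sketch in plain Python (not Spark code). It models an RDD as a list of partitions and counts how many partitions contribute documents to each mini-batch under two strategies: consecutive windows in partition order, and an independent Bernoulli sample within every partition. The helper names (`make_partitions`, `sliding_batches`, `sampled_batch`, `active_partitions`) and the toy sizes are assumptions made for illustration; they are not taken from the PR or from the Spark API, and real cluster behavior would still need to be measured.

```python
import random


def make_partitions(num_docs, num_partitions):
    """Split document ids 0..num_docs-1 into contiguous, roughly equal partitions."""
    size = -(-num_docs // num_partitions)  # ceiling division
    return [list(range(start, min(start + size, num_docs)))
            for start in range(0, num_docs, size)]


def sliding_batches(partitions, batch_size):
    """Yield consecutive windows over the documents, in partition order."""
    flat = [(p, doc) for p, part in enumerate(partitions) for doc in part]
    for start in range(0, len(flat), batch_size):
        yield flat[start:start + batch_size]


def sampled_batch(partitions, fraction, rng):
    """Keep each document with probability `fraction`, in every partition."""
    return [(p, doc) for p, part in enumerate(partitions)
            for doc in part if rng.random() < fraction]


def active_partitions(batch):
    """Return the set of partition indices that contribute to a batch."""
    return {p for p, _ in batch}


# Toy setup (hypothetical sizes): 8000 documents over 8 partitions.
num_docs = 8000
partitions = make_partitions(num_docs=num_docs, num_partitions=8)
batch_size = 500

# Strategy 1: consecutive windows, analogous to sliding over the data in order.
sliding_active = [len(active_partitions(batch))
                  for batch in sliding_batches(partitions, batch_size)]

# Strategy 2: sample a mini-batch of about the same expected size.
rng = random.Random(0)
sampled = sampled_batch(partitions, batch_size / num_docs, rng)
sampled_active = len(active_partitions(sampled))

print("sliding: max partitions touched per batch =", max(sliding_active))
print("sampled: partitions touched =", sampled_active)
```

With these toy sizes every sliding window fits inside a single partition, so at most one simulated worker has data to process per iteration, whereas the sampled mini-batch is expected to draw from all partitions. The sample size varies from draw to draw, which is the usual trade-off of Bernoulli sampling.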