Re: Spark EM LDA Optimizer support

Edwards, Brandon Fri, 21 Jul 2017 12:05:37 -0700

Yes it makes sense that the results would differ if you ‘scored’ data on a 
distributed model and then later tried to rescore with the saved model (that 
was necessarily converted to local).


My opinion is that we should only support use cases where you carry over a 
model that was trained on one batch of data to be the starting point for the 
training on the next batch. Let me know if anyone disagrees with that.

Based on the above opinion, my concern is this. If someone wanted to use the EM 
optimizer, they could only use it on the first batch. From then on, they are 
loading saved local models in which case you can only continue with Online 
optimization. This is an extra complication for configuration. I mean, saying 
you wanted EM as your optimizer in the config would mean only that the first 
run was done that way.
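
To make the concern concrete, here is a minimal sketch of what the configured 
optimizer would effectively mean per run. All names are hypothetical, and it 
assumes, as described above, that a saved model can only be continued with 
Online:

```scala
// Hypothetical names for illustration; none of this is existing Spot code.
object OptimizerChoice {
  sealed trait Optimizer
  case object EM extends Optimizer
  case object Online extends Optimizer

  /**
    * Returns the optimizer a run would actually use.
    * Assumption (from the discussion above): a model loaded from disk is a
    * local model, and training can only continue from it with Online.
    */
  def effective(configured: Optimizer, resumingFromSavedModel: Boolean): Optimizer =
    if (resumingFromSavedModel) Online else configured
}
```

So a config that says EM only yields EM when `resumingFromSavedModel` is 
false, i.e. on the first batch.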

On the other hand, I guess it’s possible that starting with a large batch using 
EM would perform better in the long run than starting with Online? We could 
look into this, but our tests so far have shown Online to be on par with EM, 
if I recall correctly.

On 7/21/17, 11:44 AM, "Barona, Ricardo" <[email protected]> wrote:

    Once a saved model is loaded, it needs to be converted to a LocalLDAModel 
if it’s a DistributedLDAModel. From what I have heard, though, what matters 
about the optimizer used for training (EM or Online) is the topics matrix that 
each one generates. I’m not exactly an expert, but I’d think the two matrices 
are going to be different, right? The topics matrix of a LocalLDAModel coming 
from a DistributedLDAModel will remain the same, and topic distributions will 
be calculated based on that.
    
    On 7/21/17, 1:26 PM, "Edwards, Brandon" <[email protected]> wrote:
    
        A question just came up for me. Is there a true use case for utilizing 
EM that allows one to carry context from previous models into the future? It 
seems that once you save to a local model in order to utilize it for future 
data, from then on you can only use the Online optimizer. If this is correct, I 
vote for getting rid of EM. I don’t see value in supporting a use case that 
does not carry context into future models.
        
        On 7/21/17, 11:08 AM, "Barona, Ricardo" <[email protected]> 
wrote:
        
            During the last 9 days, I've been working on modifying the Apache 
Spot LDA wrapper to make it possible to save models, load existing models, and 
then get topic distributions for the same corpus or for new documents (see 
https://issues.apache.org/jira/browse/SPOT-196). Until now, the Apache Spot ML 
module has been running in batch mode, training and getting topic distributions 
on the same documents, but that needs to change soon as we work toward near 
real time.
            
            Earlier this year, Apache Spot enabled the Online optimizer, so 
users can select whether to run LDA using EM or Online; EM was the first option 
we implemented, and then we decided it was a good idea to offer Online as well.
            
            In order to keep supporting both the EM and Online optimizers, 
I modified the code in such a way that you can train with either one but only 
get topic distributions with LocalLDAModel. The reason is that only 
LocalLDAModel supports getting topic distributions for new documents. The 
problem with that approach is that a very simple unit test we have is now 
failing. This is because when I convert DistributedLDAModel to LocalLDAModel, 
the document concentration parameter remains the same as it was originally 
provided for EM, but that value doesn't necessarily work for the 
LocalLDAModel.topicDistributions method.
            
            Take a look at 
https://issues.apache.org/jira/secure/attachment/12878382/everythingOK.png. 
There you can see the expected result from training and getting topic 
distributions with EM only or Online only on a data set of two documents with 
one word each.
            
            Then, here is the problem I explained before about converting 
DistributedLDAModel to LocalLDAModel: 
https://issues.apache.org/jira/secure/attachment/12878381/notSoOk.png
            
            A possible solution for this is to use the following code to 
implement a custom function to convert DistributedLDAModel to LocalLDAModel 
(see 
https://issues.apache.org/jira/secure/attachment/12878380/possibleSolution.png 
and the code below):
            
            // Declared inside Spark's package because the LocalLDAModel
            // constructor is private[spark].
            package org.apache.spark.mllib.clustering
            
            import org.apache.spark.mllib.linalg.{Matrix, Vector}
            
            object SpotLDA {
              /**
                * Creates a new LocalLDAModel, allowing alpha and beta to be
                * reset (although we just need alpha).
                * @param topicsMatrix Distributed LDA Model topicsMatrix
                * @param alpha New value for alpha, e.g. if the model was
                *              trained with 1.002 for alpha using the EM
                *              optimizer, this method allows you to reset alpha
                *              to something like 0.0009 and get topic
                *              distributions with the desired document
                *              concentration.
                * @param beta New value for beta
                * @return LocalLDAModel
                */
              def toLocal(topicsMatrix: Matrix, alpha: Vector, beta: Double): LocalLDAModel = {
                new LocalLDAModel(topicsMatrix, alpha, beta)
              }
            }
            
            The only disadvantage I see here is that users will need to provide 
3 parameters if they are using the EM optimizer instead of only 2:
            
            - EM alpha
            - EM beta
            - Online alpha
            
            Or provide only 2 parameters if they prefer to work with the Online 
optimizer only:
            
            - Online alpha
            - Online beta
            
            Discussing this with Gustavo, he suggested we even set a “default” 
number for Online alpha so if users only configure EM alpha and EM beta the 
application will keep working.
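
A minimal sketch of how that fallback could look. The names `LdaParams`, 
`Settings` and `scoringAlpha` are hypothetical, and the default value is left 
as an input because no concrete number has been agreed on:

```scala
// Hypothetical names for illustration; none of this is existing Spot code.
object LdaParams {
  /** Values as a user might configure them; None means "not configured". */
  final case class Settings(
      emAlpha: Option[Double],
      emBeta: Option[Double],
      onlineAlpha: Option[Double],
      onlineBeta: Option[Double])

  /**
    * Document concentration to use when getting topic distributions with a
    * LocalLDAModel. Falls back to the supplied default when Online alpha is
    * not configured.
    */
  def scoringAlpha(settings: Settings, defaultOnlineAlpha: Double): Double =
    settings.onlineAlpha.getOrElse(defaultOnlineAlpha)
}
```

With only EM alpha and EM beta configured, `scoringAlpha` returns the default; 
once Online alpha is set, that value takes precedence.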
            
            That being said, here is the big question I’d like to ask: 
should we keep supporting both the EM and Online optimizers and have users 
configure the required parameters, or do you think it is time to let EM go 
and just keep the Online optimizer?
            
            My vote is to keep both, but let me know what you think.
            
            Thanks,
            Ricardo Barona
