[ https://issues.apache.org/jira/browse/MADLIB-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873668#comment-16873668 ]

Himanshu Pandey commented on MADLIB-1351:
-----------------------------------------

[~fmcquillan]

As per the discussion, I am suggesting the following new interface:
{code:java}
lda_train( data_table,
           model_table,
           output_data_table,
           voc_size,
           topic_num,
           iter_num,
           alpha,
           beta,
           evaluate_every,
           perplexity_tol
         )
{code}
Where
{code:java}
evaluate_every : INT, optional. Default = 0. How often to evaluate perplexity,
in iterations. To calculate perplexity, this parameter should be set to a
value greater than 0.
{code}
{code:java}
perplexity_tol : FLOAT8, optional. Default = 0.1. Perplexity tolerance to
stop iterating. It is used only when evaluate_every > 0.
{code}
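A call with the two new parameters might look like this (table names and values are illustrative only, assuming the existing madlib.lda_train calling convention):
{code:sql}
SELECT madlib.lda_train( 'documents_tf',     -- data_table
                         'lda_model',        -- model_table
                         'lda_output_data',  -- output_data_table
                         104,                -- voc_size
                         5,                  -- topic_num
                         50,                 -- iter_num (now an upper bound)
                         5,                  -- alpha
                         0.01,               -- beta
                         2,                  -- evaluate_every: check perplexity every 2 iterations
                         0.1                 -- perplexity_tol
                       );
{code}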
And the newly suggested interface for the output model table, including perplexity:
{code:java}
voc_size:       INTEGER. Size of the vocabulary. As mentioned above for the
                input table, wordid consists of contiguous integers going
                from 0 to voc_size - 1.
topic_num:      INTEGER. Number of topics.
alpha:          DOUBLE PRECISION. Dirichlet prior for the per-document topic
                multinomial.
beta:           DOUBLE PRECISION. Dirichlet prior for the per-topic word
                multinomial.
model:          BIGINT[]. The encoded model description (not human readable).
perplexity:     DOUBLE PRECISION. Calculated perplexity for every model
                generated.
{code}
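The stopping rule these two parameters imply could be sketched as follows (illustrative Python pseudocode of the training loop; run_iteration and compute_perplexity are placeholders, not MADlib functions):
{code:python}
# Sketch of the proposed perplexity-based stopping criterion.
# run_iteration and compute_perplexity stand in for MADlib internals.

def lda_train_loop(iter_num, evaluate_every, perplexity_tol,
                   run_iteration, compute_perplexity):
    """Run up to iter_num iterations; if evaluate_every > 0, evaluate
    perplexity every evaluate_every iterations and stop once the change
    in perplexity falls below perplexity_tol."""
    last = None
    history = []
    for it in range(1, iter_num + 1):
        run_iteration(it)
        if evaluate_every > 0 and it % evaluate_every == 0:
            p = compute_perplexity()
            history.append(p)
            if last is not None and abs(last - p) < perplexity_tol:
                return it, history  # converged early
            last = p
    return iter_num, history  # hit the iteration cap
{code}
With evaluate_every = 0 the loop behaves exactly as today (fixed iter_num iterations, no perplexity evaluation), so the change is backwards compatible.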
Let me know your thoughts on this. 

 

Thanks! 

> Add stopping criteria on perplexity to LDA
> ------------------------------------------
>
>                 Key: MADLIB-1351
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1351
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Frank McQuillan
>            Assignee: Himanshu Pandey
>            Priority: Minor
>             Fix For: v1.17
>
>
> In LDA 
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> make stopping criteria on perplexity rather than just number of iterations.
> Suggested approach is to do what scikit-learn does
> https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
> evaluate_every : int, optional (default=0)
> How often to evaluate perplexity. Set it to 0 or negative number to not 
> evaluate perplexity in training at all. Evaluating perplexity can help you 
> check convergence in training process, but it will also increase total 
> training time. Evaluating perplexity in every iteration might increase 
> training time up to two-fold.
> perplexity_tol : float, optional (default=1e-1)
> Perplexity tolerance to stop iterating. Only used when evaluate_every is 
> greater than 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)