fmcquillan99 commented on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-546154433
 
 
   (1)
   Please add `num_iterations` to the output table.  This is needed now because
   we have a perplexity tolerance, so training may not run for the maximum number
   of iterations specified.  The model table should look like:
   
   ```
   model_table
   ...
   model            BIGINT[]. The encoded model ...etc...
   num_iterations   INTEGER. Number of iterations that training ran for,
                    which may be less than the maximum value specified in the
                    parameter 'iter_num' if the perplexity tolerance was reached.
   perplexity       DOUBLE PRECISION[]. Array of ...etc....
   ...
   ```
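
To illustrate why the iteration count in (1) can fall short of 'iter_num', here is a minimal Python sketch of a training loop with a perplexity tolerance. It is not the MADlib implementation: the function `train_with_perplexity_tol` and its callbacks `run_iteration` and `compute_perplexity` are hypothetical, and the stopping rule (stop when the absolute change between successive perplexity evaluations is below the tolerance) is an assumption.

```python
def train_with_perplexity_tol(run_iteration, compute_perplexity,
                              iter_num, evaluate_every, perplexity_tol):
    """Hypothetical loop: returns (num_iterations, perplexity values)."""
    perplexities = []
    num_iterations = 0
    for it in range(1, iter_num + 1):
        run_iteration()          # one training pass (stand-in callback)
        num_iterations = it      # this is the value to report in the model table
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexities.append(compute_perplexity())
            # Assumed rule: stop once the change between successive
            # evaluations drops strictly below the tolerance.
            if (len(perplexities) >= 2 and
                    abs(perplexities[-1] - perplexities[-2]) < perplexity_tol):
                break
    return num_iterations, perplexities
```

Under this assumed rule the comparison is strict, so a tolerance of 0.0 never triggers early stopping and the loop runs for the full 'iter_num' iterations.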
   
   (2)
   The parameter 'perplexity_tol' can be any value >= 0.0.  Currently it errors
   out for values below 0.1, which is not correct.  I may want to set it to 0.0
   so that training runs for the full number of iterations.  So please change it
   to error out only if 'perplexity_tol' < 0.
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            103,                     -- vocabulary size
                            5,                       -- number of topics
                            10,                      -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            2,                       -- Evaluate perplexity every 2 iterations
                            0.0                      -- Set tolerance to 0 so runs full number of iterations
                          );
   ```
   produces
   ```
   InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than .1 (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "lda_train", line 22, in <module>
       voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
     PL/Python function "lda_train", line 519, in lda_train
     PL/Python function "lda_train", line 96, in _assert
   PL/Python function "lda_train"
    [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table 
in the form of term frequency\n                         'lda_model_perp',       
 -- model table created by LDA training (not human readable)\n                  
       'lda_output_data_perp',  -- readable output data table \n                
         103,                     -- vocabulary size\n                         
5,                       -- number of topics\n                         10,      
                -- number of iterations\n                         5,            
           -- Dirichlet prior for the per-doc topic multinomial (alpha)\n       
                  0.01,                    -- Dirichlet prior for the per-topic 
word multinomial (beta)\n                         2,                       -- 
Evaluate perplexity every 2 iterations\n                         0.0            
          -- Set tolerance to 0 so runs full number of iterations\n             
          );"]
   ```
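
The check requested in (2) could look roughly like the Python sketch below. The function `validate_perplexity_tol` is hypothetical, and it raises a plain `ValueError` as a stand-in because `plpy` is only available inside PL/Python. The traceback shows that the real code goes through an `_assert` helper, whose signature is not shown here.

```python
def validate_perplexity_tol(perplexity_tol):
    """Hypothetical check: accept any tolerance >= 0, reject negatives."""
    if perplexity_tol < 0:
        # Stand-in for the plpy.Error raised by the real validation code.
        raise ValueError(
            "invalid argument: perplexity_tol should not be less than 0")
    return perplexity_tol
```

With a check like this, `validate_perplexity_tol(0.0)` and `validate_perplexity_tol(0.05)` are accepted, while `validate_perplexity_tol(-0.1)` raises an error.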
