fmcquillan99 commented on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-549600980
 
 
   
   Re-test after latest commits
   
   
   
   (1)
   Please add `num_iterations` to the output table. This is needed now because
   we have a perplexity tolerance, so training may run for fewer than the maximum
   number of iterations specified. The model table should look like:
   
   ```
   model_table
   ...
   model            BIGINT[]. The encoded model ...etc...
   num_iterations   INTEGER. Number of iterations that training actually ran,
                    which may be less than the maximum specified in the
                    'iter_num' parameter if the perplexity tolerance was reached.
   perplexity       DOUBLE PRECISION[]. Array of ...etc...
   ...
   ```
   
   Now looks like:
   
   ```
   -[ RECORD 1 ]----+--------------------------------------------
   voc_size         | 384
   topic_num        | 5
   alpha            | 5
   beta             | 0.01
   num_iterations   | 9
   perplexity       | {196.148467882,192.142777576,193.872066117}
   perplexity_iters | {3,6,9}
   ```
   
   OK
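
   One quick consistency check (a sketch; it assumes the `lda_model_perp` model
   table from the runs below): when training stops early on the tolerance,
   `num_iterations` should line up with the last entry of `perplexity_iters`:

   ```
   -- Sketch: compare num_iterations with the last evaluated iteration.
   SELECT num_iterations,
          perplexity_iters[array_upper(perplexity_iters, 1)] AS last_eval_iter
   FROM lda_model_perp;
   -- For the record above this would return 9 and 9.
   ```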
   
   
   
   (2)
   The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors
   out below a value of 0.1, which is not correct. I may want to set it to 0.0
   so that training runs for the full number of iterations. So please change it
   to error out only if 'perplexity_tol' < 0.
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            384,                     -- vocabulary size
                            5,                       -- number of topics
                            20,                      -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            2,                       -- Evaluate perplexity every 2 iterations
                            0.0                      -- Set tolerance to 0 so training runs the full number of iterations
                          );
   ```
   produces
   ```
   -[ RECORD 1 ]----+--------------------------------------------
   voc_size         | 384
   topic_num        | 5
   alpha            | 5
   beta             | 0.01
   num_iterations   | 20
   perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
   perplexity_iters | {2,4,6,8,10,12,14,16,18,20}
   ```
   
   OK
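
   A boundary check for the fix (a sketch, reusing the same tables): only a
   negative tolerance should now be rejected:

   ```
   -- Sketch: perplexity_tol < 0 should raise an error; 0.0 (above) must succeed.
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

   SELECT madlib.lda_train( 'documents_tf', 'lda_model_perp', 'lda_output_data_perp',
                            384, 5, 10, 5, 0.01, 2,
                            -0.1                     -- invalid: perplexity_tol < 0
                          );
   ```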
   
   (3)
   The last iteration value for perplexity did not match the final perplexity value. Re-testing:
   
   ```
   DROP TABLE IF EXISTS documents;
   CREATE TABLE documents(docid INT4, contents TEXT);
   
   INSERT INTO documents VALUES
   (0, 'Statistical topic models are a class of Bayesian latent variable 
models, originally developed for analyzing the semantic content of large 
document corpora.'),
   (1, 'By the late 1960s, the balance between pitching and hitting had swung 
in favor of the pitchers. In 1968 Carl Yastrzemski won the American League 
batting title with an average of just .301, the lowest in history.'),
   (2, 'Machine learning is closely related to and often overlaps with 
computational statistics; a discipline that also specializes in 
prediction-making. It has strong ties to mathematical optimization, which 
deliver methods, theory and application domains to the field.'),
   (3, 'California''s diverse geography ranges from the Sierra Nevada in the 
east to the Pacific Coast in the west, from the Redwood Douglas fir forests of 
the northwest, to the Mojave Desert areas in the southeast. The center of the 
state is dominated by the Central Valley, a major agricultural area.'),
   (4, 'One of the many applications of Bayes'' theorem is Bayesian inference, 
a particular approach to statistical inference. When applied, the probabilities 
involved in Bayes'' theorem may have different probability interpretations. 
With the Bayesian probability interpretation the theorem expresses how a degree 
of belief, expressed as a probability, should rationally change to account for 
availability of related evidence. Bayesian inference is fundamental to Bayesian 
statistics.'),
   (5, 'When data are unlabelled, supervised learning is not possible, and an 
unsupervised learning approach is required, which attempts to find natural 
clustering of the data to groups, and then map new data to these formed groups. 
The support-vector clustering algorithm, created by Hava Siegelmann and 
Vladimir Vapnik, applies the statistics of support vectors, developed in the 
support vector machines algorithm, to categorize unlabeled data, and is one of 
the most widely used clustering algorithms in industrial applications.'),
   (6, 'Deep learning architectures such as deep neural networks, deep belief 
networks, recurrent neural networks and convolutional neural networks have been 
applied to fields including computer vision, speech recognition, natural 
language processing, audio recognition, social network filtering, machine 
translation, bioinformatics, drug design, medical image analysis, material 
inspection and board game programs, where they have produced results comparable 
to and in some cases superior to human experts.'),
   (7, 'A multilayer perceptron is a class of feedforward artificial neural 
network. An MLP consists of at least three layers of nodes: an input layer, a 
hidden layer and an output layer. Except for the input nodes, each node is a 
neuron that uses a nonlinear activation function. MLP utilizes a supervised 
learning technique called backpropagation for training.'),
   (8, 'In mathematics, an ellipse is a plane curve surrounding two focal 
points, such that for all points on the curve, the sum of the two distances to 
the focal points is a constant.'),
   (9, 'In artificial neural networks, the activation function of a node 
defines the output of that node given an input or set of inputs.'),
   (10, 'In mathematics, graph theory is the study of graphs, which are 
mathematical structures used to model pairwise relations between objects. A 
graph in this context is made up of vertices (also called nodes or points) 
which are connected by edges (also called links or lines). A distinction is 
made between undirected graphs, where edges link two vertices symmetrically, 
and directed graphs, where edges link two vertices asymmetrically; see Graph 
(discrete mathematics) for more detailed definitions and for other variations 
in the types of graph that are commonly considered. Graphs are one of the prime 
objects of study in discrete mathematics.'),
   (11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a 
machine intentionally designed to perform a simple task in an indirect and 
overly complicated way. Usually, these machines consist of a series of simple 
unrelated devices; the action of each triggers the initiation of the next, 
eventually resulting in achieving a stated goal.'),
   (12, 'In statistics, the logistic model (or logit model) is used to model 
the probability of a certain class or event existing such as pass/fail, 
win/lose, alive/dead or healthy/sick. This can be extended to model several 
classes of events such as determining whether an image contains a cat, dog, 
lion, etc... Each object being detected in the image would be assigned a 
probability between 0 and 1 and the sum adding to one.'),
   (13, 'k-means clustering is a method of vector quantization, originally from 
signal processing, that is popular for cluster analysis in data mining. k-means 
clustering aims to partition n observations into k clusters in which each 
observation belongs to the cluster with the nearest mean, serving as a 
prototype of the cluster.'),
   (14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a 
non-parametric method used for classification and regression.');
   
   
   ALTER TABLE documents ADD COLUMN words TEXT[];
   
   UPDATE documents SET words =
       regexp_split_to_array(lower(
       regexp_replace(contents, E'[,.;\']','', 'g')
       ), E'[\\s+]');
   
   
   DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
   
   SELECT madlib.term_frequency('documents',    -- input table
                                'docid',        -- document id column
                                'words',        -- vector of words in document
                                'documents_tf', -- output documents table with term frequency
                                TRUE);          -- TRUE to create vocabulary table
   ```
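
   To confirm the vocabulary size passed to `lda_train` below (a sketch;
   `documents_tf_vocabulary` is the vocabulary table created when the last
   argument to `term_frequency` is TRUE):

   ```
   -- Sketch: the row count of the vocabulary table is the vocabulary size.
   SELECT COUNT(*) FROM documents_tf_vocabulary;
   -- Expect 384 for the documents above.
   ```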
   
   Train
   
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            384,                     -- vocabulary size
                            5,                       -- number of topics
                            100,                     -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            1,                       -- Evaluate perplexity every n iterations
                            0.1                      -- Stopping perplexity tolerance
                          );
   
   SELECT voc_size, topic_num, alpha, beta, num_iterations, perplexity, perplexity_iters
   FROM lda_model_perp;
   
   -[ RECORD 1 ]----+--------------------------------------------
   voc_size         | 384
   topic_num        | 5
   alpha            | 5
   beta             | 0.01
   num_iterations   | 16
   perplexity       | {195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
   perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
   ```
   
   Perplexity on input data
   
   ```
   SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                     'lda_output_data_perp'
                                   );
   
    lda_get_perplexity 
   --------------------
      189.182916246556
   (1 row)
   
   ```
   which matches the last value in the perplexity array from the training function.
   
   OK
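
   The same check can be run directly against the model table (a sketch):

   ```
   -- Sketch: the final perplexity should equal the last element of the
   -- perplexity array stored in the model table.
   SELECT perplexity[array_upper(perplexity, 1)] AS final_perplexity
   FROM lda_model_perp;
   -- Expect 189.182916247, matching lda_get_perplexity above.
   ```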
   
   
   And here is a new one:
   
   (5)
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            384,                     -- vocabulary size
                            5,                       -- number of topics
                            20,                      -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            2                        -- Evaluate perplexity every n iterations
                          );
   ```
   produces
   ```
   Done.
   (psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
   LINE 1: SELECT madlib.lda_train( 'documents_tf',          -- documen...
                  ^
   HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
   ```
   
   This should give the same result as:
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            384,                     -- vocabulary size
                            5,                       -- number of topics
                            20,                      -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            2,                       -- Evaluate perplexity every n iterations
                            NULL                     -- perplexity tolerance
                          );
   ```
   which actually does work if you pass `NULL` for the last parameter, so the 10-argument signature exists but the 9-argument overload (omitting 'perplexity_tol') does not.
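
   One way to see which overloads are actually installed (a sketch; `\df` is the
   psql meta-command for listing function signatures):

   ```
   -- Sketch: list installed signatures of lda_train to check whether a
   -- 9-argument variant (without perplexity_tol) is defined.
   \df madlib.lda_train
   ```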
