fmcquillan99 commented on issue #432: MADLIB-1351 : Added stopping criteria on perplexity to LDA
URL: https://github.com/apache/madlib/pull/432#issuecomment-547026743
 
 
   (3)
   Last iteration value for perplexity does not match final perplexity value:
   
   ```
   DROP TABLE IF EXISTS documents;
   CREATE TABLE documents(docid INT4, contents TEXT);
   
   INSERT INTO documents VALUES
   (0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
   (1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
   (2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
   (3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
   (4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
   (5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
   (6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
   (7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
   (8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
   (9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
   (10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
   (11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
   (12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
   (13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
   (14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');
   
   
   ALTER TABLE documents ADD COLUMN words TEXT[];
   
   UPDATE documents SET words = 
       regexp_split_to_array(lower(
       regexp_replace(contents, E'[,.;\']','', 'g')
       ), E'[\\s+]');
   
   
   DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
   
   SELECT madlib.term_frequency('documents',    -- input table
                                'docid',        -- document id column
                                'words',        -- vector of words in document
                                'documents_tf', -- output documents table with term frequency
                                TRUE);          -- TRUE to create vocabulary table
   ```
   
   Train
   
   
   ```
   DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;
   
   SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                            'lda_model_perp',        -- model table created by LDA training (not human readable)
                            'lda_output_data_perp',  -- readable output data table
                            384,                     -- vocabulary size
                            5,                       -- number of topics
                            100,                     -- number of iterations
                            5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                            0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                            1,                       -- Evaluate perplexity every n iterations
                            0.1                      -- Stopping perplexity tolerance
                          );
   
   SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;
   
   -[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------
   voc_size         | 384
   topic_num        | 5
   alpha            | 5
   beta             | 0.01
   perplexity       | {195.764020671,194.317808815,193.208428811,188.2838923,188.384646897,189.849099875,189.939592275}
   perplexity_iters | {1,2,3,4,5,6,7}
   ```
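
As a side check on the training output above, the sketch below tests whether stopping at iteration 7 is consistent with a tolerance of 0.1. It assumes the stopping rule is "absolute change between two consecutive perplexity evaluations is below the tolerance"; that rule is my assumption for illustration and is not confirmed anywhere in this comment. The function name `stopping_iteration` is hypothetical.

```python
# Perplexity values and iteration numbers copied from the lda_model_perp output above.
PERPLEXITY = [195.764020671, 194.317808815, 193.208428811, 188.2838923,
              188.384646897, 189.849099875, 189.939592275]
PERPLEXITY_ITERS = [1, 2, 3, 4, 5, 6, 7]
TOLERANCE = 0.1  # "Stopping perplexity tolerance" passed to lda_train


def stopping_iteration(perplexities, iterations, tolerance):
    """Return the first iteration whose perplexity differs from the previous
    evaluation by less than `tolerance`, or None if that never happens.

    ASSUMPTION: the rule is an absolute difference between consecutive
    evaluations; the actual rule used by lda_train may differ.
    """
    for i in range(1, len(perplexities)):
        if abs(perplexities[i] - perplexities[i - 1]) < tolerance:
            return iterations[i]
    return None


if __name__ == "__main__":
    print(stopping_iteration(PERPLEXITY, PERPLEXITY_ITERS, TOLERANCE))
```

Under that assumption the change from iteration 4 to 5 (about 0.1008) is just above the tolerance and the change from iteration 6 to 7 (about 0.0905) is the first one below it, so stopping after 7 iterations looks consistent with the setting.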
   
   Predict on input data
   
   ```
   DROP TABLE IF EXISTS outdata_predict_perp;
   
   SELECT madlib.lda_predict( 'documents_tf',          -- Document to predict
                              'lda_model_perp',        -- LDA model from training
                              'outdata_predict_perp'
                            );
   
   SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                     'outdata_predict_perp'
                                   );
   
   -[ RECORD 1 ]------+-----------------
   lda_get_perplexity | 192.569799335159
   ```
   
   I would expect this to be `189.939592275`, which is the last value in the perplexity array (iteration 7).
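
To put a number on the mismatch, here is a small sketch that compares the value returned by `lda_get_perplexity` with the training-time array. It only does arithmetic on the figures reported above; the variable names are illustrative and nothing here calls MADlib.

```python
# Values copied from the outputs in this comment.
TRAIN_PERPLEXITY = [195.764020671, 194.317808815, 193.208428811, 188.2838923,
                    188.384646897, 189.849099875, 189.939592275]
PREDICT_PERPLEXITY = 192.569799335159  # result of lda_get_perplexity

# Gap between the predicted value and the last training-time value.
last_train = TRAIN_PERPLEXITY[-1]
gap = PREDICT_PERPLEXITY - last_train
relative_gap = gap / last_train

# Training-time value closest to the predicted one, to see whether the
# prediction simply matches an earlier iteration.
closest = min(TRAIN_PERPLEXITY, key=lambda p: abs(p - PREDICT_PERPLEXITY))
closest_iter = TRAIN_PERPLEXITY.index(closest) + 1  # iterations are 1-based here

if __name__ == "__main__":
    print(f"gap vs last iteration: {gap:.6f} ({relative_gap:.2%})")
    print(f"closest training value: {closest} at iteration {closest_iter}")
```

The predicted value is about 2.63 higher than the last training value and does not coincide with any entry in the array (the nearest is the iteration 3 value, still about 0.64 away), so it does not look like an off-by-one in which iteration is reported.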
