Hi Frank,

Thanks for the response. The solution you suggest doesn't fully answer my question.
I don't want to join the input table with the top-k wordids from each topic, since that is not what lda_predict is doing. Am I right? I want to explain the results of madlib.lda_predict exactly, i.e. why lda_predict gives a particular topic distribution for a given set of wordids. I understand your approach, but unfortunately it doesn't fulfill my needs.

I have found a solution for my needs and was just asking if you know a better way to handle the results of lda_predict (or the data in *output_data_table* of lda_train). So my question was more about how to construct the wordid array from the 'counts' and 'words' columns of the madlib.lda_predict results, or e.g. 'my_outdata' in your example. Now I am doing it with:

(counts::text || ':' || words::text)::madlib.svec::float[]

The thing I don't understand about the *output_data_table* format is why *topic_assignment* is a dense vector while *words* is a sparse vector. I would suggest changing the *words* format into a dense vector, which would remove the need to play with the madlib.svec object when explaining the lda_predict results exactly. The *counts* column would then no longer be needed in *output_data_table*.

Regards,
Markus Paaso

2017-09-21 22:25 GMT+03:00 Frank McQuillan <[email protected]>:

> Markus,
>
> Sorry for the delay in response. Below is another approach that involves
> a JOIN between term frequency and an LDA output table. It is a bit simpler,
> at least for me.
>
> Regarding the indexing, you are correct that it is inconsistent. It seems
> the term frequency module uses 0-based indexing (e.g., docid) whereas LDA
> uses 1-based indexing (e.g., topicid). I created a JIRA to address this:
> https://issues.apache.org/jira/browse/MADLIB-1160
> Probably the best way is to standardize on 1-based indexing.
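As an aside, the run-length decoding behind that svec cast can be sketched outside the database. The following is a minimal Python illustration (not MADlib's implementation), assuming svec's '{counts}:{values}' text form, where each value is repeated by its run length:

```python
def svec_to_array(text):
    """Decode a '{counts}:{values}' svec literal into a dense list,
    repeating each value by its run length."""
    runs_part, vals_part = text.split('}:{')
    runs = [int(x) for x in runs_part.lstrip('{').split(',')]
    vals = [float(x) for x in vals_part.rstrip('}').split(',')]
    # Expand: value v appears r times in the dense array.
    return [v for r, v in zip(runs, vals) for _ in range(r)]

# '{1,2,1}:{7386,42021,7705}' means one 7386, two 42021s, one 7705.
print(svec_to_array('{1,2,1}:{7386,42021,7705}'))
# -> [7386.0, 42021.0, 42021.0, 7705.0]
```

The dense result has one entry per token, so it lines up index-by-index with topic_assignment.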
>
> ----
>
> 1) Create data set
>
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, 'Statistical topic models are a class of Bayesian latent variable
> models, originally developed for analyzing the semantic content of large
> document corpora.'),
> (1, 'By the late 1960s, the balance between pitching and hitting had swung
> in favor of the pitchers. In 1968 Carl Yastrzemski won the American League
> batting title with an average of just .301, the lowest in history.'),
> (2, 'Machine learning is closely related to and often overlaps with
> computational statistics; a discipline that also specializes in
> prediction-making. It has strong ties to mathematical optimization, which
> deliver methods, theory and application domains to the field.'),
> (3, 'California''s diverse geography ranges from the Sierra Nevada in the
> east to the Pacific Coast in the west, from the Redwood–Douglas fir forests
> of the northwest, to the Mojave Desert areas in the southeast. The center
> of the state is dominated by the Central Valley, a major agricultural area.
> ');
>
>
> 2) Convert docs to a list of words
>
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents),
> E'[\\s+\\.\\,]');
>
>
> 3) Term frequency
>
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words',
> 'my_training', TRUE);
> SELECT * FROM my_training order by docid limit 20;
>  docid | wordid | count
> -------+--------+-------
>      0 |     57 |     1
>      0 |     86 |     1
>      0 |      4 |     1
>      0 |     55 |     1
>      0 |     69 |     2
>      0 |     81 |     1
>      0 |     30 |     1
>      0 |     33 |     1
>      0 |     36 |     1
>      0 |     43 |     1
>      0 |     25 |     1
>      0 |     65 |     2
>      0 |     72 |     1
>      0 |      9 |     1
>      0 |      0 |     2
>      0 |     29 |     1
>      0 |     18 |     1
>      0 |     12 |     1
>      0 |     96 |     1
>      0 |     91 |     1
> (20 rows)
>
>
> 4) Run LDA
>
> DROP TABLE IF EXISTS my_model, my_outdata;
> SELECT madlib.lda_train( 'my_training', -- Data table
>                          'my_model',    -- Model table
>                          'my_outdata',  -- Output table
>                          104,           -- Vocab size
>                          5,             -- Num topics
>                          10,            -- Num iterations
>                          5,             -- Alpha (Dirichlet param)
>                          0.01           -- Beta (Dirichlet param)
>                        );
>
>
> 5) Get topic description by top-k words
>
> DROP TABLE IF EXISTS my_topic_desc;
> SELECT madlib.lda_get_topic_desc( 'my_model',
>                                   'my_training_vocabulary',
>                                   'my_topic_desc',
>                                   15);
>
> 6) Join topic description with term frequency
>
> SELECT docid, wordid, topicid FROM my_topic_desc JOIN my_training USING
> (wordid) WHERE docid=3 ORDER BY docid, wordid;
>
>  docid | wordid | topicid
> -------+--------+---------
>      3 |      0 |       3
>      3 |      4 |       2
>      3 |      5 |       3
>      3 |     13 |       2
>      3 |     14 |       5
>      3 |     20 |       5
>      3 |     21 |       2
>      3 |     23 |       2
>      3 |     24 |       3
>      3 |     27 |       2
>      3 |     32 |       3
>      3 |     35 |       3
>      3 |     38 |       5
>      3 |     39 |       5
> etc
>
> Frank
>
>
>
> On Tue, Aug 29, 2017 at 11:34 PM, Markus Paaso <[email protected]>
> wrote:
>
>> Hi Frank,
>>
>> I want to explain the LDA results for a single document (in this case for
>> docid = 6) by binding topicid into each wordid in the document.
>> The SQL query below gives exactly what I want but I am not sure if that
>> is the most effective way to build docid-wordid-topicid triples.
>>
>> SELECT docid, unnest((counts::text || ':' ||
>> words::text)::madlib.svec::float[]) AS wordid, unnest(topic_assignment)
>> + 1 AS topicid FROM lda_output WHERE docid = 6;
>>
>> I have trained LDA with 'lda_output' as the output_data_table argument
>> in madlib.lda_train.
>>
>>
>> Regards, Markus
>>
>> 2017-08-28 23:19 GMT+03:00 Frank McQuillan <[email protected]>:
>>
>>> Markus,
>>>
>>> Please see example 4 in the user docs
>>> http://madlib.apache.org/docs/latest/group__grp__lda.html#examples
>>> which provides helper functions for learning more about the learned
>>> model.
>>>
>>> -- The topic description by top-k words
>>> DROP TABLE IF EXISTS my_topic_desc;
>>> SELECT madlib.lda_get_topic_desc( 'my_model',
>>>                                   'my_training_vocabulary',
>>>                                   'my_topic_desc',
>>>                                   15);
>>> select * from my_topic_desc order by topicid, prob DESC;
>>>
>>> produces:
>>>
>>>  topicid | wordid |        prob        |  word
>>> ---------+--------+--------------------+---------
>>>        1 |     69 |  0.181900726392252 | of
>>>        1 |     52 | 0.0608353510895884 | is
>>>        1 |     65 | 0.0608353510895884 | models
>>>        1 |     30 | 0.0305690072639225 | corpora
>>>        1 |      1 | 0.0305690072639225 | 1960s
>>>        1 |     57 | 0.0305690072639225 | latent
>>>
>>> Please let us know if this is of use, or you are looking for something
>>> else?
>>>
>>> Frank
>>>
>>>
>>> On Fri, Aug 11, 2017 at 6:45 AM, Markus Paaso <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I found a working but quite awkward way to form docid-wordid-topicid
>>>> pairing with a single SQL query:
>>>>
>>>> SELECT docid, unnest((counts::text || ':' ||
>>>> words::text)::madlib.svec::float[]) AS wordid,
>>>> unnest(topic_assignment) + 1 AS topicid FROM lda_output WHERE docid = 6;
>>>>
>>>> Output:
>>>>
>>>>  docid | wordid | topicid
>>>> -------+--------+---------
>>>>      6 |   7386 |       3
>>>>      6 |  42021 |      17
>>>>      6 |  42021 |      17
>>>>      6 |   7705 |      12
>>>>      6 | 105334 |      16
>>>>      6 |  18083 |       3
>>>>      6 |  89364 |       3
>>>>      6 |  31073 |       3
>>>>      6 |  28934 |       3
>>>>      6 |  28934 |      16
>>>>      6 |  56286 |      16
>>>>      6 |  61921 |       3
>>>>      6 |  61921 |       3
>>>>      6 |  59142 |      17
>>>>      6 |  33364 |       3
>>>>      6 |  79035 |      17
>>>>      6 |  37792 |      11
>>>>      6 |  91823 |      11
>>>>      6 |  30422 |       3
>>>>      6 |  94672 |      17
>>>>      6 |  62107 |       3
>>>>      6 |  94673 |       2
>>>>      6 |  62080 |      16
>>>>      6 | 101046 |      17
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |   4379 |       8
>>>>      6 |  26503 |      12
>>>>      6 |  61105 |       3
>>>>      6 |  19193 |       3
>>>>      6 |  28929 |       3
>>>>
>>>>
>>>> Is there any simpler way to do that?
>>>>
>>>>
>>>> Regards,
>>>> Markus Paaso
>>>>
>>>>
>>>>
>>>> 2017-08-11 15:23 GMT+03:00 Markus Paaso <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am having some problems reading the LDA output.
>>>>>
>>>>>
>>>>> Please see this row of madlib.lda_train output:
>>>>>
>>>>> docid            | 6
>>>>> wordcount        | 33
>>>>> words            | {7386,42021,7705,105334,18083,89364,31073,28934,56286,61921,59142,33364,79035,37792,91823,30422,94672,62107,94673,62080,101046,4379,26503,61105,19193,28929}
>>>>> counts           | {1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,5,1,1,1,1}
>>>>> topic_count      | {0,1,13,0,0,0,0,5,0,0,2,2,0,0,0,4,6,0,0,0}
>>>>> topic_assignment | {2,16,16,11,15,2,2,2,2,15,15,2,2,16,2,16,10,10,2,16,2,1,15,16,7,7,7,7,7,11,2,2,2}
>>>>>
>>>>> It's hard to tell which topic id is assigned to which word id when the
>>>>> *words* array has a different length than the *topic_assignment* array.
>>>>> It would be nice if the *words* array were the same length as the
>>>>> *topic_assignment* array.
>>>>>
>>>>> 1. What kind of SQL query would give a result with wordid-topicid
>>>>> pairs?
>>>>> I tried to match them by hand but failed for wordid 28934. I wonder
>>>>> if a repeating wordid can have different topic assignments in the same
>>>>> document?
>>>>>
>>>>> wordid | topicid
>>>>> ----------------
>>>>>   7386 | 2
>>>>>  42021 | 16
>>>>>   7705 | 11
>>>>> 105334 | 15
>>>>>  18083 | 2
>>>>>  89364 | 2
>>>>>  31073 | 2
>>>>>  28934 | 2 OR 15 ?
>>>>>  56286 | 15
>>>>>  61921 | 2
>>>>>  59142 | 16
>>>>>  33364 | 2
>>>>>  79035 | 16
>>>>>  37792 | 10
>>>>>  91823 | 10
>>>>>  30422 | 2
>>>>>  94672 | 16
>>>>>  62107 | 2
>>>>>  94673 | 1
>>>>>  62080 | 15
>>>>> 101046 | 16
>>>>>   4379 | 7
>>>>>  26503 | 11
>>>>>  61105 | 2
>>>>>  19193 | 2
>>>>>  28929 | 2
>>>>>
>>>>>
>>>>> 2. Why is *topic_assignment* using zero-based indexing while the
>>>>> other results use one-based indexing?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Markus Paaso
>>>>
>>>>
>>>> --
>>>> Markus Paaso
>>>> Tel: +358504067849 <+358%2050%204067849>
>>
>>
>> --
>> Markus Paaso
>> Tel: +358504067849 <+358%2050%204067849>

--
Markus Paaso
Tel: +358504067849
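The alignment question raised in the quoted thread (can a repeating wordid carry different topic assignments?) can be checked outside the database. Below is a minimal Python sketch of the run-length expansion (an illustration, not MADlib code), using the docid=6 row quoted above:

```python
# Data copied from the docid=6 row of the lda_train output table above.
words = [7386, 42021, 7705, 105334, 18083, 89364, 31073, 28934, 56286,
         61921, 59142, 33364, 79035, 37792, 91823, 30422, 94672, 62107,
         94673, 62080, 101046, 4379, 26503, 61105, 19193, 28929]
counts = [1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1]
topic_assignment = [2, 16, 16, 11, 15, 2, 2, 2, 2, 15, 15, 2, 2, 16, 2,
                    16, 10, 10, 2, 16, 2, 1, 15, 16, 7, 7, 7, 7, 7, 11,
                    2, 2, 2]

# 'words' is run-length encoded (words[i] occurs counts[i] times), while
# 'topic_assignment' has one entry per token (wordcount entries in total).
# Expanding 'words' by 'counts' makes the arrays the same length, so they
# can be zipped into per-token (wordid, topicid) pairs.
expanded = [w for w, c in zip(words, counts) for _ in range(c)]
assert len(expanded) == len(topic_assignment) == 33
pairs = list(zip(expanded, topic_assignment))

# The '28934 | 2 OR 15 ?' question: both. A repeated wordid can carry
# different topic assignments within the same document.
print([t for w, t in pairs if w == 28934])
# -> [2, 15]
```

This matches the 1-based output in the thread, where wordid 28934 appears with topicid 3 and topicid 16.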
