[ https://issues.apache.org/jira/browse/MADLIB-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357655#comment-16357655 ]
Frank McQuillan edited comment on MADLIB-1201 at 2/8/18 10:22 PM:
------------------------------------------------------------------

Testing this, it seems to work fine now. For the small example in the Jira description above, I get:

{code:java}
madlib=# select * from my_outdata order by docid;
 docid | wordcount |   words   |  counts   | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
     0 |         5 | {1,3,0,2} | {2,1,1,1} | {2,3}       | {1,1,0,1,0}
     1 |         7 | {5,0,4,6} | {1,2,1,3} | {3,4}       | {0,1,0,0,1,1,1}
(2 rows){code}
and
{code:java}
madlib=# SELECT * FROM my_word_topic_count ORDER BY wordid;
 wordid | topic_count
--------+-------------
      0 | {1,2}
      1 | {0,2}
      2 | {1,0}
      3 | {1,0}
      4 | {1,0}
      5 | {1,0}
      6 | {0,3}
(7 rows){code}
which are consistent: if you expand the words vector by the counts vector, each wordid lines up with its topic assignment, and the per-word totals match my_word_topic_count:
{code:java}
doc 0 words (exp)   1 1 3 0 2
topic_assignment    1 1 0 1 0

doc 1 words (exp)   5 0 0 4 6 6 6
topic_assignment    0 1 0 0 1 1 1
{code}

> Inconsistent lda output tables
> ------------------------------
>
>                 Key: MADLIB-1201
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1201
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Jingyi Mei
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> We found an inconsistency in the LDA module between the outputs of lda_train
> and lda_get_word_topic_count.
> Repro Steps
> {code}
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, ' b a a c'),
> (1, ' d e f f f ');
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents),
> E'[\\s+\\.\\,]');
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training',
> TRUE);
> DROP TABLE IF EXISTS my_model, my_outdata;
> SELECT madlib.lda_train( 'my_training',  -- data_table
> 'my_model',                              -- model_table
> 'my_outdata',                            -- output_data_table
> 7,                                       -- voc_size
> 2,                                       -- topic_num
> 1,                                       -- iter_num
> 5,                                       -- alpha
> 0.01                                     -- beta
> );
> select * from my_outdata order by docid;
> ```
>  docid | wordcount |   words   |  counts   | topic_count | topic_assignment
> -------+-----------+-----------+-----------+-------------+------------------
>      0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
>      1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
> ```
> DROP TABLE IF EXISTS my_word_topic_count;
> SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
> SELECT * FROM my_word_topic_count ORDER BY wordid;
> ```
>  wordid | topic_count
> --------+-------------
>       0 | {1,2}
>       1 | {0,2}
>       2 | {1,0}
>       3 | {0,1}
>       4 | {1,0}
>       5 | {0,1}
>       6 | {0,3}
> (7 rows)
> ```
> {code}
> The output of 'my_outdata' indicates that wordid 3 gets assigned only to
> topic 0, but the output of my_word_topic_count indicates that wordid 3 gets
> assigned only to topic 1. The two outputs seem to be inconsistent with each
> other.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
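
The consistency check described in the comment (expand each document's words by its counts, pair the result with topic_assignment, and tally per wordid) can be sketched outside the database. This is only an illustration, not MADlib code: the helper names (expand, word_topic_counts, mismatches) are hypothetical, the rows are copied by hand from the tables in this message, and it assumes topic_assignment is ordered to match the expanded words vector, as the comment does.

```python
from collections import defaultdict


def expand(words, counts):
    """Repeat each wordid by its count, e.g. [1, 3], [2, 1] -> [1, 1, 3]."""
    return [w for w, c in zip(words, counts) for _ in range(c)]


def word_topic_counts(rows, num_topics):
    """Tally, per wordid, how many tokens were assigned to each topic."""
    tally = defaultdict(lambda: [0] * num_topics)
    for words, counts, topic_assignment in rows:
        expanded = expand(words, counts)
        # Assumption: one topic per token, in expanded-word order.
        assert len(expanded) == len(topic_assignment)
        for wordid, topic in zip(expanded, topic_assignment):
            tally[wordid][topic] += 1
    return dict(tally)


def mismatches(rows, reported, num_topics=2):
    """Return the wordids whose derived tally differs from the reported one."""
    derived = word_topic_counts(rows, num_topics)
    return sorted(
        w for w in reported
        if derived.get(w, [0] * num_topics) != reported[w]
    )


# (words, counts, topic_assignment) per document, after the fix.
fixed_outdata = [
    ([1, 3, 0, 2], [2, 1, 1, 1], [1, 1, 0, 1, 0]),
    ([5, 0, 4, 6], [1, 2, 1, 3], [0, 1, 0, 0, 1, 1, 1]),
]
fixed_reported = {0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [1, 0],
                  4: [1, 0], 5: [1, 0], 6: [0, 3]}

# Same shape, taken from the original bug report.
buggy_outdata = [
    ([2, 1, 0, 3], [1, 2, 1, 1], [0, 1, 1, 1, 0]),
    ([4, 5, 0, 6], [1, 1, 2, 3], [1, 0, 1, 1, 1, 1, 1]),
]
buggy_reported = {0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [0, 1],
                  4: [1, 0], 5: [0, 1], 6: [0, 3]}

if __name__ == "__main__":
    print("fixed mismatches:", mismatches(fixed_outdata, fixed_reported))
    print("buggy mismatches:", mismatches(buggy_outdata, buggy_reported))
```

On the post-fix tables the sketch finds no mismatching wordids. On the tables from the original report it flags wordid 3, the case called out in the description, along with wordids 0, 4 and 5.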