[jira] [Comment Edited] (MADLIB-1201) Inconsistent lda output tables

Frank McQuillan (JIRA) Thu, 08 Feb 2018 14:23:34 -0800

    [ 
https://issues.apache.org/jira/browse/MADLIB-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357655#comment-16357655
 ]


Frank McQuillan edited comment on MADLIB-1201 at 2/8/18 10:22 PM:
------------------------------------------------------------------

I tested this and it seems to work fine now.  For the small example in the 
Jira description, I get:
{code:java}
madlib=# select * from my_outdata order by docid;
docid | wordcount |   words   |  counts   | topic_count | topic_assignment
-------+-----------+-----------+-----------+-------------+------------------
     0 |         5 | {1,3,0,2} | {2,1,1,1} | {2,3}       | {1,1,0,1,0}
     1 |         7 | {5,0,4,6} | {1,2,1,3} | {3,4}       | {0,1,0,0,1,1,1}
(2 rows){code}
and
{code:java}
madlib=# SELECT * FROM my_word_topic_count ORDER BY wordid;
wordid | topic_count
--------+-------------
      0 | {1,2}
      1 | {0,2}
      2 | {1,0}
      3 | {1,0}
      4 | {1,0}
      5 | {1,0}
      6 | {0,3}
(7 rows){code}
These two tables are consistent, as you can see by expanding the words vector 
according to counts and lining it up with topic_assignment:
{code:java}
doc 0
words (exp)       1 1 3 0 2
topic_assignment  1 1 0 1 0

doc 1
words (exp)       5 0 0 4 6 6 6 
topic_assignment  0 1 0 0 1 1 1 {code}
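
The same check can be done mechanically. Below is a minimal Python sketch, assuming the rows are copied by hand from the two psql outputs above; the helper name word_topic_counts is hypothetical and not part of MADlib. It expands each words vector by its counts, pairs every token with its topic assignment, and rebuilds the per-word topic counts for comparison with my_word_topic_count.

```python
from collections import defaultdict


def word_topic_counts(docs, num_topics):
    """Rebuild per-word topic counts from (words, counts, topic_assignment) rows.

    Hypothetical helper for illustration only; not a MADlib function.
    """
    totals = defaultdict(lambda: [0] * num_topics)
    for words, counts, topic_assignment in docs:
        # Expand the compressed (words, counts) pair into one entry per token.
        expanded = [w for w, c in zip(words, counts) for _ in range(c)]
        if len(expanded) != len(topic_assignment):
            raise ValueError("expanded words and topic_assignment differ in length")
        # Each token contributes one count to its assigned topic.
        for wordid, topic in zip(expanded, topic_assignment):
            totals[wordid][topic] += 1
    return dict(totals)


# Rows copied from my_outdata: (words, counts, topic_assignment).
my_outdata = [
    ([1, 3, 0, 2], [2, 1, 1, 1], [1, 1, 0, 1, 0]),
    ([5, 0, 4, 6], [1, 2, 1, 3], [0, 1, 0, 0, 1, 1, 1]),
]

# Rows copied from my_word_topic_count: wordid -> topic_count.
my_word_topic_count = {
    0: [1, 2], 1: [0, 2], 2: [1, 0], 3: [1, 0],
    4: [1, 0], 5: [1, 0], 6: [0, 3],
}

rebuilt = word_topic_counts(my_outdata, num_topics=2)
print(rebuilt == my_word_topic_count)
```

If the comparison is false for some model, the wordids whose rebuilt counts differ point at the rows where the two tables disagree.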
 



> Inconsistent lda output tables
> ------------------------------
>
>                 Key: MADLIB-1201
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1201
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Parallel Latent Dirichlet Allocation
>            Reporter: Jingyi Mei
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.14
>
>
> We found an inconsistency in the LDA module between the outputs of lda_train 
> and lda_get_word_topic_count. 
> Repro Steps
> {code}
> DROP TABLE IF EXISTS documents;
> CREATE TABLE documents(docid INT4, contents TEXT);
> INSERT INTO documents VALUES
> (0, ' b a a c'),
> (1, ' d e f f f ');
> ALTER TABLE documents ADD COLUMN words TEXT[];
> UPDATE documents SET words = regexp_split_to_array(lower(contents), 
> E'[\\s+\\.\\,]');
> DROP TABLE IF EXISTS my_training, my_training_vocabulary;
> SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', 
> TRUE);
> DROP TABLE IF EXISTS my_model, my_outdata;
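> SELECT madlib.lda_train( 'my_training',  -- input term-frequency table
>                          'my_model',     -- output model table
>                          'my_outdata',   -- output per-document table
>                          7,              -- vocabulary size
>                          2,              -- number of topics
>                          1,              -- number of iterations
>                          5,              -- alpha (prior on per-document topics)
>                          0.01            -- beta (prior on per-topic words)
>                        );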
> select * from my_outdata order by docid;
> ```
>  docid | wordcount |   words   |  counts   | topic_count | topic_assignment
> -------+-----------+-----------+-----------+-------------+------------------
>      0 |         5 | {2,1,0,3} | {1,2,1,1} | {2,3}       | {0,1,1,1,0}
>      1 |         7 | {4,5,0,6} | {1,1,2,3} | {1,6}       | {1,0,1,1,1,1,1}
> ```
> DROP TABLE IF EXISTS my_word_topic_count;
> SELECT madlib.lda_get_word_topic_count( 'my_model', 'my_word_topic_count');
> SELECT * FROM my_word_topic_count ORDER BY wordid;
> ```
>  wordid | topic_count
> --------+-------------
>       0 | {1,2}
>       1 | {0,2}
>       2 | {1,0}
>       3 | {0,1}
>       4 | {1,0}
>       5 | {0,1}
>       6 | {0,3}
> (7 rows)
> ```
> {code}
> The output of 'my_outdata' indicates that wordid 3 is assigned only to 
> topic 0, but the output of my_word_topic_count indicates that wordid 3 is 
> assigned only to topic 1. The two outputs are inconsistent with each other.



