[ 
https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357667#comment-16357667
 ] 

Frank McQuillan commented on MADLIB-1160:
-----------------------------------------

The helper function madlib.lda_get_word_topic_mapping works fine:
{code}
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, ' b a a c'),
(1, ' d e f f f ');

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents), 
E'[\\s+\\.\\,]');

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', 
TRUE);


DROP TABLE IF EXISTS my_model, my_outdata;
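SELECT madlib.lda_train( 'my_training',   -- data table (output of term_frequency)
                         'my_model',      -- model table
                         'my_outdata',    -- output data table
                         7,               -- vocabulary size
                         2,               -- number of topics
                         1,               -- number of iterations
                         5,               -- alpha (Dirichlet prior, per-document topics)
                         0.01             -- beta (Dirichlet prior, per-topic words)
                       );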

DROP TABLE IF EXISTS helper_output_table;
SELECT madlib.lda_get_word_topic_mapping('my_outdata', 'helper_output_table');
SELECT * FROM helper_output_table ORDER BY docid;
{code}
produces
{code}
 docid | wordid | topicid 
-------+--------+---------
     0 |      1 |       1
     0 |      2 |       0
     0 |      3 |       0
     0 |      0 |       1
     1 |      0 |       1
     1 |      4 |       0
     1 |      5 |       0
     1 |      6 |       1
     1 |      0 |       0
(9 rows)
{code}
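
To read the mapping by word rather than by word ID, the helper output can be joined to the vocabulary table that term_frequency created above. This is only a sketch: it assumes my_training_vocabulary has the columns wordid and word, which is not shown in this comment.
{code}
-- Assumption: my_training_vocabulary(wordid, word) was created by
-- madlib.term_frequency(..., TRUE) in the script above.
SELECT h.docid, h.wordid, v.word, h.topicid
FROM helper_output_table h
JOIN my_training_vocabulary v USING (wordid)
ORDER BY h.docid, h.wordid;
{code}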

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
>  
> [http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E]
> Tasks
> 1) Term frequency
>  [http://madlib.apache.org/docs/latest/group__grp__text__utilities.html]
>  and LDA
>  [http://madlib.apache.org/docs/latest/group__grp__lda.html]
>  should both create indexes that start at 1, to make them consistent with 
> other MADlib modules. One or both of these currently create indexes starting 
> at 0.
> 2) In the output_data_table *topic_assignment* is a dense vector but *words* 
> is a sparse vector (svec).
>  We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to 
> keep the model state as small as possible, so sparse format is preferred to 
> dense format in this case, although svecs are a bit harder to work with. We 
> have hit the Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper 
> functions are used in the examples but not described above.
> 4) The helper function `madlib.lda_get_topic_desc` should return top k words 
> (and ties).  It seems to be returning the top k-1 words (and ties) now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
