[ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357667#comment-16357667 ]
Frank McQuillan commented on MADLIB-1160:
-----------------------------------------

Helper function works fine:

{code}
DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);
INSERT INTO documents VALUES
(0, ' b a a c'),
(1, ' d e f f f ');

ALTER TABLE documents ADD COLUMN words TEXT[];
UPDATE documents SET words = regexp_split_to_array(lower(contents), E'[\\s+\\.\\,]');

DROP TABLE IF EXISTS my_training, my_training_vocabulary;
SELECT madlib.term_frequency('documents', 'docid', 'words', 'my_training', TRUE);

DROP TABLE IF EXISTS my_model, my_outdata;
-- voc_size = 7, topic_num = 2, iter_num = 1, alpha = 5, beta = 0.01
SELECT madlib.lda_train('my_training', 'my_model', 'my_outdata', 7, 2, 1, 5, 0.01);

DROP TABLE IF EXISTS helper_output_table;
SELECT madlib.lda_get_word_topic_mapping('my_outdata', 'helper_output_table');
SELECT * FROM helper_output_table ORDER BY docid;
{code}

produces

{code}
 docid | wordid | topicid
-------+--------+---------
     0 |      1 |       1
     0 |      2 |       0
     0 |      3 |       0
     0 |      0 |       1
     1 |      0 |       1
     1 |      4 |       0
     1 |      5 |       0
     1 |      6 |       1
     1 |      0 |       0
(9 rows)
{code}

> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>            Priority: Minor
>             Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list:
> [http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E]
>
> Tasks
> 1) Term frequency
> [http://madlib.apache.org/docs/latest/group__grp__text__utilities.html]
> and LDA
> [http://madlib.apache.org/docs/latest/group__grp__lda.html]
> should both create indexes that start at 1, to make them consistent with
> other MADlib modules. One or both of these currently create indexes starting
> at 0.
> 2) In the output_data_table, *topic_assignment* is a dense vector but *words*
> is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to
> keep the model state as small as possible, so the sparse format is preferred
> to the dense format in this case, although svecs are a bit harder to work
> with. We have hit the Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper
> functions are used in the examples but not described above.
> 4) The helper function `madlib.lda_get_topic_desc` should return the top k
> words (and ties). It seems to be returning the top k-1 words (and ties) now.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)