[
https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336558#comment-16336558
]
Jingyi Mei commented on MADLIB-1160:
------------------------------------
For task 1): currently in term_frequency, `wordid` starts at 0. In lda,
`wordid` and `topicid` in topic_assignment start at 0 too. However, in
madlib.lda_get_topic_desc, topic id starts at 1. We will make them all
consistently 1-based.
For task 2): after inspecting, we found it is not possible to make
topic_assignment a sparse vector: the same word in one doc can be assigned to
different topic ids, so topic_assignment has to keep a one-to-one match with
the word occurrences. At the same time, it is not necessary to make `words` a
dense vector either, since that consumes space. As a result, we cannot make
`words` and `topic_assignment` the same kind of vector.
For the case the user mentioned in the mailing list, matching each wordid with
its topic id, we propose to create a helper function `lda_get_word_and_topic`,
which will list docid, wordid and the corresponding topic_id in a table.
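To illustrate the matching described above, here is a minimal Python sketch of
what such a helper would have to do. It is not the MADlib implementation: the
function names `expand_counts` and `word_and_topic` are hypothetical, and the
sparse `words` vector is assumed to be given as two parallel lists (word ids
and their counts) rather than as an actual svec.

```python
from typing import List, Tuple


def expand_counts(word_ids: List[int], counts: List[int]) -> List[int]:
    """Expand (wordid, count) pairs into one wordid per word occurrence."""
    expanded: List[int] = []
    for wid, count in zip(word_ids, counts):
        expanded.extend([wid] * count)
    return expanded


def word_and_topic(
    docid: int,
    word_ids: List[int],
    counts: List[int],
    topic_assignment: List[int],
) -> List[Tuple[int, int, int]]:
    """Return (docid, wordid, topicid) rows, one per word occurrence.

    Assumption: topic_assignment is dense and ordered the same way as the
    expanded word list, so the two can be matched position by position.
    """
    tokens = expand_counts(word_ids, counts)
    if len(tokens) != len(topic_assignment):
        raise ValueError("topic_assignment must have one entry per word occurrence")
    return [(docid, wid, tid) for wid, tid in zip(tokens, topic_assignment)]


# Hypothetical doc 1: word 3 occurs twice, word 7 once (1-based ids).
rows = word_and_topic(1, [3, 7], [2, 1], [1, 2, 1])
for row in rows:
    print(row)
```

In this made-up example the two occurrences of word 3 land in different topics,
which is why topic_assignment cannot be compressed per word the way `words` is.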
For task 3): we will change the docs to reflect tasks 1) and 2).
Also created a phase 2 Jira for further documentation/interface changes here:
https://issues.apache.org/jira/browse/MADLIB-1199
> Usability changes for LDA
> -------------------------
>
> Key: MADLIB-1160
> URL: https://issues.apache.org/jira/browse/MADLIB-1160
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> Tasks
> 1) Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, to make them consistent with
> other MADlib modules. One or both of these currently create indexes starting
> at 0.
> 2) In the output_data_table *topic_assignment* is a dense vector but
> *words* is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to
> keep the model state as small as possible, so it is preferred to dense format
> in this case, although svecs are a bit harder to work with. We have hit the
> Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper
> functions are used in the examples but not described above.