[
https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347538#comment-16347538
]
Jingyi Mei commented on MADLIB-1160:
------------------------------------
[~fmcquillan] Yes. The code at this line
[https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/lda/lda.py_in#L615]
ranks rows by prob and then keeps those with rank < top_k. Since rank starts at 1,
the filter should be rank <= top_k to return the top k records.
A further question: should we use dense rank instead of rank to get the top k
words here? Which kind of rank makes more sense? The difference is explained here:
http://www.sql-tutorial.ru/en/book_rank_dense_rank_functions.html
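To make the two questions concrete, here is a small Python sketch (not the MADlib code itself) that mimics SQL rank() and dense_rank() over a descending ordering and applies both the strict and the inclusive filter. The function names and the sample words and probabilities are made up for illustration.

```python
def sql_rank(values):
    """Mimic rank() OVER (ORDER BY value DESC): ties share a rank, gaps follow ties."""
    ordered = sorted(values, reverse=True)
    # index() returns the first position of a value, so tied values get the same rank
    return [ordered.index(v) + 1 for v in values]


def sql_dense_rank(values):
    """Mimic dense_rank() OVER (ORDER BY value DESC): ties share a rank, no gaps."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]


def top_k(words, probs, k, rank_fn, inclusive=True):
    """Keep words whose rank passes the filter: rank <= k (inclusive) or rank < k."""
    ranks = rank_fn(probs)
    if inclusive:
        return [w for w, r in zip(words, ranks) if r <= k]
    return [w for w, r in zip(words, ranks) if r < k]


# Hypothetical sample data with a tie at 0.3
words = ["alpha", "beta", "gamma", "delta"]
probs = [0.4, 0.3, 0.3, 0.1]

strict = top_k(words, probs, 2, sql_rank, inclusive=False)   # rank < top_k
fixed = top_k(words, probs, 2, sql_rank, inclusive=True)     # rank <= top_k
with_rank = top_k(words, probs, 3, sql_rank)
with_dense = top_k(words, probs, 3, sql_dense_rank)
```

With this sample, the strict filter at top_k = 2 returns only "alpha", while the inclusive filter returns "alpha", "beta" and "gamma" because of the tie. At top_k = 3, rank excludes "delta" (its rank is 4 after the tie), whereas dense rank includes it (its rank is 3), so dense rank can return more than k rows when ties occur.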
> Usability changes for LDA
> -------------------------
>
> Key: MADLIB-1160
> URL: https://issues.apache.org/jira/browse/MADLIB-1160
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Utilities
> Reporter: Frank McQuillan
> Assignee: Jingyi Mei
> Priority: Minor
> Fix For: v1.14
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> Tasks
> 1) Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, to make them consistent with
> other MADlib modules. One or both of these currently create indexes starting
> at 0.
> 2) In the output_data_table *topic_assignment* is a dense vector but
> *words* is a sparse vector (svec).
> We should change *topic_assignment* to be a sparse vector to be consistent.
> Note: the reason sparse vectors were used in the first place (I think) is to
> keep the model state as small as possible, so it is preferred to dense format
> in this case, although svecs are a bit harder to work with. We have hit the
> Postgres 1GB field size limit in some use cases.
> 3) The user docs could also use some cleanup at the same time. E.g., helper
> functions are used in the examples but not described above.