[jira] [Updated] (MADLIB-1160) Usability changes for LDA

Frank McQuillan (JIRA) Fri, 22 Sep 2017 12:49:53 -0700

     [ https://issues.apache.org/jira/browse/MADLIB-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Frank McQuillan updated MADLIB-1160:
------------------------------------
    Description: 
Context

Please see this thread from the user mailing list
http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E

1)  Term frequency
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
and LDA
http://madlib.apache.org/docs/latest/group__grp__lda.html
should both create indexes that start at 1, which makes them consistent with 
other MADlib modules.

2)  In the output_data_table, *topic_assignment* is a dense vector but *words* 
is a sparse vector (svec).
We should change *topic_assignment* to a sparse vector as well, for consistency.

Note:  the reason sparse vectors are used (I think) is to keep the model state 
as small as possible, so the sparse format is preferred to the dense format in 
this case, although svecs are a bit harder to work with.
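
To illustrate the size trade-off in the note above, here is a minimal sketch 
in plain Python (not MADlib's API) of the run-length idea behind svec, 
assuming the usual {counts}:{values} text form. The helper names 
to_run_length, to_dense and format_svec are hypothetical, and the sample 
topic_assignment values are made up for illustration.

```python
from itertools import groupby


def to_run_length(dense):
    """Collapse a dense list into parallel (counts, values) lists,
    one entry per run of equal neighbouring elements."""
    counts, values = [], []
    for value, run in groupby(dense):
        counts.append(len(list(run)))
        values.append(value)
    return counts, values


def to_dense(counts, values):
    """Expand (counts, values) runs back into the dense list."""
    dense = []
    for count, value in zip(counts, values):
        dense.extend([value] * count)
    return dense


def format_svec(counts, values):
    """Render the runs in a {counts}:{values} text form (assumed layout)."""
    return "{%s}:{%s}" % (
        ",".join(str(c) for c in counts),
        ",".join(str(v) for v in values),
    )


# Made-up dense topic assignments for six word positions in one document.
topic_assignment = [2, 2, 2, 5, 5, 1]
counts, values = to_run_length(topic_assignment)
print(format_svec(counts, values))
```

The encoding only saves space when neighbouring entries repeat, and reading 
element i requires walking the runs, which is one reason svecs are harder to 
work with than dense arrays.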

  was:
Context

Please see this thread from the user mailing list
http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E

Currently term frequency
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
creates indexes that start at 0 (e.g., docid)
whereas LDA
http://madlib.apache.org/docs/latest/group__grp__lda.html
creates indexes that start at 1 (e.g., topicid)

Since these are often used together, they should be consistent.  Recommend 
changing term frequency to start at 1.

Setting the fix version to 2.0 in case this is a breaking change for upgrading 
models.


> Usability changes for LDA
> -------------------------
>
>                 Key: MADLIB-1160
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1160
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> Context
> Please see this thread from the user mailing list
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201709.mbox/%3CCA%2B9JwyW78-aoe-NCQZc_iMuqW6SpKXs0H4JeTMfo3b-G4cxm0w%40mail.gmail.com%3E
> 1)  Term frequency
> http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
> and LDA
> http://madlib.apache.org/docs/latest/group__grp__lda.html
> should both create indexes that start at 1, which makes them consistent with 
> other MADlib modules.
> 2)  In the output_data_table, *topic_assignment* is a dense vector but 
> *words* is a sparse vector (svec).
> We should change *topic_assignment* to a sparse vector as well, for 
> consistency.
> Note:  the reason sparse vectors are used (I think) is to keep the model 
> state as small as possible, so the sparse format is preferred to the dense 
> format in this case, although svecs are a bit harder to work with.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
