[
https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13037784#comment-13037784
]
Sean Owen commented on MAHOUT-683:
----------------------------------
Question: do I read correctly from the commit log that MAHOUT-682 and
MAHOUT-683 are fixed, and for 0.5?
> LDA Vectorization
> -----------------
>
> Key: MAHOUT-683
> URL: https://issues.apache.org/jira/browse/MAHOUT-683
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Vasil Vasilev
> Priority: Minor
> Labels: LDA., Vectorization
> Attachments: MAHOUT-683.patch
>
>
> Currently, the result of the LDA clustering algorithm is a state that
> describes the probability that words in a corpus of documents belong to
> given topics. This probability is calculated for the whole corpus.
> It is also interesting, however, to know the average number of words in a
> given document that come from a given topic. This information comes from the
> gamma vector in the LDA inference process. This vector can be used as a
> representation of the given document for further clustering (using
> algorithms like KMeans, Dirichlet, etc.). In this manner, the dimensionality
> of a document is reduced to the number of topics specified for the LDA
> clustering algorithm.
> The proposed implementation takes a corpus of documents described as
> vectors, together with the last state of the LDA inference process, and
> produces a set of reduced-dimension vectors (one vector per document) that
> represents the set of documents.
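
For illustration only, here is a minimal sketch of the idea in the description
above: each document's gamma vector (one entry per topic) is turned into a
low-dimensional document vector. It is not taken from MAHOUT-683.patch. The
function names, the document ids and the gamma values are all hypothetical,
and normalizing gamma to topic proportions is an assumption about how the
vector might be prepared before KMeans or Dirichlet clustering.

```python
from typing import Dict, List


def normalize_gamma(gamma: List[float]) -> List[float]:
    """Scale a per-document gamma vector so that its entries sum to 1.

    Under the variational Dirichlet with parameter gamma, the expected
    proportion of topic k is gamma[k] / sum(gamma).
    """
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


def reduce_corpus(gammas: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Map every document id to a vector with one dimension per topic."""
    return {doc_id: normalize_gamma(g) for doc_id, g in gammas.items()}


# Hypothetical gamma vectors for two documents under a 3-topic model.
doc_gammas = {
    "doc-1": [8.0, 1.0, 1.0],
    "doc-2": [1.0, 1.0, 6.0],
}

# Each document is now described by 3 numbers, regardless of vocabulary size.
reduced = reduce_corpus(doc_gammas)
```

The reduced vectors have as many dimensions as there are topics, so they can
be passed to a downstream clustering step in place of the original term
vectors.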
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira