[
https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026316#comment-13026316
]
Jake Mannix commented on MAHOUT-683:
------------------------------------
How does this compare to what is in the latest patch in MAHOUT-458?
> LDA Vectorization
> -----------------
>
> Key: MAHOUT-683
> URL: https://issues.apache.org/jira/browse/MAHOUT-683
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Vasil Vasilev
> Priority: Minor
> Labels: LDA., Vectorization
> Attachments: MAHOUT-683.patch
>
>
> Currently the result of the LDA clustering algorithm is a state that
> describes the probability that each word in a corpus of documents belongs
> to a given topic. This probability is calculated over the whole corpus.
> It is also useful, however, to know the average number of words in a given
> document that come from a given topic. This information is provided by the
> gamma vector in the LDA inference process. This vector can be used as a
> representation of the document for further clustering (using algorithms
> such as KMeans, Dirichlet, etc.). In this way the dimensionality of a
> document is reduced to the number of topics specified for the LDA
> clustering algorithm.
> With the proposed implementation, a corpus of documents described as
> vectors and the last state of the LDA inference process are used to produce
> a set of reduced-dimension vectors (one vector per document) that
> represents the set of documents.
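
The gamma vector described above can be illustrated with a minimal sketch of the standard variational update for a single document, assuming the topic-word probabilities are already known from a finished LDA run. This is not the code from the MAHOUT-683 patch and it does not use Mahout's API: the class name LdaGammaSketch, the method inferGamma, and all parameters are hypothetical and chosen only for illustration. The digamma function is approximated by hand because the Java standard library does not provide one.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Mahout's implementation.
public class LdaGammaSketch {

  // Digamma for x > 0: shift x upward with the recurrence
  // psi(x) = psi(x + 1) - 1/x, then apply the asymptotic expansion.
  static double digamma(double x) {
    double result = 0.0;
    while (x < 6.0) {
      result -= 1.0 / x;
      x += 1.0;
    }
    double inv = 1.0 / x;
    double inv2 = inv * inv;
    return result + Math.log(x) - 0.5 * inv
        - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252));
  }

  // counts[w]       : occurrences of word w in the document
  // topicWord[k][w] : probability of word w under topic k (assumed given)
  // alpha           : symmetric Dirichlet prior on topic proportions
  // Returns gamma, one entry per topic; gamma[k] - alpha is the expected
  // number of words in the document that come from topic k.
  static double[] inferGamma(double[] counts, double[][] topicWord,
                             double alpha, int iterations) {
    int numTopics = topicWord.length;
    double total = 0.0;
    for (double c : counts) {
      total += c;
    }
    double[] gamma = new double[numTopics];
    Arrays.fill(gamma, alpha + total / numTopics);

    for (int it = 0; it < iterations; it++) {
      double[] next = new double[numTopics];
      Arrays.fill(next, alpha);
      for (int w = 0; w < counts.length; w++) {
        if (counts[w] == 0.0) {
          continue;
        }
        // phi[k] is proportional to p(w | k) * exp(digamma(gamma[k]))
        double[] phi = new double[numTopics];
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          phi[k] = topicWord[k][w] * Math.exp(digamma(gamma[k]));
          norm += phi[k];
        }
        if (norm == 0.0) {
          continue; // word has zero probability under every topic
        }
        for (int k = 0; k < numTopics; k++) {
          next[k] += counts[w] * phi[k] / norm;
        }
      }
      gamma = next;
    }
    return gamma;
  }

  public static void main(String[] args) {
    // Toy model: 2 topics over a 4-word vocabulary.
    double[][] topicWord = {
        {0.5, 0.5, 0.0, 0.0},
        {0.0, 0.0, 0.5, 0.5}
    };
    // The document only uses words from topic 0's vocabulary, so gamma[0]
    // ends up near alpha + 4 and gamma[1] stays at alpha.
    double[] counts = {3, 1, 0, 0};
    double[] gamma = inferGamma(counts, topicWord, 0.1, 20);
    System.out.println(Arrays.toString(gamma));
  }
}
```

The returned array has one entry per topic, so a document with a vocabulary-sized vector is reduced to a vector whose length equals the number of topics, which can then be passed to a clustering algorithm such as KMeans.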