[ 
https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix resolved MAHOUT-683.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5

You do indeed.  In the output directory of LDA, there should be a directory 
containing all the state-<num> intermediate states, and also a docTopics 
sequence file directory which contains the projection of the documents onto 
each topic.

> LDA Vectorization
> -----------------
>
>                 Key: MAHOUT-683
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-683
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA., Vectorization
>             Fix For: 0.5
>
>         Attachments: MAHOUT-683.patch
>
>
> Currently, the result of the LDA clustering algorithm is a state that 
> describes the probability that each word in a corpus of documents belongs 
> to a given topic. This probability is calculated over the whole corpus.
> It is also interesting, however, to know the average number of words in a 
> given document that come from a given topic. This information comes from 
> the gamma vector in the LDA inference process. This vector can be used as 
> a representation of the document for further clustering (using 
> algorithms like KMeans, Dirichlet, etc.). In this way the dimensionality 
> of a document is reduced to the number of topics specified to the LDA 
> clustering algorithm.
> The proposed implementation takes a corpus of documents described as 
> vectors and the last state of the LDA inference process, and produces a 
> set of reduced-dimension vectors (one vector per document) that represents 
> the set of documents.
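
To make the dimension reduction described above concrete, here is a minimal 
sketch in plain Python. It is not the code from MAHOUT-683.patch. It assumes 
each document already has a gamma vector with one non-negative entry per 
topic, and it normalizes that vector so the entries sum to 1. The function 
name, the document ids, and the gamma values are all hypothetical, and the 
normalization step is an assumption of this sketch rather than something the 
issue specifies.

```python
def topic_proportions(gamma):
    """Scale a per-document gamma vector so that its entries sum to 1.

    Assumption of this sketch: gamma has one non-negative entry per topic.
    """
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


# Hypothetical gamma vectors for three documents with K = 3 topics.
# In practice these would come from the LDA inference step.
doc_gammas = {
    "doc-0": [8.0, 1.0, 1.0],
    "doc-1": [0.5, 0.5, 9.0],
    "doc-2": [2.0, 2.0, 4.0],
}

# Each document is now a K-dimensional vector, regardless of vocabulary size.
reduced = {doc_id: topic_proportions(g) for doc_id, g in doc_gammas.items()}

for doc_id, vec in reduced.items():
    print(doc_id, vec)
```

Vectors of this shape could then be passed to a clustering algorithm such as 
KMeans, which is the follow-on use the description has in mind.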

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
