Re: LDA related enhancements

Grant Ingersoll Wed, 20 Apr 2011 22:52:59 -0700

On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:

> Hi Mahouters,
> 
> I was experimenting with the LDA clustering algorithm on the Reuters data
> set and I made several enhancements, which I could contribute to the
> project if you find them interesting:
> 
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors and not
> the tf-idf ones which result from seq2sparse. Due to this fact, words like
> "and", "where", etc. also get included in the resulting topics. To prevent
> that I run seq2sparse with the whole tf-idf sequence and then run the
> "pruner". It first calculates the standard deviation of the document
> frequencies of the words and then prunes all entries in the tf vectors whose
> document frequency is greater than 3 times the calculated standard deviation.
> This keeps most of the word population while still pruning the
> unnecessary ones.
> 
> 2. Implemented the alpha-estimation part of the LDA algorithm as described
> in the Blei, Ng, Jordan paper. This leads to better results in maximizing
> the log-likelihood for the same number of iterations. Just an example: for
> 20 iterations on the Reuters data set the enhanced algorithm reaches a value
> of -6975124.693072233, compared to -7304552.275676554 with the original
> implementation.
> 
> 3. Created an LDA Vectorizer. It executes only the inference part of the LDA
> algorithm based on the last LDA state and the input document vectors, and for
> each vector produces a vector of the gammas that are the result of the
> inference. The idea is that the vectors produced in this way can be used for
> clustering with any of the existing algorithms (like canopy, kmeans, etc.)
>
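For readers following the thread, the pruning rule in item 1 can be sketched roughly as below. This is only an illustration in Python, not the proposed Mahout code: the function names, the dict-based vector representation, and the use of the population standard deviation are all assumptions. The threshold is taken literally from the description (document frequency greater than 3 times the standard deviation); a "mean plus 3 standard deviations" variant would be a one-line change.

```python
from statistics import pstdev


def document_frequencies(tf_vectors):
    """Count, for each term, the number of documents in which it occurs."""
    df = {}
    for vec in tf_vectors:
        for term, count in vec.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1
    return df


def prune_tf_vectors(tf_vectors, num_std=3.0):
    """Drop every term whose document frequency exceeds num_std * std(df).

    tf_vectors is a list of {term: count} dicts (a hypothetical stand-in
    for the sparse tf vectors that seq2sparse produces).
    """
    df = document_frequencies(tf_vectors)
    if not df:
        return [dict(vec) for vec in tf_vectors]
    # Population standard deviation over all terms' document frequencies.
    threshold = num_std * pstdev(list(df.values()))
    keep = {term for term, freq in df.items() if freq <= threshold}
    return [{t: c for t, c in vec.items() if t in keep} for vec in tf_vectors]


if __name__ == "__main__":
    # Ten toy documents: "and" occurs in all of them, every other term once.
    docs = [{"and": 2, f"w{i}": 1} for i in range(10)]
    print(prune_tf_vectors(docs)[0])
```

In the toy corpus, "and" has a document frequency of 10 while every other term has 1, so only "and" crosses the threshold and is removed.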


As Jake says, this all sounds great.  Please see: 
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
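
The inference-only step behind the "LDA Vectorizer" in item 3 can be illustrated with the standard variational updates from the Blei, Ng, Jordan paper: phi for each word is proportional to beta times exp(digamma(gamma)), and gamma is alpha plus the count-weighted sum of phi. The sketch below is in Python and is not Mahout's API; the function names, the list-of-lists topic-word matrix `beta`, and the single symmetric `alpha` are assumptions made for the example.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def infer_gamma(doc, beta, alpha, max_iter=100, tol=1e-6):
    """Return the variational gamma vector for one document.

    doc   -- {term_id: count} term-frequency vector (hypothetical format)
    beta  -- beta[k][term_id] = p(term | topic k), taken from a trained model
    alpha -- symmetric Dirichlet prior (an assumption of this sketch)
    """
    k = len(beta)
    total = sum(doc.values())
    gamma = [alpha + total / k] * k
    for _ in range(max_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * k
        for term, count in doc.items():
            phi = [beta[t][term] * weights[t] for t in range(k)]
            norm = sum(phi)
            if norm == 0.0:
                continue  # term has zero probability under every topic
            for t in range(k):
                new_gamma[t] += count * phi[t] / norm
        delta = max(abs(a - b) for a, b in zip(gamma, new_gamma))
        gamma = new_gamma
        if delta < tol:
            break
    return gamma


if __name__ == "__main__":
    beta = [[0.9, 0.1], [0.1, 0.9]]  # two toy topics over two terms
    print(infer_gamma({0: 5}, beta, alpha=0.1))
```

Each document's gamma vector has one entry per topic, so the output can be fed to a clustering algorithm as a low-dimensional document representation, which is the use the original message describes.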
