Ok. I am going to try out 1) suggested by Jake, then write a couple of
tests, and then I will file the JIRAs.

On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>
> > Hi Mahouters,
> >
> > I was experimenting with the LDA clustering algorithm on the Reuters
> > data set and made several enhancements which, if you find them
> > interesting, I could contribute to the project:
> >
> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors,
> > not the tf-idf ones that result from seq2sparse. Because of this,
> > words like "and", "where", etc. also end up in the resulting topics.
> > To prevent that, I run seq2sparse over the whole tf-idf sequence and
> > then run the "pruner". It first calculates the standard deviation of
> > the words' document frequencies and then prunes every entry in the tf
> > vectors whose document frequency is bigger than 3 times that standard
> > deviation. This keeps most of the word population while still pruning
> > the unnecessary words (see the first sketch below the quoted text).
> >
> > 2. Implemented the alpha-estimation part of the LDA algorithm as
> > described in the Blei, Ng, Jordan paper. This gives better results in
> > maximizing the log-likelihood for the same number of iterations. As
> > an example, for 20 iterations on the Reuters data set the enhanced
> > algorithm reaches a value of -6975124.693072233, compared to
> > -7304552.275676554 with the original implementation (see the second
> > sketch below).
> >
> > 3. Created an LDA vectorizer. It executes only the inference part of
> > the LDA algorithm, based on the last LDA state and the input document
> > vectors, and for each input vector produces a vector of the gammas
> > that result from the inference. The idea is that vectors produced in
> > this way can be used for clustering with any of the existing
> > algorithms (Canopy, k-Means, etc.; see the third sketch below).
> >
>
> As Jake says, this all sounds great.  Please see:
> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>
>
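
To make the first enhancement concrete, here is a minimal sketch of the
pruning step. It assumes a plain Map-based sparse representation instead
of Mahout's Vector classes, and the class and method names are
illustrative only, not part of the actual patch:

    import java.util.Iterator;
    import java.util.Map;

    // Illustrative sketch: prune tf-vector entries whose document
    // frequency exceeds 3 times the standard deviation of all document
    // frequencies. A Map stands in for Mahout's sparse Vector here.
    public class TfPrunerSketch {

      public static void prune(Map<Integer, Double> tfVector,
                               Map<Integer, Integer> docFreqs) {
        // Mean and standard deviation of the document frequencies.
        double sum = 0.0;
        double sumSq = 0.0;
        for (int df : docFreqs.values()) {
          sum += df;
          sumSq += (double) df * df;
        }
        int n = docFreqs.size();
        double mean = sum / n;
        double stdDev = Math.sqrt(sumSq / n - mean * mean);

        // Drop every term whose df is bigger than 3 * stdDev.
        double threshold = 3.0 * stdDev;
        Iterator<Map.Entry<Integer, Double>> it =
            tfVector.entrySet().iterator();
        while (it.hasNext()) {
          Integer df = docFreqs.get(it.next().getKey());
          if (df != null && df > threshold) {
            it.remove();
          }
        }
      }
    }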
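
For the second enhancement, the alpha estimation follows the
Newton-Raphson update in log space from Appendix A.4.2 of the Blei, Ng,
Jordan paper. A rough sketch, assuming commons-math for the digamma and
trigamma functions and a sufficient statistic collected during the
E-step; all names here are hypothetical:

    import org.apache.commons.math.special.Gamma;

    // Illustrative sketch of symmetric-alpha estimation by Newton-Raphson
    // in log space, per Appendix A.4.2 of Blei, Ng, Jordan (2003).
    // suffStats = sum over documents d and topics i of
    //   digamma(gamma_di) - digamma(sum_j gamma_dj),
    // collected during the E-step.
    public final class AlphaEstimatorSketch {

      public static double estimateAlpha(double suffStats, int numDocs,
                                         int numTopics) {
        double logAlpha = Math.log(0.1); // arbitrary starting point
        for (int iter = 0; iter < 100; iter++) {
          double alpha = Math.exp(logAlpha);
          // First derivative of the likelihood bound w.r.t. alpha.
          double df = numDocs * numTopics
              * (Gamma.digamma(numTopics * alpha) - Gamma.digamma(alpha))
              + suffStats;
          // Second derivative.
          double d2f = numDocs
              * (numTopics * (double) numTopics
                     * Gamma.trigamma(numTopics * alpha)
                 - numTopics * Gamma.trigamma(alpha));
          // Newton step on log(alpha) keeps alpha positive.
          logAlpha -= df / (d2f * alpha + df);
          if (Math.abs(df) < 1.0e-5) {
            break;
          }
        }
        return Math.exp(logAlpha);
      }
    }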
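
And a sketch of the vectorizer's inference-only step for a single
document. It assumes the trained topic-word log probabilities are
available as a dense array; again the names are illustrative, not the
actual Mahout API:

    import java.util.Arrays;
    import org.apache.commons.math.special.Gamma;

    // Illustrative sketch: given the trained topic-word log
    // probabilities (the last LDA state), run only the variational
    // E-step on one document and return its gamma vector.
    public final class LdaVectorizerSketch {

      public static double[] inferGammas(double[][] logBeta, // [topic][term]
                                         int[] termIds, double[] termCounts,
                                         double alpha) {
        int k = logBeta.length;
        double total = 0.0;
        for (double c : termCounts) {
          total += c;
        }
        double[] gamma = new double[k];
        Arrays.fill(gamma, alpha + total / k);

        // A fixed number of iterations stands in for a convergence test.
        for (int iter = 0; iter < 50; iter++) {
          double[] newGamma = new double[k];
          Arrays.fill(newGamma, alpha);
          for (int n = 0; n < termIds.length; n++) {
            // phi_nt proportional to exp(digamma(gamma_t)) * beta_t,w_n,
            // normalized in log space for numerical stability.
            double[] logPhi = new double[k];
            double max = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < k; t++) {
              logPhi[t] = Gamma.digamma(gamma[t]) + logBeta[t][termIds[n]];
              max = Math.max(max, logPhi[t]);
            }
            double norm = 0.0;
            for (int t = 0; t < k; t++) {
              norm += Math.exp(logPhi[t] - max);
            }
            for (int t = 0; t < k; t++) {
              newGamma[t] += termCounts[n] * Math.exp(logPhi[t] - max) / norm;
            }
          }
          gamma = newGamma;
        }
        return gamma;
      }
    }

The returned gamma vector is what would be written out per document and
then handed to Canopy, k-Means, or any of the other algorithms.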
