Vasil,

This sounds great!
On Wed, Apr 20, 2011 at 9:08 PM, Vasil Vasilev <[email protected]> wrote:

> Hi Mahouters,
>
> 1. Created a term-frequency vector pruner: LDA uses the tf vectors and not
> the tf-idf ones which result from seq2sparse. Due to this, words like
> "and", "where", etc. also get included in the resulting topics. To prevent
> that I run seq2sparse with the whole tf-idf sequence and then run the
> "pruner". It first calculates the standard deviation of the document
> frequencies of the words and then prunes all entries in the tf vectors
> whose document frequency is bigger than 3 times the calculated standard
> deviation. This keeps most of the word population while still pruning the
> unnecessary words.

If you could add this (optionally) to the general seq2sparse functionality,
it would be better than the minDf / maxDf way we currently do this.

> 2. Implemented the alpha-estimation part of the LDA algorithm as described
> in the Blei, Ng, Jordan paper. This leads to better results in maximizing
> the log-likelihood for the same number of iterations. Just an example: for
> 20 iterations on the Reuters data set the enhanced algorithm reaches a
> value of -6975124.693072233, compared to -7304552.275676554 with the
> original implementation.

Awesome.

> 3. Created an LDA Vectorizer. It executes only the inference part of the
> LDA algorithm, based on the last LDA state and the input document vectors,
> and for each vector produces a vector of the gammas that result from the
> inference. The idea is that the vectors produced in this way can be used
> for clustering with any of the existing algorithms (like canopy, kmeans,
> etc.)

Yeah, I've got code which does this too, and I keep meaning to clean it up
for submission, but if yours is ready to go, file a JIRA and submit a
patch! :) The gamma vector is totally helpful; it lets you do LSI-style
search as well.

-jake
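
A minimal sketch of the "3 sigma" pruning rule from item 1 might look like
the code below. The class name, the array-based df table, and the use of
Mahout's Vector.iterateNonZero() are illustrative assumptions, not Vasil's
actual patch.

    import java.util.Iterator;
    import org.apache.mahout.math.Vector;

    // Hypothetical pruner: drop tf entries whose document frequency
    // exceeds 3x the standard deviation of all document frequencies.
    public class HighDfPruner {

      // df[i] = number of documents containing term i.
      public static double dfThreshold(long[] df) {
        double mean = 0;
        for (long d : df) {
          mean += d;
        }
        mean /= df.length;
        double var = 0;
        for (long d : df) {
          var += (d - mean) * (d - mean);
        }
        // the "3 times the standard deviation" cutoff described above
        return 3.0 * Math.sqrt(var / df.length);
      }

      // Zero out entries of a tf vector whose term df exceeds the threshold.
      public static void prune(Vector tf, long[] df, double threshold) {
        Iterator<Vector.Element> it = tf.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          if (df[e.index()] > threshold) {
            e.set(0.0);  // prune overly common terms ("and", "where", ...)
          }
        }
      }
    }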
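For item 2, the alpha estimation in the Blei, Ng, Jordan paper (appendix
A.4.2) is a Newton-Raphson iteration on a single symmetric alpha. The sketch
below is a reconstruction of that standard update done in log space so alpha
stays positive, using Commons Math for digamma/trigamma; it is not the
submitted patch. suffStat is assumed to be sum over documents d and topics k
of digamma(gamma_dk) - digamma(sum_j gamma_dj), accumulated in the E-step.

    import org.apache.commons.math.special.Gamma;

    // numDocs = M, numTopics = K in the paper's notation.
    public static double optimizeAlpha(double suffStat, int numDocs, int numTopics) {
      double logAlpha = Math.log(0.1);  // arbitrary starting point
      for (int iter = 0; iter < 100; iter++) {
        double alpha = Math.exp(logAlpha);
        // first derivative of the alpha-dependent part of the variational bound
        double df = numDocs * (numTopics * Gamma.digamma(numTopics * alpha)
            - numTopics * Gamma.digamma(alpha)) + suffStat;
        // second derivative
        double d2f = numDocs * (numTopics * numTopics * Gamma.trigamma(numTopics * alpha)
            - numTopics * Gamma.trigamma(alpha));
        logAlpha -= df / (d2f * alpha + df);  // Newton step on log(alpha)
        if (Math.abs(df) < 1e-6) {
          break;
        }
      }
      return Math.exp(logAlpha);
    }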
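For item 3, the inference-only step amounts to running LDA's variational
E-step with the trained topic-word parameters held fixed and emitting the
resulting gamma vector for each document. The sketch below is generic (plain
arrays rather than Mahout's LDAState/LDAInference machinery): beta[k][w] is
p(word w | topic k) from the last LDA state, and wordIds/counts form the
document's sparse tf vector; all names are assumptions for illustration.

    import java.util.Arrays;
    import org.apache.commons.math.special.Gamma;

    public static double[] inferGamma(double[][] beta, int[] wordIds,
                                      double[] counts, double alpha, int maxIters) {
      int k = beta.length;
      double total = 0;
      for (double c : counts) {
        total += c;
      }
      double[] gamma = new double[k];
      Arrays.fill(gamma, alpha + total / k);  // standard initialization
      double[] phi = new double[k];
      for (int iter = 0; iter < maxIters; iter++) {
        double[] newGamma = new double[k];
        Arrays.fill(newGamma, alpha);
        for (int n = 0; n < wordIds.length; n++) {
          double norm = 0;
          for (int i = 0; i < k; i++) {
            // phi_ni proportional to beta_{i,w_n} * exp(digamma(gamma_i))
            phi[i] = beta[i][wordIds[n]] * Math.exp(Gamma.digamma(gamma[i]));
            norm += phi[i];
          }
          for (int i = 0; i < k; i++) {
            // gamma_i = alpha + sum_n count_n * phi_ni
            newGamma[i] += counts[n] * phi[i] / norm;
          }
        }
        gamma = newGamma;
      }
      return gamma;  // the per-document "gamma vector" Jake refers to
    }

The returned gamma vector (optionally normalized to sum to 1) then serves as
the document's feature vector for canopy, k-means, or LSI-style similarity
search.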
