Also, the topic regularization patch is ready:
https://issues.apache.org/jira/browse/MAHOUT-684

On Thu, Apr 28, 2011 at 10:53 AM, Vasil Vasilev <[email protected]> wrote:

> Hi all,
>
> The LDA Vectorization patch is ready. You can take a look at:
> https://issues.apache.org/jira/browse/MAHOUT-683
>
> Regards, Vasil
>
> On Thu, Apr 21, 2011 at 9:47 AM, Vasil Vasilev <[email protected]> wrote:
>
>> OK. I am going to try out 1) as suggested by Jake, then write a couple of
>> tests, and then I will file the JIRAs.
>>
>>
>> On Thu, Apr 21, 2011 at 8:52 AM, Grant Ingersoll <[email protected]> wrote:
>>
>>>
>>> On Apr 21, 2011, at 6:08 AM, Vasil Vasilev wrote:
>>>
>>> > Hi Mahouters,
>>> >
>>> > I was experimenting with the LDA clustering algorithm on the Reuters
>>> > data set and I made several enhancements which, if you find them
>>> > interesting, I could contribute to the project:
>>> >
>>> > 1. Created a term-frequency vector pruner: LDA uses the tf vectors and
>>> > not the tf-idf ones that result from seq2sparse, so words like "and",
>>> > "where", etc. also get included in the resulting topics. To prevent
>>> > that, I run seq2sparse over the whole tf-idf sequence and then run the
>>> > "pruner". It first calculates the standard deviation of the words'
>>> > document frequencies and then prunes all entries in the tf vectors
>>> > whose document frequency is bigger than 3 times that standard
>>> > deviation. This keeps most of the word population while still pruning
>>> > the unnecessary terms. (A sketch of this rule follows the list below.)
>>> >
>>> > 2. Implemented the alpha-estimation part of the LDA algorithm as
>>> > described in the Blei, Ng, and Jordan paper. This gives better results
>>> > when maximizing the log-likelihood for the same number of iterations.
>>> > As an example: after 20 iterations on the Reuters data set the enhanced
>>> > algorithm reaches a log-likelihood of -6975124.693072233, compared to
>>> > -7304552.275676554 with the original implementation. (The alpha update
>>> > is also sketched after the list.)
>>> >
>>> > 3. Created an LDA vectorizer. It executes only the inference part of
>>> > the LDA algorithm, based on the last LDA state and the input document
>>> > vectors, and for each input vector it produces a vector of the gammas
>>> > that result from the inference. The idea is that vectors produced this
>>> > way can be used for clustering with any of the existing algorithms
>>> > (canopy, k-means, etc.). (This inference step is also sketched below.)
>>> >
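
The pruning rule from item 1, as a minimal in-memory sketch. The class and
method names here are illustrative, not the actual patch, and the real pruner
works over the seq2sparse sequence files rather than in-memory maps:

    import java.util.HashMap;
    import java.util.Map;

    public class TfPruner {

      /** Population standard deviation of the document-frequency values. */
      public static double dfStdDev(Iterable<Integer> docFreqs) {
        double n = 0.0;
        double sum = 0.0;
        double sumSq = 0.0;
        for (int df : docFreqs) {
          n++;
          sum += df;
          sumSq += (double) df * df;
        }
        double mean = sum / n;
        return Math.sqrt(sumSq / n - mean * mean);
      }

      /**
       * Drops every tf entry whose term's document frequency exceeds
       * 3 times the standard deviation of the df distribution. Both maps
       * are keyed by term id; every term in the tf vector is assumed to
       * have a document-frequency entry.
       */
      public static Map<Integer, Double> prune(Map<Integer, Double> tfVector,
                                               Map<Integer, Integer> docFreq,
                                               double stdDev) {
        Map<Integer, Double> pruned = new HashMap<Integer, Double>();
        for (Map.Entry<Integer, Double> entry : tfVector.entrySet()) {
          if (docFreq.get(entry.getKey()) <= 3.0 * stdDev) {
            pruned.put(entry.getKey(), entry.getValue());
          }
        }
        return pruned;
      }
    }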
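
For item 2, a sketch of the symmetric-alpha Newton-Raphson update from the
Blei, Ng, and Jordan paper, written in the style of Blei's reference lda-c
code; it is not necessarily what the MAHOUT-684 patch does. Commons Math,
which Mahout already depends on, provides the digamma and trigamma functions:

    import org.apache.commons.math.special.Gamma;

    public final class AlphaEstimator {

      /**
       * @param sufficientStat sum over documents d and topics k of
       *                       digamma(gamma_dk) - digamma(sum_j gamma_dj)
       * @param numDocs        number of documents D
       * @param numTopics      number of topics K
       */
      public static double estimateAlpha(double sufficientStat, int numDocs,
                                         int numTopics, double initialAlpha,
                                         int maxIterations) {
        double logAlpha = Math.log(initialAlpha);
        for (int i = 0; i < maxIterations; i++) {
          double alpha = Math.exp(logAlpha);
          // first derivative of the likelihood bound with respect to alpha
          double d1 = numDocs * numTopics
              * (Gamma.digamma(numTopics * alpha) - Gamma.digamma(alpha))
              + sufficientStat;
          // second derivative; the Hessian collapses to a scalar here
          double d2 = numDocs * numTopics
              * (numTopics * Gamma.trigamma(numTopics * alpha)
                  - Gamma.trigamma(alpha));
          // Newton step in log space, which keeps alpha positive
          logAlpha -= d1 / (d2 * alpha + d1);
          if (Math.abs(d1) < 1e-5) {
            break;  // gradient is flat enough; treat as converged
          }
        }
        return Math.exp(logAlpha);
      }
    }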
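
And for item 3, an illustrative version of the per-document inference step the
vectorizer performs: with the topic-word probabilities frozen at the last LDA
state, iterate the variational updates and emit the resulting gamma vector.
All names are hypothetical, not the patch's API:

    import java.util.Arrays;

    import org.apache.commons.math.special.Gamma;

    public final class LdaDocVectorizer {

      /**
       * @param logPhi     [numTopics][numTerms] log p(term | topic) taken
       *                   from the last LDA state
       * @param termIds    distinct term ids occurring in the document
       * @param termCounts tf counts aligned with termIds
       */
      public static double[] inferGammas(double[][] logPhi, int[] termIds,
                                         double[] termCounts, double alpha,
                                         int iterations) {
        int k = logPhi.length;
        double totalCount = 0.0;
        for (double c : termCounts) {
          totalCount += c;
        }
        double[] gamma = new double[k];
        Arrays.fill(gamma, alpha + totalCount / k);  // standard initialization

        for (int it = 0; it < iterations; it++) {
          double[] digammaGamma = new double[k];
          for (int t = 0; t < k; t++) {
            digammaGamma[t] = Gamma.digamma(gamma[t]);
          }
          double[] newGamma = new double[k];
          Arrays.fill(newGamma, alpha);
          for (int n = 0; n < termIds.length; n++) {
            // phi_nk is proportional to p(w_n | k) * exp(digamma(gamma_k))
            double[] phi = new double[k];
            double norm = 0.0;
            for (int t = 0; t < k; t++) {
              phi[t] = Math.exp(logPhi[t][termIds[n]] + digammaGamma[t]);
              norm += phi[t];
            }
            for (int t = 0; t < k; t++) {
              newGamma[t] += termCounts[n] * phi[t] / norm;
            }
          }
          gamma = newGamma;
        }
        return gamma;
      }
    }

The returned gamma vector has one entry per topic, so canopy or k-means can
then cluster documents in this low-dimensional topic space instead of the
original term space.
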
>>>
>>> As Jake says, this all sounds great.  Please see:
>>> https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute
>>>
>>>
>>
>
