This is a difficult topic that is addressed in different ways in practice. The approaches I know of include:
a) Just pick a number that is probably big enough and go forward. 20, 30, 50, or 100 are all viable choices depending on the scale of your corpus; numbers as small as 5 might make sense for special-purpose cases such as voting histories.

b) Run a parameter sweep over the number of topics and look at the posterior likelihood of your corpus. This is pretty commonly done (a sketch of what such a sweep looks like is at the end of this message).

c) Move to a more advanced non-parametric Bayesian approach, where the learning algorithm basically does (b) in a single learning process. I haven't heard of anyone doing this in applied situations yet, but it is a very seductive goal.

Only (a) and (b) are viable in Mahout's implementation of LDA. Option (c) is implemented in our Dirichlet process clustering, but that is less powerful in some ways than LDA.

On Thu, Mar 4, 2010 at 6:56 AM, Claudio Martella <[email protected]> wrote:

> The documents span different topics and I don't know in advance
> (and would LOVE to avoid it) their number. Do you have any advice on a
> strategy to follow?

-- 
Ted Dunning, CTO DeepDyve
