This is a difficult topic that is addressed in different ways in practice. The approaches I know of include:
a) Just pick a number that is probably big enough and go forward. 20, 30, 50, or 100 are all viable choices depending on the scale of your corpus; numbers as small as 5 might make sense for special-purpose cases such as voting histories.

b) Run a parameter sweep over the number of topics and look at the posterior likelihood of your corpus. This is pretty commonly done (a sketch of what such a sweep looks like is at the end of this message).

c) Move to a more advanced non-parametric Bayesian approach, where the learning algorithm basically does (b) in a single learning process. I haven't heard of anyone doing this in applied situations yet, but it is a very seductive goal.

Only (a) and (b) are viable in Mahout's implementation of LDA. Option (c) is implemented in our Dirichlet process clustering, but that is less powerful in some ways than LDA.

On Thu, Mar 4, 2010 at 6:56 AM, Claudio Martella <[email protected]> wrote:

> The documents span different topics and I don't know in advance
> (and would LOVE to avoid it) their number. Do you have any advice on a
> strategy to follow?

-- 
Ted Dunning, CTO DeepDyve
