If you can get bigrams or trigrams indexed as single terms, then k-means clustering should work just fine. I would recommend including only n-grams that appear to be interesting textual units (what Amazon calls SIPs, statistically improbable phrases). There is another thread going on about how to find such interesting phrases. Beyond that, there is the issue of getting Lucene to take note of a dictionary of interesting n-grams.
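To make the "n-grams as single terms" idea concrete, here is a minimal sketch in plain Python, outside of Lucene and Mahout. It assumes you already have a dictionary of interesting phrases; the function name `merge_ngrams`, the `phrases` set, and the underscore separator are all made up for illustration and are not part of any Lucene or Mahout API.

```python
def merge_ngrams(tokens, phrases, max_n=3, sep="_"):
    """Greedily replace dictionary n-grams with single joined terms.

    tokens  -- list of unigram tokens, already lowercased and stop-filtered
    phrases -- set of tuples, each tuple being one interesting n-gram
               (hypothetical dictionary, built by whatever phrase finder you use)
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first so a trigram wins over its bigram prefix.
        for n in range(max_n, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                out.append(sep.join(candidate))
                i += n
                break
        else:
            # No dictionary phrase starts here; keep the unigram as-is.
            out.append(tokens[i])
            i += 1
    return out


# Example dictionary and document (illustrative data only).
phrases = {("k", "means"), ("stop", "words"), ("latent", "dirichlet", "allocation")}
tokens = "we ran k means after removing stop words".split()
terms = merge_ngrams(tokens, phrases)
print(terms)
```

The merged tokens can then be fed to whatever builds the term vectors, so that a phrase such as `k_means` is counted as one term. This version drops the component unigrams of a matched phrase; whether to keep them alongside the joined term is a judgment call that depends on the data.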
Grant, is there a Lucene analyzer that would do that?

On Wed, Jan 6, 2010 at 2:16 PM, Bogdan Vatkov <[email protected]> wrote:

> In some of the posts I saw something about n-grams, but I am not sure how
> I can get clustering with n-grams supported.
> I am currently running only k-means (I picked it more or less randomly - not
> sure which algorithm is best for my data) and I only get TopTerms as
> unigrams - can I get some clustering based on bigrams, trigrams, n-grams?
>
> Another question I have is which Mahout clustering algorithm is recommended
> for a large number of relatively small documents? (As I said, I use k-means
> more or less by accident - it is the first algorithm I could run with my
> data - I was focused on providing stop-words & stop-regex filtering to my
> input text vectors.)
>
> Best regards,
> Bogdan
>

--
Ted Dunning, CTO
DeepDyve
