If you can get bigrams or trigrams indexed as single terms, then k-means
clustering should work just fine.  I would recommend including only n-grams
that appear to be interesting textual units (what Amazon calls SIPs,
statistically improbable phrases).  There is another thread going about how
to find such interesting phrases.  Beyond that, there is the issue of
getting Lucene to take note of a dictionary of interesting n-grams.

Grant, is there a Lucene analyzer that would do that?
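
To make concrete what I mean by treating dictionary n-grams as single terms,
here is a rough sketch in plain Python.  It is not Lucene code and not an
existing analyzer; the function name merge_ngrams, the "_" separator and the
sample phrase dictionary are all made up for illustration.  It assumes the
text is already tokenized and simply replaces the longest dictionary phrase
found at each position with one joined term.

```python
def merge_ngrams(tokens, phrases, max_n=3, sep="_"):
    """Greedily replace dictionary n-grams with single joined terms.

    tokens  -- list of already-tokenized words
    phrases -- set of word tuples considered interesting (hypothetical input)
    max_n   -- longest n-gram length to look for
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first so trigrams win over bigrams.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append(sep.join(candidate))
                i += n
                break
        else:
            # No dictionary phrase starts here; keep the unigram.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical dictionary of interesting phrases.
phrases = {("support", "vector", "machine"), ("k", "means")}
tokens = "we compared a support vector machine with k means clustering".split()
print(merge_ngrams(tokens, phrases))
```

The merged tokens can then be counted like any other term when building the
document vectors, so the clustering itself does not need to know that some
terms started out as several words.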

On Wed, Jan 6, 2010 at 2:16 PM, Bogdan Vatkov <[email protected]> wrote:

> In some of the posts I saw something about n-grams, but I am not sure how I
> can get clustering with n-grams supported.
> I am currently running only k-means (I picked it more or less randomly -
> not sure which algorithm is best for my data) and I only get TopTerms as
> unigrams - can I get some clustering based on bigrams, trigrams, n-grams?
>
> Another question I have is which Mahout clustering algorithm is recommended
> for a large number of relatively small documents? (As I said, I use
> k-means more or less by accident - it is the first algorithm I could run
> with my data - I was focused on providing stop-word & stop-regex filtering
> for my input text vectors.)
>
> Best regards,
> Bogdan
>



-- 
Ted Dunning, CTO
DeepDyve
