In some of the posts I saw something about n-grams, but I am not sure how I can get clustering with n-gram support. I am currently running only k-means (I picked it more or less at random, and I am not sure which algorithm is best for my data) and I only get TopTerms as unigrams. Can I get clustering based on bigrams, trigrams, or n-grams in general?
My other question: which Mahout clustering algorithm is recommended for a large number of relatively small documents? (As I said, I use k-means more or less by accident. It was the first algorithm I could run with my data, because I was focused on providing stop-word and stop-regex filtering for my input text vectors.)

Best regards, Bogdan
