In some of the posts I saw n-grams mentioned, but I am not sure how I can
get clustering with n-grams supported.
I am currently running only k-means (I picked it more or less at random and am
not sure which algorithm is best for my data), and the TopTerms I get are all
unigrams. Can I get clustering based on bigrams, trigrams, or n-grams in general?
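
To make clear what I mean by n-gram terms, here is a minimal sketch in plain
Python. It is not Mahout code; the helper name `ngrams` and the sample
sentence are made up purely for illustration:

```python
def ngrams(tokens, n):
    """Return all n-grams over a token list as space-joined strings."""
    # Slide a window of width n across the tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample document, already tokenized and stop-word filtered.
tokens = "mahout clustering of short documents".split()

bigrams = ngrams(tokens, 2)   # terms made of 2 adjacent tokens
trigrams = ngrams(tokens, 3)  # terms made of 3 adjacent tokens
```

I would like terms like these to appear in the vectors (and in TopTerms)
alongside or instead of the single words.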

Another question: which Mahout clustering algorithm is recommended
for a large number of relatively small documents? (As I said, I use k-means
more or less by accident. It was the first algorithm I could run with my
data; until now I was focused on adding stop-word and stop-regex filtering to
my input text vectors.)

Best regards,
Bogdan
