Hi Mahout Users!
First of all, this community is great, and I appreciate all the Q&A back and forth! I am currently working on text clustering using Mahout's clustering algorithms (kmeans, krunner, canopy, etc.). If anyone has worked on a similar project, please let me know.

I have two questions:

1. To choose the optimal k, I am running krunner across my vectorized dataset. I am trying to understand the spread of my observations across all clusters and to minimize the size of cluster 1 (which looks like the catch-all bucket; can anyone confirm?). However, I am observing that the final count varies depending on k. See below (please ignore the blank cells). Any idea why the final count varies depending on the chosen k?

[image: Inline image 1]

2. I also noticed that some of my clusters contain just one observation (n=1). That doesn't make sense to me. Is there a way to avoid this, or a particular parameter I can tweak?

Thank you, and I look forward to your reply.

Cheers,
Viral