Clustering with higher-level data available for the distance computation is a fine thing.
The tuning will be very different, but the results can be very good when the named entity resolver gets a good hit. Since named entities tend to be relatively rare, they get high IDF scores, and other terms recede a bit as a result of normalization.

> On May 12, 2014, at 6:29, David Noel <david.i.n...@gmail.com> wrote:
>
> I've spent a few weeks tuning Mahout to cluster news articles and have
> had decent results. Decent, but still not perfect. In trying to think
> of ways to improve my results I had the idea of running Mahout on
> output from Stanford's Named Entity Recognizer (NER) instead of the
> articles themselves, and seeing how that compared. Has anyone tried
> this? Did it generate more cohesive clusters?
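
To make the IDF point in the reply concrete, here is a minimal sketch in plain Python (not Mahout code). It assumes the NER output has been folded into the token stream as single tokens; the tokens PERSON_Jane_Doe and ORG_Acme, the toy corpus, and the helper names idf_weights and tfidf_unit_vector are all hypothetical, and the smoothed IDF formula is just one common variant, not necessarily the one Mahout uses.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed IDF: log(N / df) + 1, so ubiquitous terms keep a nonzero weight."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df[term]) + 1.0 for term in df}


def tfidf_unit_vector(doc, idf):
    """TF-IDF vector for one document, scaled to unit L2 length."""
    tf = Counter(doc)
    vec = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()}


# Hypothetical toy corpus: entity tokens are rare, ordinary words are common.
docs = [
    ["government", "said", "report", "PERSON_Jane_Doe"],
    ["government", "said", "market", "ORG_Acme"],
    ["government", "report", "market", "said"],
    ["said", "report", "market", "government"],
]

idf = idf_weights(docs)
vec = tfidf_unit_vector(docs[0], idf)

# Print terms of the first document from heaviest to lightest.
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:18s} {weight:.3f}")
```

Because the vector is scaled to unit length, the large weight on the rare entity token necessarily shrinks the share left for the common words, which is the "recede a bit" effect. Any distance computed on these vectors is then dominated by shared entities when they are present.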