As some of you may know, I'm working on a book (it's a long time coming, but I'm getting there) about open source techniques for working with text. One of my chapters is on clustering, and in it I want to talk about generic clustering approaches and then show concrete examples of them in action. I've got the concrete side of it down.

Based on my research, it seems people typically divide the clustering space into two approaches: hierarchical and flat/partitioning. In overlaying that knowledge with the techniques we have in Mahout, I'm a bit stumped about where things like LDA and Dirichlet fit into those two approaches. Or is there perhaps a third that I'm missing? They don't seem particularly hierarchical, but they don't seem flat either, if that makes any sense, given the probabilistic/mixture nature of the algorithms. Perhaps I should forgo the traditional division that previous authors have taken and just talk about a suite of techniques at a slightly lower level? Thoughts?

The other thing I'm interested in is people's real-world feedback on using clustering to solve their text-related problems. For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)? What algorithms worked for you? What didn't work? Any and all insight is welcome, and I don't particularly care whether it is Mahout-specific (for instance, part of the chapter is about search result clustering using Carrot2, so Mahout isn't applicable there).

Thanks in advance and Happy New Year,
Grant
