On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:

> On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov
> <[email protected]> wrote:
>
>> I would like to give some feedback. And ask some questions as well :).
>
> Thank you!
>
> Very helpful feedback.
>
>> ... Carrot2 for 2 weeks ... has a great level of usability and
>> simplicity, but ... I had to give up on it since my very first practical
>> clustering task required clustering 23K+ documents.
>
> Not too surprising.
Right, Carrot2 is designed for clustering search results, and of those
mainly the title and snippet. While it can handle larger docs, they are
specifically not its target. Plus, C2 is an in-memory tool designed to be
very fast for search results.

>> ...
>> I have managed to do some clustering on my 23,000+ docs with
>> Mahout/k-means in something like 10 min (in standalone mode - no
>> parallel processing at all, I didn't even use all of my (3 :-)) cores
>> yet with Hadoop/Mahout), but I am still learning and still trying to
>> analyze whether the resulting clusters are really meaningful for my
>> docs.
>
> I have seen this effect before, where a map-reduce program run
> sequentially is much faster than an all-in-memory implementation.
>
>> One thing I can tell already now is that I definitely, desperately,
>> need word-stopping.
>
> You should be able to do this in the document -> vector conversion. You
> could also do this at the vector level by multiplying the coordinates of
> all stop words by zero, but that is not as nice a solution.

Right, or if you are using the Lucene extraction method, at Lucene
indexing time.

>> ... But it would be valuable for me to be able to come back later to
>> the complete context of a document (i.e. with the stopwords inside) -
>> maybe it is a question on its own - how can I easily go back from
>> clusters to the original docs (and not just vectors)? I don't know,
>> maybe some kind of mapper which maps vectors to the original documents
>> somehow (e.g. a sort of URL for a document based on the vector id/index
>> or something?).
>
> To do this, you should use the document ID and just return the original
> content from some other content store. Lucene or especially Solr can
> help with this.

Right, Mahout's vectors can take labels.

>> ...
>> I think I will get better results if I can also apply stemming. What
>> would be your recommendation when using Mahout? Should I do the
>> stemming somewhere in the input vector forming?
>
> Yes. That is exactly correct.

Again, really easy to do if you use the Lucene method for creating
vectors. See the analyzer sketch at the end of this mail.

>> It is also really essential for me to have "updateable" algorithms, as
>> I am adding new documents on a daily basis, and I would definitely like
>> to have them clustered immediately (incrementally) - I do not know if
>> this is what is called "classification" in Mahout, and I did not reach
>> those examples yet (I wanted to really get acquainted with the
>> clustering first).
>
> I can't comment on exactly how this should be done, but we definitely
> need to support this use case.

Don't people usually check whether the new docs fit into an existing
cluster? If they are a good fit, add them there; otherwise, put them in
the best match and kick off a new job.

>> And that is not all - I do not only want to have new documents
>> clustered against existing clusters; what I want in addition is that
>> the clusters could actually change as new docs come in.
>
> Exactly. This is easy algorithmically with k-means. It just needs to be
> supported by the software.

Makes sense and shouldn't be that hard to do. I'd imagine we just need to
be able to use the centroids from the previous run as the seeds for the
new run (see the plain-Java sketch at the end of this mail).

>> Of course one could not observe new clusters popping up after a single
>> new doc is added to the analysis, but clusters should really be
>> adaptable/updateable with new docs.
>
> Yes. It is eminently doable.
> Occasionally you should run back through all of the document vectors so
> you can look at old documents in light of new data, but that should be
> very, very fast in your case.
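
To make the "handle it at analysis time" suggestion concrete, here is a
rough, untested sketch of the kind of analyzer I mean, written against a
Lucene 3.0-style analysis API (class names and constructors move around
between Lucene versions, so treat it as a starting point rather than
gospel). You would then point the document -> vector conversion, or the
Lucene index you extract vectors from, at an analyzer like this:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/**
 * Tokenizes with StandardTokenizer, lowercases, drops English stop words,
 * and applies the Porter stemmer - i.e. word-stopping and stemming are
 * done before the vectors are ever built.
 */
public class StemmingStopAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
    stream = new LowerCaseFilter(stream);
    stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    return new PorterStemFilter(stream);
  }
}

Swap in your own stop word set if the default English list is not
aggressive enough for your corpus.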

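And to illustrate the "seed the new run with the previous centroids" idea,
a minimal plain-Java sketch of the algorithmic shape (deliberately not the
Mahout driver API - in practice you would hand the previous run's cluster
output to k-means as its initial clusters and re-run over the old + new
vectors):

import java.util.Arrays;
import java.util.List;

// Incremental k-means: instead of picking random seeds, start the new run
// from the centroids produced by the previous run.
public class SeededKMeans {

  /** One k-means pass: assign every vector to its nearest centroid, then
   *  recompute each centroid as the mean of its assigned vectors. */
  static double[][] iterate(List<double[]> vectors, double[][] centroids) {
    int k = centroids.length;
    int dim = centroids[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    for (double[] v : vectors) {
      int nearest = 0;
      double best = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double dist = 0.0;
        for (int i = 0; i < dim; i++) {
          double diff = v[i] - centroids[c][i];
          dist += diff * diff;
        }
        if (dist < best) {
          best = dist;
          nearest = c;
        }
      }
      counts[nearest]++;
      for (int i = 0; i < dim; i++) {
        sums[nearest][i] += v[i];
      }
    }

    double[][] updated = new double[k][dim];
    for (int c = 0; c < k; c++) {
      if (counts[c] == 0) {
        // Empty cluster: keep the old centroid rather than losing it.
        updated[c] = Arrays.copyOf(centroids[c], dim);
      } else {
        for (int i = 0; i < dim; i++) {
          updated[c][i] = sums[c][i] / counts[c];
        }
      }
    }
    return updated;
  }

  /** Re-cluster old + new document vectors, seeded with yesterday's
   *  centroids. */
  static double[][] update(List<double[]> allVectors,
                           double[][] previousCentroids,
                           int maxIterations) {
    double[][] centroids = previousCentroids;
    for (int iter = 0; iter < maxIterations; iter++) {
      centroids = iterate(allVectors, centroids);
    }
    return centroids;
  }
}

A convergence test is omitted for brevity; in practice you would stop once
the centroids move by less than some delta between passes.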