Hey Ted,

I've been able to prototype a ClusterClassifier which, like VectorModelClassifier, extends AbstractVectorClassifier but which also implements OnlineLearner and Writable. This should work (it compiles) in KMeansClusterer, in place of the Iterable<Cluster> in the sequential code, using train(). I've also been able to add a unit test for it in ModelSerializerTest (it compiles too); a rough sketch of the skeleton is at the bottom of this note, below the quoted thread.

If this could be completed, it would seem to allow k-means, fuzzy k-means, Dirichlet, and maybe even mean-shift cluster classifiers to be used with SGD. Going the other way (using a trained classifier as the prior of a clustering run) should also be possible, though I haven't got it sorted out yet. The challenge would be to use AVC.classify() in the various clusterers or to extract initial centers for k-means and fuzzy k-means. Dirichlet might be adaptable more directly, since its models only have to produce the pi vector of pdfs.

Still lots of loose ends in all this. Certainly not for 0.5. Does any of this make sense?

From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 3:58 PM
To: Jeff Eastman
Subject: Re: Converging Clustering and Classification

Cool.

On Tue, Apr 12, 2011 at 3:56 PM, Jeff Eastman <[email protected]> wrote:

Ok, let me wrap my mind around that. I've almost got the token-offering part, since any Cluster can be used as the prior for k-means, fuzzy k-means, and Dirichlet. A post-processing step to serialize a set of clusters à la ModelSerializer shouldn't be out of the question either. I've got some time this weekend to tinker with it.

From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 3:36 PM
To: [email protected]
Cc: Jeff Eastman
Subject: Re: Converging Clustering and Classification

I will respond from the standpoint of an SGD partisan first.

What I think is needed next is some way to save clusterings as models that are interoperable with SGD models. That is, ModelSerializer.readBinary should return a usable classifier when applied to whatever the clustering algorithm saved. This makes clustering models as deployable as SGD models already are and abstracts away the origin of the clustering model.

It would be a kick if the clustering driver could be merged with the classification driver (if any) so that it would apply any model supported by ModelSerializer.readBinary to the specified data.

Then, as a token offering to the gods of interoperability, it would be kind of cool if the initial state of k-means or other clustering algorithms could also be such a serialized model. That would allow an SGD model to be the initial state for clustering, which would give a vague kind of semi-supervised learning at little cost.

On Tue, Apr 12, 2011 at 2:57 PM, Jeff Eastman <[email protected]> wrote:

Hi Ted,

We've been discussing this on and off and I'd like to pick up the thread again. Currently we have AbstractVectorClassifier (in pkg classifier) and VectorModelClassifier (in pkg clustering). This allows any set of cluster models (List<Model<VectorWritable>>) to function as a classifier. In your last email you indicated this was a step in the right direction. What else is needed?

One thought I've had is this: most clustering algorithms - the older ones, anyway - have static Driver methods "buildClusters()" and "clusterData()". Would it help with the convergence process if these were simply renamed to "trainClusters()" and "classifyData()" (or something similar), respectively?
I know it took me a while to see the isomorphism between clustering and classification, so perhaps something simple like this would be an improvement.
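
P.S. For concreteness, here's roughly the shape of the ClusterClassifier I mentioned at the top. Treat it as a sketch rather than the real thing - the constructors, the method bodies, and the Writable plumbing are just illustrations, and the exact n vs. n-1 category contract of AbstractVectorClassifier is one of the loose ends:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.OnlineLearner;
import org.apache.mahout.clustering.Model;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterClassifier extends AbstractVectorClassifier
    implements OnlineLearner, Writable {

  // the cluster models play the role of the classifier's categories
  private List<Model<VectorWritable>> models =
      new ArrayList<Model<VectorWritable>>();

  public ClusterClassifier() {
    // no-arg constructor for Writable deserialization
  }

  public ClusterClassifier(List<Model<VectorWritable>> models) {
    this.models = models;
  }

  @Override
  public int numCategories() {
    return models.size();
  }

  // the pi vector of model pdfs, normalized; whether this should follow
  // the n-1 category convention of the SGD classifiers is still open
  @Override
  public Vector classify(Vector instance) {
    Vector pi = new DenseVector(models.size());
    VectorWritable vw = new VectorWritable(instance);
    for (int i = 0; i < models.size(); i++) {
      pi.set(i, models.get(i).pdf(vw));
    }
    return pi.normalize(1);
  }

  @Override
  public double classifyScalar(Vector instance) {
    return classify(instance).maxValue();
  }

  // OnlineLearner: an observation is handed to the model it is assigned to
  @Override
  public void train(int actual, Vector instance) {
    models.get(actual).observe(new VectorWritable(instance));
  }

  @Override
  public void train(long trackingKey, int actual, Vector instance) {
    train(actual, instance);
  }

  @Override
  public void train(long trackingKey, String groupKey, int actual, Vector instance) {
    train(actual, instance);
  }

  @Override
  public void close() {
    for (Model<VectorWritable> model : models) {
      model.computeParameters();
    }
  }

  // Writable: serialize the model list so ModelSerializer-style
  // round trips become possible (details elided)
  @Override
  public void write(DataOutput out) throws IOException {
    // write model class names plus each model's own state
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // mirror of write(): reconstruct the model list
  }
}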

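And this is the sort of sequential driver loop I have in mind for exercising it - again just an illustration on top of the sketch above, not working code, and the helper name is made up:

// hypothetical helper - nothing in the codebase does this yet
public static Vector trainAndClassify(ClusterClassifier cc,
                                      Iterable<Vector> data,
                                      Vector query) {
  for (Vector point : data) {
    // assign each point to its most probable model, then observe it there
    int k = cc.classify(point).maxValueIndex();
    cc.train(k, point);
  }
  cc.close();                 // each model computes its parameters
  return cc.classify(query);  // the pi vector of pdfs over the models
}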