Hey Ted,


I've prototyped a ClusterClassifier which, like VectorModelClassifier, 
extends AbstractVectorClassifier but also implements OnlineLearner and 
Writable. It should work (it compiles) in KMeansClusterer in place of 
Iterable<Cluster> in the sequential code, using train(). I've also added a 
unit test for it in ModelSerializerTest (it compiles too). If this could be 
completed, it would seem to allow kmeans, fuzzyk, dirichlet and maybe even 
meanshift cluster classifiers to be used with SGD.
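Roughly, the shape I have in mind is something like the following toy sketch (plain Java, no Mahout dependencies here so it stands alone; the real ClusterClassifier would extend AbstractVectorClassifier and implement OnlineLearner/Writable, and all names and the update rule below are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the proposed ClusterClassifier: it holds a list
// of cluster centers, classifies a point by normalized inverse-distance
// affinity to each center, and supports online training by nudging the
// nearest center toward the observed point (a sequential-kmeans-style
// train(), as in the prototype described above).
public class ClusterClassifierSketch {
  private final List<double[]> centers = new ArrayList<>();
  private final double learningRate;

  public ClusterClassifierSketch(double learningRate) {
    this.learningRate = learningRate;
  }

  public void addCenter(double[] c) {
    centers.add(c.clone());
  }

  // classify(): return a normalized vector of affinities, one per cluster.
  public double[] classify(double[] point) {
    double[] pi = new double[centers.size()];
    double total = 0.0;
    for (int i = 0; i < centers.size(); i++) {
      pi[i] = 1.0 / (1.0 + distance(point, centers.get(i)));
      total += pi[i];
    }
    for (int i = 0; i < pi.length; i++) {
      pi[i] /= total;
    }
    return pi;
  }

  // train(): move the closest center a small step toward the point.
  public void train(double[] point) {
    int best = 0;
    for (int i = 1; i < centers.size(); i++) {
      if (distance(point, centers.get(i)) < distance(point, centers.get(best))) {
        best = i;
      }
    }
    double[] c = centers.get(best);
    for (int d = 0; d < c.length; d++) {
      c[d] += learningRate * (point[d] - c[d]);
    }
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

The point of the sketch is only the dual interface: the same object answers classify() queries and accepts online train() updates.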



Going the other way (using a trained classifier as the prior of a clustering 
run) should also be possible, though I haven't got it sorted out yet. The 
challenge would be to use AVC.classify() in the various clusterers, or to 
extract initial centers for kmeans & fuzzyk. Dirichlet might be adaptable more 
directly since its models only have to produce the pi vector of pdfs.
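To illustrate the Dirichlet point, that the models only need to yield a pi vector of pdfs, here is a toy, self-contained sketch (the DensityModel interface and all names here are hypothetical, not Mahout's actual Model<VectorWritable> API):

```java
import java.util.List;

// Toy illustration of the "pi vector of pdfs" idea: each model reports a
// density for a point, and normalizing those densities gives the pi vector
// a Dirichlet-style clusterer (or a classifier standing in for one) needs.
public class PiVectorSketch {
  // Hypothetical model interface: anything that can score a point's density.
  public interface DensityModel {
    double pdf(double x);
  }

  // A 1-D Gaussian as one concrete model.
  public static DensityModel gaussian(double mean, double sd) {
    return x -> Math.exp(-0.5 * Math.pow((x - mean) / sd, 2))
                / (sd * Math.sqrt(2 * Math.PI));
  }

  // Normalize per-model densities into the pi vector.
  public static double[] pi(List<DensityModel> models, double x) {
    double[] p = new double[models.size()];
    double total = 0.0;
    for (int i = 0; i < models.size(); i++) {
      p[i] = models.get(i).pdf(x);
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) {
      p[i] /= total;
    }
    return p;
  }
}
```

Anything satisfying that one-method contract, whether it came from clustering or from a trained classifier, could plug into the same loop.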



Still lots of loose ends in all of this. Certainly not for 0.5. Does any of 
this make sense?


From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 3:58 PM
To: Jeff Eastman
Subject: Re: Converging Clustering and Classification

Cool.
On Tue, Apr 12, 2011 at 3:56 PM, Jeff Eastman 
<[email protected]> wrote:
Ok, let me wrap my mind around that. I've almost got the token offering part 
since any Cluster can be used as the prior for kmeans, fuzzyK and dirichlet. A 
post-processing step to serialize a set of clusters à la ModelSerializer 
shouldn't be out of the question either. I've got some time this weekend to 
tinker with it.

From: Ted Dunning [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 3:36 PM
To: [email protected]<mailto:[email protected]>
Cc: Jeff Eastman
Subject: Re: Converging Clustering and Classification

I will respond from the standpoint of an SGD partisan first.
What I think is needed next is some way to save clusterings as models that are 
interoperable with SGD models.  That is, ModelSerializer.readBinary should 
return a usable classifier when applied to whatever the clustering algorithm 
saved.  This makes clustering models as deployable as SGD models already are 
and abstracts away the origin of the clustering model.
It would be a kick if the clustering driver could be merged with the 
classification driver (if any) so that it would apply any model supported by 
ModelSerializer.readBinary to specified data.
Then, as a token offering to the gods of interoperability, it would be kind of 
cool if the initial state of k-means or other clustering algorithms could also 
be such a serialized model.  That would allow an SGD model to be the initial 
state for clustering, which would give a vague kind of semi-supervised learning 
at little cost.
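The round trip described here (clustering writes a model, readBinary hands back a usable classifier) could be sketched schematically like this. This is not Mahout's ModelSerializer or its Writable format; the binary layout, class, and method names are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Schematic round trip: a clustering run writes its centers in a binary
// format, and a reader reconstructs a nearest-center classifier from the
// same bytes -- the "clustering model is as deployable as an SGD model" idea.
public class ModelRoundTripSketch {

  // Write k centers of dimension d as: [k][d][center values...].
  public static byte[] writeBinary(double[][] centers) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(centers.length);
      out.writeInt(centers[0].length);
      for (double[] c : centers) {
        for (double v : c) {
          out.writeDouble(v);
        }
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException(e);  // in-memory streams shouldn't fail
    }
  }

  // Read the bytes back; the caller gets centers it can classify against,
  // with no knowledge of which algorithm produced them.
  public static double[][] readBinary(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int k = in.readInt();
      int d = in.readInt();
      double[][] centers = new double[k][d];
      for (int i = 0; i < k; i++) {
        for (int j = 0; j < d; j++) {
          centers[i][j] = in.readDouble();
        }
      }
      return centers;
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // Nearest-center classification over the deserialized model.
  public static int classify(double[][] centers, double[] point) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.length; i++) {
      double sum = 0.0;
      for (int j = 0; j < point.length; j++) {
        double diff = point[j] - centers[i][j];
        sum += diff * diff;
      }
      if (sum < bestDist) {
        bestDist = sum;
        best = i;
      }
    }
    return best;
  }
}
```

The abstraction benefit is in the reader: nothing downstream of readBinary needs to know whether the bytes came from k-means, Dirichlet, or an SGD training run.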



On Tue, Apr 12, 2011 at 2:57 PM, Jeff Eastman 
<[email protected]> wrote:
Hi Ted,

We've been discussing this on and off and I'd like to pick up the thread again. 
Currently we have AbstractVectorClassifier (in pkg classifier) and 
VectorModelClassifier (in pkg clustering). This allows any set of Cluster 
Models (List<Model<VectorWritable>>) to function as a classifier. In your last 
email you indicated this as a step in the right direction. What else is needed?

One thought I've had is this: Most clustering algorithms - the older ones 
anyway - have static Driver methods "buildClusters()" and "clusterData()". 
Would it help with the convergence process if these were simply renamed to 
"trainClusters()" and "classifyData()" (or something similar) respectively? I 
know it took me a while to see the isomorphism between clustering and 
classification, so perhaps something simple like this would be an improvement.

