If this isn't all a crock, it could collapse kmeans, fuzzyk and
Dirichlet into a single implementation too:
- Begin with a prior ClusterClassifier containing the appropriate sort
of Cluster, in clusters-n
- For each input Vector, compute the pdf vector using CC.classify()
-- For kmeans, train the most likely model from the pdf vector
-- For Dirichlet, train the model selected by the multinomial of the pdf
vector * mixture vector
-- For fuzzyk, train each model by its normalized pdf (would need a new
classify method for this)
- Close the CC, computing all posterior model parameters
- Serialize the CC into clusters-n+1
Now that would really be cool. A rough sketch of one such iteration is below.
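Something like this hypothetical per-iteration driver, where the Policy
enum, sampleMultinomial() and the weighted train() overload are all
assumptions, not existing Mahout API:

import java.util.Random;
import org.apache.mahout.math.Vector;

// Hypothetical unified iteration driver; Policy, sampleMultinomial()
// and the weighted train() overload are assumptions, not Mahout API.
final class ClusterIteration {

  enum Policy { KMEANS, DIRICHLET, FUZZYK }

  static void runIteration(ClusterClassifier cc, Iterable<Vector> points,
                           Policy policy, Vector mixture, Random rng) {
    for (Vector point : points) {
      Vector pdfs = cc.classify(point); // pdf vector from the prior models
      switch (policy) {
        case KMEANS:
          // train only the most likely model
          cc.train(pdfs.maxValueIndex(), point);
          break;
        case DIRICHLET:
          // train the model drawn from the multinomial of pdfs * mixture
          cc.train(sampleMultinomial(pdfs.times(mixture), rng), point);
          break;
        case FUZZYK: {
          // train every model by its normalized pdf; this needs the new
          // method mentioned above
          Vector weights = pdfs.normalize(1);
          for (int k = 0; k < weights.size(); k++) {
            cc.train(k, point, weights.get(k)); // hypothetical overload
          }
          break;
        }
      }
    }
    cc.close(); // compute all posterior model parameters
  }

  // Draw an index from an (unnormalized) multinomial distribution.
  static int sampleMultinomial(Vector p, Random rng) {
    double r = rng.nextDouble() * p.zSum();
    double cum = 0.0;
    for (int i = 0; i < p.size(); i++) {
      cum += p.get(i);
      if (r <= cum) {
        return i;
      }
    }
    return p.size() - 1;
  }
}

Serializing the CC into clusters-n+1 at the end would then just be the
Writable round trip sketched further down.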
On 4/13/11 9:00 PM, Jeff Eastman wrote:
Lol, not too surprising considering the source. Here's how I got there:
- ClusterClassifier holds a "List<Cluster> models;" field as its only
state just like VectorModelClassifier does
- Started with ModelSerializerTest since you suggested being
compatible with ModelSerializer
- This tests OnlineLogisticRegression, CrossFoldLearner and
AdaptiveLogisticRegression
- The first two are also subclasses of AbstractVectorClassifier just
like ClusterClassifier
- The tests pass OLR and CFL learners to train(OnlineLearner) so it
made sense for a CC to be an OL too
- The new CC.train(...) methods map to "models.get(actual).observe()",
i.e. Cluster.observe(V)
- CC.close() maps to cluster.computeParameters() for each model which
computes the posterior cluster parameters
- Now the CC is ready for another iteration or to classify, etc. (rough
skeleton below)
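In code, that mapping might look roughly like this; the normalized-pdf
classify(), the VectorWritable wrapping and the trivial classifyScalar()
are my guesses at details, and Writable support is omitted:

import java.util.List;
import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.OnlineLearner;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Rough skeleton of the prototype described above; several details
// here are assumptions, not the actual code.
public class ClusterClassifier extends AbstractVectorClassifier
    implements OnlineLearner {

  private List<Cluster> models; // the only state, as in VectorModelClassifier

  public ClusterClassifier(List<Cluster> models) {
    this.models = models; // construct from the prior List<Cluster>
  }

  @Override
  public int numCategories() {
    return models.size();
  }

  @Override
  public Vector classify(Vector instance) {
    // the vector of model pdfs for this point, normalized to sum to 1
    Vector pdfs = new DenseVector(models.size());
    for (int i = 0; i < models.size(); i++) {
      pdfs.set(i, models.get(i).pdf(new VectorWritable(instance)));
    }
    return pdfs.normalize(1);
  }

  @Override
  public double classifyScalar(Vector instance) {
    return classify(instance).maxValue(); // simplistic placeholder
  }

  @Override
  public void train(int actual, Vector instance) {
    // "models.get(actual).observe()" -- train just the selected model
    models.get(actual).observe(new VectorWritable(instance));
  }

  @Override
  public void train(long trackingKey, int actual, Vector instance) {
    train(actual, instance); // tracking keys aren't meaningful here
  }

  @Override
  public void train(long trackingKey, String groupKey, int actual,
      Vector instance) {
    train(actual, instance);
  }

  @Override
  public void close() {
    for (Cluster model : models) {
      model.computeParameters(); // compute posterior cluster parameters
    }
  }
}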
So, the cluster iteration process starts with a prior List<Cluster>
which is used to construct the ClusterClassifier. Then, in each
iteration, each point is passed to CC.classify() and the index of the
maximum-probability element in the returned Vector is used to train()
the CC. Since all the DistanceMeasureClusters contain their
appropriate DistanceMeasure, the one with the maximum pdf() is the
closest. This is just what kmeans already does, only less efficiently
(kmeans uses the minimum distance directly, but pdf() = e^-distance,
so the closest cluster has the largest pdf()).
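The per-point loop is then just this fragment, assuming cc and points
are in scope:

// One kmeans-style pass: classify, pick the max-pdf model, train it.
for (Vector point : points) {
  Vector pdfs = cc.classify(point);
  // pdf() = e^-distance is monotone decreasing in distance, so the
  // largest pdf marks the closest cluster
  cc.train(pdfs.maxValueIndex(), point);
}
cc.close(); // compute posterior parameters for the next iteration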
Finally, instead of passing in a List<Cluster> in the KMeansClusterer
I can just carry around a CC which wraps it. Instead of serializing a
List<Cluster> at the end of each iteration I can just serialize the
CC. At the beginning of the next iteration, I just deserialize it and go.
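The round trip could be as simple as this, assuming the usual no-arg
constructor that Writables need; raw local files here just stand in
for whatever the real iteration uses (e.g. SequenceFiles):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative checkpoint/restore via the Writable interface.
void writeClusters(ClusterClassifier cc, int n) throws IOException {
  DataOutputStream out =
      new DataOutputStream(new FileOutputStream("clusters-" + (n + 1)));
  try {
    cc.write(out); // Writable.write(DataOutput)
  } finally {
    out.close();
  }
}

ClusterClassifier readClusters(int n) throws IOException {
  ClusterClassifier cc = new ClusterClassifier(); // assumed no-arg ctor
  DataInputStream in =
      new DataInputStream(new FileInputStream("clusters-" + n));
  try {
    cc.readFields(in); // Writable.readFields(DataInput)
  } finally {
    in.close();
  }
  return cc;
}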
It was so easy it surely must be wrong :)
On 4/13/11 7:54 PM, Ted Dunning wrote:
On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[email protected]> wrote:
I've been able to prototype a ClusterClassifier which, like
VectorModelClassifier, extends AbstractVectorClassifier but which also
implements OnlineLearner and Writable.
Implementing OnlineLearner is a surprise here.
Have to think about it since the learning doesn't have a target
variable.
... If this could be completed it would seem to allow kmeans, fuzzyk,
Dirichlet and maybe even meanshift cluster classifiers to be used
with SGD.
Very cool.
... The challenge would be to use AVC.classify() in the various
clusterers or to extract initial centers for kmeans & fuzzyk. Dirichlet
might be adaptable more directly since its models only have to produce
the pi vector of pdfs.
Yes. Dirichlet is the one where this makes sense.