If this isn't all a crock, it could collapse kmeans, fuzzyk and
Dirichlet into a single implementation too:
- Begin with a prior ClusterClassifier containing the appropriate sort
of Cluster, in clusters-n
- For each input Vector, compute the pdf vector using CC.classify()
-- For kmeans, train the most likely model from the pdf vector
-- For Dirichlet, train the model selected by the multinomial of the pdf
vector * mixture vector
-- For fuzzyk, train each model by its normalized pdf (would need a new
classify method for this)
- Close the CC, computing all posterior model parameters
- Serialize the CC into clusters-n+1
Now that would really be cool. A rough sketch of one such iteration is below.
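Something like this hypothetical per-iteration driver, where the Policy
enum, sampleMultinomial() and the weighted train() overload are all
assumptions, not existing Mahout API:

import java.util.Random;
import org.apache.mahout.math.Vector;

// Hypothetical unified iteration driver; Policy, sampleMultinomial()
// and the weighted train() overload are assumptions, not Mahout API.
final class ClusterIteration {

  enum Policy { KMEANS, DIRICHLET, FUZZYK }

  static void runIteration(ClusterClassifier cc, Iterable<Vector> points,
                           Policy policy, Vector mixture, Random rng) {
    for (Vector point : points) {
      Vector pdfs = cc.classify(point); // pdf vector from the prior models
      switch (policy) {
        case KMEANS:
          // train only the most likely model
          cc.train(pdfs.maxValueIndex(), point);
          break;
        case DIRICHLET:
          // train the model drawn from the multinomial of pdfs * mixture
          cc.train(sampleMultinomial(pdfs.times(mixture), rng), point);
          break;
        case FUZZYK: {
          // train every model by its normalized pdf; this needs the new
          // method mentioned above
          Vector weights = pdfs.normalize(1);
          for (int k = 0; k < weights.size(); k++) {
            cc.train(k, point, weights.get(k)); // hypothetical overload
          }
          break;
        }
      }
    }
    cc.close(); // compute all posterior model parameters
  }

  // Draw an index from an (unnormalized) multinomial distribution.
  static int sampleMultinomial(Vector p, Random rng) {
    double r = rng.nextDouble() * p.zSum();
    double cum = 0.0;
    for (int i = 0; i < p.size(); i++) {
      cum += p.get(i);
      if (r <= cum) {
        return i;
      }
    }
    return p.size() - 1;
  }
}

Serializing the CC into clusters-n+1 at the end would then just be the
Writable round trip sketched further down.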
On 4/13/11 9:00 PM, Jeff Eastman wrote:
Lol, not too surprising considering the source. Here's how I got there:
- ClusterClassifier holds a "List<Cluster> models;" field as its only
state just like VectorModelClassifier does
- Started with ModelSerializerTest since you suggested being
compatible with ModelSerializer
- This tests OnlineLogisticRegression, CrossFoldLearner and
AdaptiveLogisticRegression
- The first two are also subclasses of AbstractVectorClassifier just
like ClusterClassifier
- The tests pass OLR and CFL learners to train(OnlineLearner) so it
made sense for a CC to be an OL too
- The new CC.train(...) methods map to "models.get(actual).observe()",
i.e. Cluster.observe(V)
- CC.close() maps to cluster.computeParameters() for each model which
computes the posterior cluster parameters
- Now the CC is ready for another iteration or to classify, etc. (rough
skeleton below)
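In code, that mapping might look roughly like this; the normalized-pdf
classify(), the VectorWritable wrapping and the trivial classifyScalar()
are my guesses at details, and Writable support is omitted:

import java.util.List;
import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.OnlineLearner;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Rough skeleton of the prototype described above; several details
// here are assumptions, not the actual code.
public class ClusterClassifier extends AbstractVectorClassifier
    implements OnlineLearner {

  private List<Cluster> models; // the only state, as in VectorModelClassifier

  public ClusterClassifier(List<Cluster> models) {
    this.models = models; // construct from the prior List<Cluster>
  }

  @Override
  public int numCategories() {
    return models.size();
  }

  @Override
  public Vector classify(Vector instance) {
    // the vector of model pdfs for this point, normalized to sum to 1
    Vector pdfs = new DenseVector(models.size());
    for (int i = 0; i < models.size(); i++) {
      pdfs.set(i, models.get(i).pdf(new VectorWritable(instance)));
    }
    return pdfs.normalize(1);
  }

  @Override
  public double classifyScalar(Vector instance) {
    return classify(instance).maxValue(); // simplistic placeholder
  }

  @Override
  public void train(int actual, Vector instance) {
    // "models.get(actual).observe()" -- train just the selected model
    models.get(actual).observe(new VectorWritable(instance));
  }

  @Override
  public void train(long trackingKey, int actual, Vector instance) {
    train(actual, instance); // tracking keys aren't meaningful here
  }

  @Override
  public void train(long trackingKey, String groupKey, int actual,
      Vector instance) {
    train(actual, instance);
  }

  @Override
  public void close() {
    for (Cluster model : models) {
      model.computeParameters(); // compute posterior cluster parameters
    }
  }
}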
So, the cluster iteration process starts with a prior List<Cluster>
which is used to construct the ClusterClassifier. Then, in each
iteration, each point is passed to CC.classify() and the index of the
maximum-probability element in the returned Vector is used to train()
the CC. Since all the DistanceMeasureClusters contain their
appropriate DistanceMeasure, the one with the maximum pdf() is the
closest. This is just what kmeans already does, only less efficiently
(kmeans uses the minimum distance directly, but pdf() = e^-distance,
so the closest cluster has the largest pdf()).
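The per-point loop is then just this fragment, assuming cc and points
are in scope:

// One kmeans-style pass: classify, pick the max-pdf model, train it.
for (Vector point : points) {
  Vector pdfs = cc.classify(point);
  // pdf() = e^-distance is monotone decreasing in distance, so the
  // largest pdf marks the closest cluster
  cc.train(pdfs.maxValueIndex(), point);
}
cc.close(); // compute posterior parameters for the next iteration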
Finally, instead of passing in a List<Cluster> in the KMeansClusterer
I can just carry around a CC which wraps it. Instead of serializing a
List<Cluster> at the end of each iteration I can just serialize the
CC. At the beginning of the next iteration, I just deserialize it and go.
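The round trip could be as simple as this, assuming the usual no-arg
constructor that Writables need; raw local files here just stand in
for whatever the real iteration uses (e.g. SequenceFiles):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative checkpoint/restore via the Writable interface.
void writeClusters(ClusterClassifier cc, int n) throws IOException {
  DataOutputStream out =
      new DataOutputStream(new FileOutputStream("clusters-" + (n + 1)));
  try {
    cc.write(out); // Writable.write(DataOutput)
  } finally {
    out.close();
  }
}

ClusterClassifier readClusters(int n) throws IOException {
  ClusterClassifier cc = new ClusterClassifier(); // assumed no-arg ctor
  DataInputStream in =
      new DataInputStream(new FileInputStream("clusters-" + n));
  try {
    cc.readFields(in); // Writable.readFields(DataInput)
  } finally {
    in.close();
  }
  return cc;
}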
It was so easy it surely must be wrong :)
On 4/13/11 7:54 PM, Ted Dunning wrote:
On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[email protected]> wrote:
I've been able to prototype a ClusterClassifier which, like
VectorModelClassifier, extends AbstractVectorClassifier but which also
implements OnlineLearner and Writable.
Implementing OnlineLearner is a surprise here.
Have to think about it since the learning doesn't have a target
variable.
... If this could be completed it would seem to allow kmeans, fuzzyk,
Dirichlet and maybe even meanshift cluster classifiers to be used
with SGD.
Very cool.
... The challenge would be to use AVC.classify() in the various
clusterers or to extract initial centers for kmeans & fuzzyk. Dirichlet
might be adaptable more directly since its models only have to produce
the pi vector of pdfs.
Yes. Dirichlet is the one where this makes sense.