Lol, not too surprising considering the source. Here's how I got there:
- ClusterClassifier holds a "List<Cluster> models;" field as its only
state just like VectorModelClassifier does
- Started with ModelSerializerTest since you suggested being compatible
with ModelSerializer
- This tests OnlineLogisticRegression, CrossFoldLearner and
AdaptiveLogisticRegression
- The first two are also subclasses of AbstractVectorClassifier just
like ClusterClassifier
- The tests pass OLR and CFL learners to train(OnlineLearner) so it made
sense for a CC to be an OL too
- The new CC.train(...) methods map to "models.get(actual).observe()",
i.e. Cluster.observe(V) on the cluster selected by the actual index
- CC.close() maps to cluster.computeParameters() for each model, which
computes the posterior cluster parameters
- Now the CC is ready for another iteration or to classify, etc. (see the
sketch just below)
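Here's roughly the skeleton, to make those mappings concrete. It's a
minimal sketch, not the actual patch: the OnlineLearner and
AbstractVectorClassifier signatures are Mahout's, but the VectorWritable
wrapping, the classifyScalar() choice and the stubbed Writable methods
are my assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.mahout.classifier.AbstractVectorClassifier;
import org.apache.mahout.classifier.OnlineLearner;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterClassifier extends AbstractVectorClassifier
    implements OnlineLearner, Writable {

  private List<Cluster> models; // the only state, as in VectorModelClassifier

  public ClusterClassifier(List<Cluster> models) {
    this.models = models;
  }

  @Override
  public int numCategories() {
    return models.size();
  }

  @Override
  public Vector classify(Vector instance) {
    // one pdf per cluster, normalized to sum to 1; the caller takes the
    // index of the largest element to find the most probable cluster
    Vector pdfs = new DenseVector(models.size());
    for (int i = 0; i < models.size(); i++) {
      pdfs.set(i, models.get(i).pdf(new VectorWritable(instance)));
    }
    return pdfs.divide(pdfs.zSum());
  }

  @Override
  public double classifyScalar(Vector instance) {
    return classify(instance).maxValue(); // one plausible choice
  }

  @Override
  public void train(int actual, Vector instance) {
    // "actual" selects which cluster observes this point
    models.get(actual).observe(new VectorWritable(instance));
  }

  @Override
  public void train(long trackingKey, int actual, Vector instance) {
    train(actual, instance);
  }

  @Override
  public void train(long trackingKey, String groupKey, int actual,
      Vector instance) {
    train(actual, instance);
  }

  @Override
  public void close() {
    // compute the posterior parameters of every cluster; afterwards the
    // CC is ready for another iteration or to classify
    for (Cluster cluster : models) {
      cluster.computeParameters();
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // stub: write models.size() plus each cluster's class and state
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // stub: reconstruct the List<Cluster>
  }
}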
So the cluster iteration process starts with a prior List<Cluster>, which
is used to construct the ClusterClassifier. Then, in each iteration, each
point is passed to CC.classify() and the index of the maximum-probability
element in the returned Vector is used to train() the CC. Since each
DistanceMeasureCluster contains its appropriate DistanceMeasure, the
cluster with the maximum pdf() is the closest. This is just what kmeans
already does, only less efficiently (kmeans uses the minimum distance
directly, but pdf() = e^-distance, so the closest cluster has the largest
pdf()).
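In code, a full iteration pass boils down to this (cluster() and its
arguments are placeholders of mine, not existing Mahout API):

static ClusterClassifier cluster(List<Cluster> prior,
    Iterable<Vector> points, int maxIterations) {
  ClusterClassifier classifier = new ClusterClassifier(prior);
  for (int iter = 0; iter < maxIterations; iter++) {
    for (Vector point : points) {
      // pdf() decreases with distance, so the largest pdf marks the
      // closest cluster
      int closest = classifier.classify(point).maxValueIndex();
      classifier.train(closest, point);
    }
    classifier.close(); // compute posteriors, ready for the next pass
  }
  return classifier;
}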
Finally, instead of passing a List<Cluster> into the KMeansClusterer, I
can just carry around a CC which wraps it. Instead of serializing a
List<Cluster> at the end of each iteration, I can just serialize the CC.
At the beginning of the next iteration, I just deserialize it and go.
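Something like this, assuming plain HDFS streams and a no-arg constructor
for the Writable contract (both assumptions of mine; the actual driver
plumbing may differ):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

static void writeClassifier(FileSystem fs, Path path, ClusterClassifier cc)
    throws IOException {
  // end of an iteration: serialize the whole CC, not a List<Cluster>
  FSDataOutputStream out = fs.create(path);
  cc.write(out);
  out.close();
}

static ClusterClassifier readClassifier(FileSystem fs, Path path)
    throws IOException {
  // beginning of the next iteration: deserialize it and go
  FSDataInputStream in = fs.open(path);
  ClusterClassifier cc = new ClusterClassifier(); // no-arg ctor assumed
  cc.readFields(in);
  in.close();
  return cc;
}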
It was so easy it surely must be wrong :)
On 4/13/11 7:54 PM, Ted Dunning wrote:
> On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[email protected]> wrote:
>> I've been able to prototype a ClusterClassifier which, like
>> VectorModelClassifier, extends AbstractVectorClassifier but which also
>> implements OnlineLearner and Writable.
>
> Implementing OnlineLearner is a surprise here.
>
> Have to think about it since the learning doesn't have a target variable.
>
>> ... If this could be completed it would seem to allow kmeans, fuzzyk,
>> dirichlet and maybe even meanshift cluster classifiers to be used with SGD.
>
> Very cool.
>
>> ... The challenge would be to use AVC.classify() in the various clusterers
>> or to extract initial centers for kmeans & fuzzyk. Dirichlet might be
>> adaptable more directly since its models only have to produce the pi vector
>> of pdfs.
>
> Yes. Dirichlet is the one where this makes sense.