Great, let me see what I can build this weekend as a separate universal 
clusterer using these ideas

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Wednesday, April 13, 2011 9:46 PM
To: [email protected]
Cc: Jeff Eastman
Subject: Re: FW: Converging Clustering and Classification

Yeah... this is what I had in mind when I said grand unified theory.

On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <[email protected]> wrote:

> If this isn't all a crock, it could potentially collapse kmeans, fuzzyk and
> Dirichlet into a single implementation too:
>
> - Begin with a prior ClusterClassifier containing the appropriate sort of
> Cluster, in clusters-n
> - For each input Vector, compute the pdf vector using CC.classify()
> -- For kmeans, train the most likely model from the pdf vector
> -- For Dirichlet, train the model selected by the multinomial of the pdf
> vector * mixture vector
> -- For fuzzyk, train each model by its normalized pdf (would need a new
> classify method for this)
> - Close the CC, computing all posterior model parameters
> - Serialize the CC into clusters-n+1
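>
> Roughly, the per-iteration driver might then collapse to something like this
> (just a Java sketch; readPrior/writePosterior, the policy switch and
> sampleMultinomial are illustrative placeholders, not existing code):
>
>   ClusterClassifier cc = readPrior(clustersN);        // prior models from clusters-n
>   for (Vector x : points) {
>     Vector pdfs = cc.classify(x);                     // one pdf per model
>     switch (policy) {
>       case KMEANS:                                    // train the most likely model only
>         cc.train(pdfs.maxValueIndex(), x);
>         break;
>       case DIRICHLET:                                 // sample from pdf vector * mixture vector
>         cc.train(sampleMultinomial(pdfs.times(mixture)), x);
>         break;
>       case FUZZYK:                                    // train every model by normalized pdf
>         cc.trainWeighted(pdfs.normalize(1), x);       // would need the new method noted above
>         break;
>     }
>   }
>   cc.close();                                         // compute all posterior model parameters
>   writePosterior(cc, clustersN1);                     // serialize into clusters-n+1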
>
> Now that would really be cool
>
>
> On 4/13/11 9:00 PM, Jeff Eastman wrote:
>
>> Lol, not too surprising considering the source. Here's how I got there:
>>
>> - ClusterClassifier holds a "List<Cluster> models;" field as its only
>> state just like VectorModelClassifier does
>> - Started with ModelSerializerTest since you suggested being compatible
>> with ModelSerializer
>> - This tests OnlineLogisticRegression, CrossFoldLearner and
>> AdaptiveLogisticRegression
>> - The first two are also subclasses of AbstractVectorClassifier just like
>> ClusterClassifier
>> - The tests pass OLR and CFL learners to train(OnlineLearner) so it made
>> sense for a CC to be an OL too
>> - The new CC.train(...) methods map to "models.get(actual).observe(V)",
>> i.e. Cluster.observe()
>> - CC.close() maps to cluster.computeParameters() for each model which
>> computes the posterior cluster parameters
>> - Now the CC is ready for another iteration or to classify, etc.
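>>
>> In skeleton form it's roughly this (eliding imports, the Writable plumbing,
>> numCategories(), the other OnlineLearner train() overloads, and whether
>> observe()/pdf() take a Vector or a VectorWritable):
>>
>>   public class ClusterClassifier extends AbstractVectorClassifier
>>       implements OnlineLearner, Writable {
>>     private List<Cluster> models;                   // the only state
>>
>>     @Override
>>     public Vector classify(Vector instance) {
>>       Vector pdfs = new DenseVector(models.size()); // one pdf per cluster
>>       for (int i = 0; i < models.size(); i++) {
>>         pdfs.set(i, models.get(i).pdf(instance));
>>       }
>>       return pdfs;
>>     }
>>
>>     @Override
>>     public void train(int actual, Vector instance) {
>>       models.get(actual).observe(instance);         // accumulate the point in that model
>>     }
>>
>>     @Override
>>     public void close() {
>>       for (Cluster model : models) {
>>         model.computeParameters();                  // compute posterior cluster parameters
>>       }
>>     }
>>   }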
>>
>> So, the cluster iteration process starts with a prior List<Cluster> which
>> is used to construct the ClusterClassifier. Then in each iteration each
>> point is passed to CC.classify() and the maximum probability element index
>> in the returned Vector is used to train() the CC. Since all the
>> DistanceMeasureClusters contain their appropriate DistanceMeasure, the one
>> with the maximum pdf() is the closest. This is just what kmeans already
>> does, only less efficiently (kmeans uses the minimum distance directly,
>> but pdf() = e^-distance, so the closest cluster has the largest pdf()).
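>>
>> So for kmeans the whole assignment step is just (sketch):
>>
>>   int nearest = cc.classify(point).maxValueIndex(); // argmax of e^-d == argmin of d
>>   cc.train(nearest, point);                         // observe the point in that cluster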
>>
>> Finally, instead of passing a List<Cluster> into the KMeansClusterer I
>> can just carry around a CC which wraps it. Instead of serializing a
>> List<Cluster> at the end of each iteration I can just serialize the CC. At
>> the beginning of the next iteration, I just deserialize it and go.
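>>
>> The serialization side is just the usual Writable round trip (sketch;
>> assumes a no-arg constructor for deserialization, and in the real driver
>> this would presumably go through the same SequenceFile handling as the
>> existing clusters-n directories):
>>
>>   cc.write(out);                          // end of iteration n -> clusters-n+1
>>
>>   ClusterClassifier next = new ClusterClassifier();
>>   next.readFields(in);                    // start of iteration n+1, then just go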
>>
>> It was so easy it surely must be wrong :)
>>
>>
>>
>> On 4/13/11 7:54 PM, Ted Dunning wrote:
>>
>>> On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[email protected]>
>>>  wrote:
>>>
>>>  I've been able to prototype a ClusterClassifier which, like
>>>> VectorModelClassifier, extends AbstractVectorClassifier but which also
>>>> implements OnlineLearner and Writable.
>>>>
>>> Implementing OnlineLearner is a surprise here.
>>>
>>> Have to think about it since the learning doesn't have a target variable.
>>>
>>>
>>>  ... If this could be completed it would seem to allow kmeans, fuzzyk,
>>>> dirichlet and maybe even meanshift cluster classifiers to be used with
>>>> SGD.
>>>>
>>> Very cool.
>>>
>>>> ... The challenge would be to use AVC.classify() in the various
>>>> clusterers
>>>
>>>> or to extract initial centers for kmeans & fuzzyk. Dirichlet might be
>>>> adaptable more directly since its models only have to produce the pi
>>>> vector
>>>> of pdfs.
>>>>
>>> Yes. Dirichlet is the one where this makes sense.
>>>
>>>
>>
>
