[ https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020336#comment-13020336 ]
Jeff Eastman commented on MAHOUT-479:
-------------------------------------

Ted: Yeah... this is what I had in mind when I said grand unified theory.

On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> This could potentially collapse kmeans, fuzzyk and Dirichlet into a single
> implementation too:
>
> - Begin with a prior ClusterClassifier containing the appropriate sort of
>   Cluster, in clusters-n
> - For each input Vector, compute the pdf vector using CC.classify()
>   -- For kmeans, train the most likely model from the pdf vector
>   -- For Dirichlet, train the model selected by the multinomial of the pdf
>      vector * mixture vector
>   -- For fuzzyk, train each model by its normalized pdf (would need a
>      new classify method for this)
> - Close the CC, computing all posterior model parameters
> - Serialize the CC into clusters-n+1
>
> Now that would really be cool.
>
> On 4/13/11 9:00 PM, Jeff Eastman wrote:
>
>> Here's how I got there:
>>
>> - ClusterClassifier holds a "List<Cluster> models;" field as its only
>>    state, just like VectorModelClassifier does
>> - Started with ModelSerializerTest since you suggested being
>>    compatible with ModelSerializer
>> - This tests OnlineLogisticRegression, CrossFoldLearner and
>>    AdaptiveLogisticRegression
>> - The first two are also subclasses of AbstractVectorClassifier, just
>>    like ClusterClassifier
>> - The tests pass OLR and CFL learners to train(OnlineLearner), so it
>>    made sense for a CC to be an OL too
>> - The new CC.train(...) methods map to "models.get(actual).observe()"
>>    in Cluster.observe(V)
>> - CC.close() maps to cluster.computeParameters() for each model, which
>>    computes the posterior cluster parameters
>> - Now the CC is ready for another iteration or to classify, etc.
>>
>> So, the cluster iteration process starts with a prior List<Cluster>,
>> which is used to construct the ClusterClassifier.
>> Then in each
>> iteration each point is passed to CC.classify() and the maximum-
>> probability element index in the returned Vector is used to train()
>> the CC. Since all the DistanceMeasureClusters contain their
>> appropriate DistanceMeasure, the one with the maximum pdf() is the
>> closest. This is just what kmeans already does, but done less
>> efficiently (it uses just the minimum distance, but pdf() = e^-distance,
>> so the closest cluster has the largest pdf()).
>>
>> Finally, instead of passing a List<Cluster> into the KMeansClusterer,
>> I can just carry around a CC which wraps it. Instead of serializing a
>> List<Cluster> at the end of each iteration, I can just serialize the
>> CC. At the beginning of the next iteration, I just deserialize it and go.

> Streamline classification/clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>            Assignee: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our
> classification and clustering algorithms to make integration easier for users,
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
>
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was
> that our classification (and clustering) stuff was all over the map in terms
> of data structures. Driving that to rest and getting those components even
> vaguely as plug and play as our much more advanced recommendation components
> would be very, very helpful.
> {quote}
>
> This issue probably also relates to MAHOUT-287 (the intention there is to make
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on
> some of the issues you discussed "the other evening" and (if applicable) any
> minor or major changes you think could help solve this issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
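The unified iteration sketched in the comment above can be roughed out in Java. This is a minimal, hypothetical sketch, not the actual Mahout API: `Cluster`, `ClusterClassifier`, and `Policy` here are simplified stand-ins, vectors are plain `double[]`, and pdf() = e^-distance as in the discussion. Serialization of the CC between clusters-n and clusters-n+1 is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified stand-in for a DistanceMeasureCluster; NOT the real Mahout class.
class Cluster {
  double[] centroid;
  private double[] sum;   // accumulated observations for this iteration
  private double weight;  // accumulated observation weight

  Cluster(double[] c) {
    centroid = c.clone();
    sum = new double[c.length];
  }

  // pdf() = e^-distance, so the closest cluster yields the largest pdf
  double pdf(double[] x) {
    double d2 = 0;
    for (int i = 0; i < x.length; i++) d2 += (x[i] - centroid[i]) * (x[i] - centroid[i]);
    return Math.exp(-Math.sqrt(d2));
  }

  // Cluster.observe(V): accumulate a (possibly fractional) observation
  void observe(double[] x, double w) {
    for (int i = 0; i < x.length; i++) sum[i] += w * x[i];
    weight += w;
  }

  // computeParameters(): posterior update at the end of an iteration
  void computeParameters() {
    if (weight > 0) {
      for (int i = 0; i < centroid.length; i++) centroid[i] = sum[i] / weight;
    }
    sum = new double[centroid.length];
    weight = 0;
  }
}

// Simplified stand-in for the ClusterClassifier described in the comment.
class ClusterClassifier {
  enum Policy { KMEANS, FUZZYK, DIRICHLET }

  final List<Cluster> models;  // the only state, as in the discussion
  final Policy policy;
  private final Random rng = new Random(42);

  ClusterClassifier(List<Cluster> models, Policy policy) {
    this.models = models;
    this.policy = policy;
  }

  // classify(): normalized pdf vector over all models
  double[] classify(double[] x) {
    double[] p = new double[models.size()];
    double total = 0;
    for (int i = 0; i < p.length; i++) {
      p[i] = models.get(i).pdf(x);
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) p[i] /= total;
    return p;
  }

  // train(): the per-algorithm policy applied to the same classify() output
  void train(double[] x) {
    double[] p = classify(x);
    switch (policy) {
      case KMEANS:    // train only the most likely model
        models.get(argmax(p)).observe(x, 1.0);
        break;
      case FUZZYK:    // train every model by its normalized pdf
        for (int i = 0; i < p.length; i++) models.get(i).observe(x, p[i]);
        break;
      case DIRICHLET: // train the model drawn from the multinomial of p
        models.get(sample(p)).observe(x, 1.0);
        break;
    }
  }

  // close(): compute all posterior model parameters
  void close() {
    for (Cluster c : models) c.computeParameters();
  }

  static int argmax(double[] p) {
    int best = 0;
    for (int i = 1; i < p.length; i++) if (p[i] > p[best]) best = i;
    return best;
  }

  private int sample(double[] p) {
    double r = rng.nextDouble(), cum = 0;
    for (int i = 0; i < p.length; i++) {
      cum += p[i];
      if (r < cum) return i;
    }
    return p.length - 1;
  }
}
```

One kmeans-style iteration is then classify/train over all input points followed by close(); switching algorithms only swaps the training policy, which is the collapse into a single implementation that the comment describes.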