[ https://issues.apache.org/jira/browse/MAHOUT-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020336#comment-13020336 ]
Jeff Eastman commented on MAHOUT-479:
-------------------------------------

Ted: Yeah... this is what I had in mind when I said grand unified theory.

On Wed, Apr 13, 2011 at 9:24 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> This could potentially collapse kmeans, fuzzyk and Dirichlet into a single
> implementation too:
>
> - Begin with a prior ClusterClassifier containing the appropriate sort of
>   Cluster, in clusters-n
> - For each input Vector, compute the pdf vector using CC.classify()
>   -- For kmeans, train the most likely model from the pdf vector
>   -- For Dirichlet, train the model selected by the multinomial of the pdf
>      vector * mixture vector
>   -- For fuzzyk, train each model by its normalized pdf (would need a
>      new classify method for this)
> - Close the CC, computing all posterior model parameters
> - Serialize the CC into clusters-n+1
>
> Now that would really be cool.
>
> On 4/13/11 9:00 PM, Jeff Eastman wrote:
>
>> Here's how I got there:
>>
>> - ClusterClassifier holds a "List<Cluster> models;" field as its only
>>    state, just like VectorModelClassifier does
>> - Started with ModelSerializerTest since you suggested being
>>    compatible with ModelSerializer
>> - This tests OnlineLogisticRegression, CrossFoldLearner and
>>    AdaptiveLogisticRegression
>> - The first two are also subclasses of AbstractVectorClassifier, just
>>    like ClusterClassifier
>> - The tests pass OLR and CFL learners to train(OnlineLearner), so it
>>    made sense for a CC to be an OL too
>> - The new CC.train(...) methods map to "models.get(actual).observe()"
>>    in Cluster.observe(V)
>> - CC.close() maps to cluster.computeParameters() for each model, which
>>    computes the posterior cluster parameters
>> - Now the CC is ready for another iteration or to classify, etc.
>>
>> So, the cluster iteration process starts with a prior List<Cluster>,
>> which is used to construct the ClusterClassifier.
>> Then in each
>> iteration each point is passed to CC.classify() and the maximum-
>> probability element index in the returned Vector is used to train()
>> the CC. Since all the DistanceMeasureClusters contain their
>> appropriate DistanceMeasure, the one with the maximum pdf() is the
>> closest. This is just what kmeans already does, but done less
>> efficiently (it uses just the minimum distance, but pdf() = e^-distance,
>> so the closest cluster has the largest pdf()).
>>
>> Finally, instead of passing a List<Cluster> into the KMeansClusterer,
>> I can just carry around a CC which wraps it. Instead of serializing a
>> List<Cluster> at the end of each iteration, I can just serialize the
>> CC. At the beginning of the next iteration, I just deserialize it and go.

> Streamline classification/clustering data structures
> -----------------------------------------------------
>
>                 Key: MAHOUT-479
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-479
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.1, 0.2, 0.3, 0.4
>            Reporter: Isabel Drost
>            Assignee: Isabel Drost
>
> Opening this JIRA issue to collect ideas on how to streamline our
> classification and clustering algorithms to make integration easier for users,
> as per mailing list thread http://markmail.org/message/pnzvrqpv5226twfs
>
> {quote}
> Jake and Robin and I were talking the other evening and a common lament was
> that our classification (and clustering) stuff was all over the map in terms
> of data structures. Driving that to rest and getting those components even
> vaguely as plug and play as our much more advanced recommendation components
> would be very, very helpful.
> {quote}
>
> This issue probably also relates to MAHOUT-287 (the intention there is to make
> naive bayes run on vectors as input).
> Ted, Jake, Robin: Would be great if one of you could add a comment on
> some of the issues you discussed "the other evening" and (if applicable) any
> minor or major changes you think could help solve this issue.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
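The unified iteration sketched in the comment above can be roughed out in Java. This is a minimal, hypothetical sketch, not the actual Mahout API: `Cluster`, `ClusterClassifier`, and `Policy` here are simplified stand-ins, vectors are plain `double[]`, and pdf() = e^-distance as in the discussion. Serialization of the CC between clusters-n and clusters-n+1 is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified stand-in for a DistanceMeasureCluster; NOT the real Mahout class.
class Cluster {
  double[] centroid;
  private double[] sum;   // accumulated observations for this iteration
  private double weight;  // accumulated observation weight

  Cluster(double[] c) {
    centroid = c.clone();
    sum = new double[c.length];
  }

  // pdf() = e^-distance, so the closest cluster yields the largest pdf
  double pdf(double[] x) {
    double d2 = 0;
    for (int i = 0; i < x.length; i++) d2 += (x[i] - centroid[i]) * (x[i] - centroid[i]);
    return Math.exp(-Math.sqrt(d2));
  }

  // Cluster.observe(V): accumulate a (possibly fractional) observation
  void observe(double[] x, double w) {
    for (int i = 0; i < x.length; i++) sum[i] += w * x[i];
    weight += w;
  }

  // computeParameters(): posterior update at the end of an iteration
  void computeParameters() {
    if (weight > 0) {
      for (int i = 0; i < centroid.length; i++) centroid[i] = sum[i] / weight;
    }
    sum = new double[centroid.length];
    weight = 0;
  }
}

// Simplified stand-in for the ClusterClassifier described in the comment.
class ClusterClassifier {
  enum Policy { KMEANS, FUZZYK, DIRICHLET }

  final List<Cluster> models;  // the only state, as in the discussion
  final Policy policy;
  private final Random rng = new Random(42);

  ClusterClassifier(List<Cluster> models, Policy policy) {
    this.models = models;
    this.policy = policy;
  }

  // classify(): normalized pdf vector over all models
  double[] classify(double[] x) {
    double[] p = new double[models.size()];
    double total = 0;
    for (int i = 0; i < p.length; i++) {
      p[i] = models.get(i).pdf(x);
      total += p[i];
    }
    for (int i = 0; i < p.length; i++) p[i] /= total;
    return p;
  }

  // train(): the per-algorithm policy applied to the same classify() output
  void train(double[] x) {
    double[] p = classify(x);
    switch (policy) {
      case KMEANS:    // train only the most likely model
        models.get(argmax(p)).observe(x, 1.0);
        break;
      case FUZZYK:    // train every model by its normalized pdf
        for (int i = 0; i < p.length; i++) models.get(i).observe(x, p[i]);
        break;
      case DIRICHLET: // train the model drawn from the multinomial of p
        models.get(sample(p)).observe(x, 1.0);
        break;
    }
  }

  // close(): compute all posterior model parameters
  void close() {
    for (Cluster c : models) c.computeParameters();
  }

  static int argmax(double[] p) {
    int best = 0;
    for (int i = 1; i < p.length; i++) if (p[i] > p[best]) best = i;
    return best;
  }

  private int sample(double[] p) {
    double r = rng.nextDouble(), cum = 0;
    for (int i = 0; i < p.length; i++) {
      cum += p[i];
      if (r < cum) return i;
    }
    return p.length - 1;
  }
}
```

One kmeans-style iteration is then classify/train over all input points followed by close(); switching algorithms only swaps the training policy, which is the collapse into a single implementation that the comment describes.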