[
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589060#action_12589060
]
karl.wettin edited comment on MAHOUT-19 at 4/15/08 5:52 AM:
------------------------------------------------------------
Ok, here is the current map reduce strategy for building the tree:
{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create file with all permutations of nodes with no parent;
  execute job to find which node pairs are closest to each other;
  add node pairs as siblings to each other; (create new branches with node
  pairs as children)
  calculate mean vectors in all new branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}
Any comments on that?
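As a rough illustration of the loop above, here is a single-machine, in-memory sketch in Java. It is not the map reduce implementation and not code from the attached patch. Several things are assumptions made only for the sketch: the class and method names (`BottomUpTreeSketch`, `Node`, `build`, `join`, `nearestIndex`) are hypothetical, "closest to each other" is read as "mutual nearest neighbours", a branch mean is taken as the unweighted mean of its two children, and Euclidean distance stands in for whatever metric the real job would use.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of the bottom-up strategy; every name here is hypothetical. */
final class BottomUpTreeSketch {

    /** Leaves hold an instance vector; branches hold the mean of their two children. */
    static final class Node {
        final double[] mean;
        final Node left;
        final Node right;
        final double childDistance; // distance between the two children, 0 for leaves

        Node(double[] instance) {
            this(instance, null, null, 0.0);
        }

        Node(double[] mean, Node left, Node right, double childDistance) {
            this.mean = mean;
            this.left = left;
            this.right = right;
            this.childDistance = childDistance;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    /** Placeholder metric (Euclidean), an assumption of this sketch. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Creates a new branch whose vector is the unweighted mean of its two children. */
    static Node join(Node a, Node b) {
        double[] mean = new double[a.mean.length];
        for (int i = 0; i < mean.length; i++) {
            mean[i] = (a.mean[i] + b.mean[i]) / 2.0;
        }
        return new Node(mean, a, b, distance(a.mean, b.mean));
    }

    /** Index of the node closest to nodes.get(i); ties go to the lowest index. */
    static int nearestIndex(List<Node> nodes, int i) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int j = 0; j < nodes.size(); j++) {
            if (j == i) {
                continue;
            }
            double d = distance(nodes.get(i).mean, nodes.get(j).mean);
            if (d < bestDistance) {
                bestDistance = d;
                best = j;
            }
        }
        return best;
    }

    static Node build(List<double[]> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("no instances");
        }
        // Add all instances as leaf nodes with no parent.
        List<Node> parentless = new ArrayList<>();
        for (double[] instance : instances) {
            parentless.add(new Node(instance));
        }
        while (parentless.size() > 2) {
            // Stand-in for the "closest pairs" job: nearest neighbour of every node.
            int[] nearest = new int[parentless.size()];
            for (int i = 0; i < nearest.length; i++) {
                nearest[i] = nearestIndex(parentless, i);
            }
            // Pair nodes that are each other's nearest neighbour; carry the rest over.
            List<Node> next = new ArrayList<>();
            for (int i = 0; i < nearest.length; i++) {
                int j = nearest[i];
                if (nearest[j] == i) {
                    if (i < j) {
                        next.add(join(parentless.get(i), parentless.get(j)));
                    }
                } else {
                    next.add(parentless.get(i));
                }
            }
            parentless = next;
        }
        // Place the last two parentless nodes in the root (or return a lone leaf).
        return parentless.size() == 1
                ? parentless.get(0)
                : join(parentless.get(0), parentless.get(1));
    }
}
```

Calling `BottomUpTreeSketch.build(...)` with a list of equal-length `double[]` instances returns the root node. The inner loop is quadratic per round, which is the part the "all permutations" file and the map reduce job would be distributing.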
> Hierarchial clusterer
> ---------------------
>
> Key: MAHOUT-19
> URL: https://issues.apache.org/jira/browse/MAHOUT-19
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Minor
> Attachments: MAHOUT-19.txt, TestBottomFeed.test.png,
> TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree, where
> each branch node contains the mean features of, and the distance between,
> its children.
> For performance reasons I have always trained trees top->down. I have been
> told that this can cause various effects, which I have never encountered. And
> I believe Huffman solved his problem by training bottom->up? The thing is, I
> don't think it is possible to train the tree top->down using map reduce. I do
> however think it is possible to train it bottom->up. I would very much
> appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean
> distance between all instances is usually a good maximum distance to allow
> between nodes when navigating the tree in search of a cluster.
> Navigating the tree and gathering nodes that are not too far away from each
> other is usually instant if the tree is available in memory or persisted in a
> smart way. In my experience there is not much to gain from extracting all
> clusters up front. Also, it usually makes sense to allow the user to modify
> the cluster boundary variables in real time using a slider, or perhaps to
> present the named summary of neighbouring clusters, blacklist paths in the
> tree, etc. It is also not too bad to use secondary classification on the
> instances to create wormholes in the tree. I always thought it would be cool
> to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this"
> feature in search engines, using Tanimoto similarity on the vector spaces to
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a
> hierarchical clusterer.
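
The description above mentions Tanimoto similarity as the basis for the distance. As a hedged illustration, the following Java sketch uses the common real-valued form of the coefficient, T(a, b) = a·b / (|a|² + |b|² − a·b), and takes the distance as 1 − T. Whether the issue's code uses exactly this form is an assumption, and the class and method names (`TanimotoSketch`, `similarity`, `distance`) are hypothetical.

```java
/** Sketch of Tanimoto similarity and distance on dense vectors; names are hypothetical. */
final class TanimotoSketch {

    /** T(a, b) = a.b / (|a|^2 + |b|^2 - a.b); two all-zero vectors count as identical. */
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        // The denominator is zero only when both vectors are all zeros.
        return denominator == 0.0 ? 1.0 : dot / denominator;
    }

    /** Distance derived from the similarity: 0 for identical vectors, 1 for orthogonal ones. */
    static double distance(double[] a, double[] b) {
        return 1.0 - similarity(a, b);
    }
}
```

For term vectors with non-negative weights the distance stays between 0 and 1, which makes a single maximum-distance threshold easy to reason about when navigating the tree.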