[jira] Issue Comment Edited: (MAHOUT-19) Hierarchial clusterer

Karl Wettin (JIRA) Tue, 15 Apr 2008 05:56:14 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12589060#action_12589060
 ]


karl.wettin edited comment on MAHOUT-19 at 4/15/08 5:52 AM:
------------------------------------------------------------

Ok, here is the current map reduce strategy for building the tree:

{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create file with all permutations of nodes with no parent;
  execute job to find which node pairs are closest to each other;
  add node pairs as siblings to each other; (create new branches with node pairs as children)
  calculate mean vectors in all new branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}
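
The loop above could be sketched in a single process roughly as follows. This is only an illustration and not the MAHOUT-19 patch: the `Node` class, `build_tree`, `join`, the Euclidean stand-in distance, the greedy rule that each node joins at most one pair per round, and the unweighted mean of the two children are all assumptions made for the example. The map reduce jobs are replaced by plain in-memory loops.

```python
from itertools import combinations
import math


class Node:
    """Leaves hold an instance vector; branches hold the mean of their children."""

    def __init__(self, vector, children=(), distance=0.0):
        self.vector = vector
        self.children = list(children)
        self.distance = distance  # distance between the two children, 0.0 for leaves
        self.parent = None


def euclidean(a, b):
    # Stand-in distance measure (assumption); any distance function can be passed in.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(a, b):
    # Unweighted mean of the two child vectors (assumption).
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def join(left, right, distance_fn):
    """Create a new branch with `left` and `right` as children."""
    branch = Node(mean_vector(left.vector, right.vector), (left, right),
                  distance_fn(left.vector, right.vector))
    left.parent = right.parent = branch
    return branch


def build_tree(instances, distance_fn=euclidean):
    # Add all instances to the tree as leaf nodes with no parents.
    parentless = [Node(list(v)) for v in instances]
    if not parentless:
        raise ValueError("need at least one instance")
    while len(parentless) > 2:
        # All pairs of parentless nodes, closest first.
        pairs = sorted(combinations(parentless, 2),
                       key=lambda p: distance_fn(p[0].vector, p[1].vector))
        paired = set()
        branches = []
        for left, right in pairs:
            # Greedy matching (assumption): a node joins at most one pair per round.
            if id(left) in paired or id(right) in paired:
                continue
            paired.update((id(left), id(right)))
            branches.append(join(left, right, distance_fn))
        # Nodes left unpaired this round stay parentless for the next round.
        leftovers = [n for n in parentless if id(n) not in paired]
        parentless = branches + leftovers
    if len(parentless) == 1:
        return parentless[0]
    # Place the last two parentless nodes in the root node.
    return join(parentless[0], parentless[1], distance_fn)
```

Calling `build_tree([[0, 0], [0, 1], [10, 10], [10, 11]])` pairs the two nearby points on each side in the first round and joins the two resulting branches in the root.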

Any comments on that?

      was (Author: karl.wettin):
    Ok, here is the current map reduce strategy for building the tree:

{noformat}
add all instances to tree as leaf nodes with no parents;
while (tree contains more than 2 nodes with no parent) {
  create file with all permutation of nodes with no parent;
  execute job to find what node pairs are closes to each other;
  add nodes pairs as siblings to each other;
  calculate mean vectors in all branches; (another map reduce job)
}
place last two parentless nodes in root node;
calculate root mean vector;
{noformat}

Any comments to that?
  
> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, 
> TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree where 
> branch nodes contain the mean features of, and the distance between, their 
> children.
> For performance reasons I always trained trees from the top->down. I have 
> been told that it can cause various effects I never encountered. And I 
> believe Huffman solved his problem by training bottom->up? The thing is, I 
> don't think it is possible to train the tree top->down using map reduce. I do 
> however think it is possible to train it bottom->up. I would very much 
> appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean 
> distance between all instances is usually a good maximum distance to allow 
> between nodes when navigating the tree in search of a cluster.
> Navigating the tree and gathering nodes that are not too far away from each 
> other is usually instant if the tree is available in memory or persisted in a 
> smart way. In my experience there is not much to gain from extracting all 
> clusters up front. Also, it usually makes sense to allow the user to 
> modify the cluster boundary variables in real time using a slider, or perhaps 
> to present the named summary of neighbouring clusters, blacklist paths in the 
> tree, etc. It is also not too bad to use secondary classification on the 
> instances to create worm holes in the tree. I always thought it would be cool 
> to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" 
> feature in search engines, using Tanimoto similarity on the vector spaces to 
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a 
> hierarchical clusterer.
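
As a rough illustration of the distance measure mentioned in the description, the Tanimoto similarity of two real-valued vectors a and b is commonly written as a·b / (|a|² + |b|² − a·b), and one minus that value can serve as a distance. The function names below are made up for this example and are not taken from the patch or from LUCENE-1025, and treating two all-zero vectors as identical is an assumption.

```python
def tanimoto_similarity(a, b):
    """Tanimoto coefficient of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a)
    norm_b = sum(y * y for y in b)
    denom = norm_a + norm_b - dot
    # Two all-zero vectors are treated as identical (assumption).
    return dot / denom if denom else 1.0


def tanimoto_distance(a, b):
    """Distance derived from the similarity: 0.0 for identical vectors."""
    return 1.0 - tanimoto_similarity(a, b)
```

For binary term vectors this reduces to the size of the intersection divided by the size of the union of the two term sets, so documents sharing no terms are at distance 1.0.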

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
