[ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663797#action_12663797 ]

Karl Wettin commented on MAHOUT-19:
-----------------------------------

bq. 1. Assuming you are training the tree top-down, what is the division 
criterion you are using?

The distributed code uses bottom-up training; one could, however, add some code 
that trains it top-down.
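As a rough illustration of what bottom-up feeding means here (a sketch of the idea only, not the patch's actual classes or API; `Node`, `insert`, and the dense `double[]` features are all stand-ins):

```java
/** Minimal sketch of bottom-up feeding: each new instance descends to its
 *  closest leaf, which is then replaced by a branch holding both the old
 *  leaf and the new instance. Hypothetical names, not the patch's API. */
public class BottomUpFeed {
    static class Node {
        double[] mean;
        Node left, right;             // both null for a leaf
        Node(double[] mean) { this.mean = mean; }
        boolean isLeaf() { return left == null; }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double[] mean(double[] a, double[] b) {
        double[] m = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            m[i] = (a[i] + b[i]) / 2.0;
        }
        return m;
    }

    /** Insert one instance; returns the (possibly new) subtree root. */
    static Node insert(Node node, double[] instance) {
        if (node == null) {
            return new Node(instance.clone());
        }
        if (node.isLeaf()) {
            // Split the leaf into a branch whose mean averages its children.
            Node branch = new Node(mean(node.mean, instance));
            branch.left = node;
            branch.right = new Node(instance.clone());
            return branch;
        }
        // Descend into the closer child; a faithful version would also
        // refresh this branch's mean after the insert.
        if (squaredDistance(node.left.mean, instance)
                <= squaredDistance(node.right.mean, instance)) {
            node.left = insert(node.left, instance);
        } else {
            node.right = insert(node.right, instance);
        }
        return node;
    }
}
```

Because each instance only touches one root-to-leaf path, inserts are independent enough to distribute, which is what makes the bottom-up direction map-reduce friendly.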

bq. 2. How well does it scale ?

It scales OK, though the tree grows rather large.

bq. 3. Was the data on which this was tried, sparse ?

I've tried a bit of everything. My aim is to produce a tree representing a 
Lucene index in order to supply instant "more like this" results. Such data is 
definitely sparse.

bq. 4. What is the distance metric that has been used ?

It can use any of the metrics in Mahout. For my text tests I use Tanimoto 
(Jaccard).
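For reference, Tanimoto (Jaccard) distance over the sparse term sets of two documents is just 1 minus the ratio of shared terms to total distinct terms. A minimal stand-alone version (not Mahout's actual DistanceMeasure implementation; class and method names are made up for illustration):

```java
import java.util.Set;

public class TanimotoDistance {
    /** Tanimoto (Jaccard) distance between two term sets: 1 - |A∩B| / |A∪B|. */
    public static double distance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;          // two empty documents count as identical
        }
        int intersection = 0;
        for (String term : a) {
            if (b.contains(term)) {
                intersection++;
            }
        }
        int union = a.size() + b.size() - intersection;
        return 1.0 - (double) intersection / union;
    }

    public static void main(String[] args) {
        Set<String> docA = Set.of("mahout", "cluster", "tree");
        Set<String> docB = Set.of("mahout", "tree", "lucene");
        // 2 shared terms, 4 distinct terms in total -> distance 0.5
        System.out.println(distance(docA, docB)); // prints 0.5
    }
}
```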

bq. Basically I have a use-case wherein I have a set of 5-10 million URLs 
which have an inherent hierarchical relationship, and a set of user clicks on 
them. I would like to cluster them in a tree and use the model to answer 
near-neighborhood-type queries, i.e. which URLs are related to which other 
URLs. I did implement a sequential bottom-up hierarchical clustering algorithm, 
but the complexity is too bad for my data set. I then thought about 
implementing a top-down hierarchical clustering algorithm using the Jaccard 
coefficient as my distance measure and came across this patch. Can you suggest 
if this patch will help?

The tree will probably be huge, but disk space is cheap. There are a few ways 
of improving speed: one could, for instance, find the closest clusters and 
navigate to the closest leaf rather than iterating over all leaves. That should 
speed things up a lot; I've calculated it to be something like a factor of 10 
on text data. There is, however, no support for that right now, but I think 
there are some comments about it in the code.
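The leaf-navigation shortcut described above could look something like this (hypothetical names; the patch itself does not implement it): instead of measuring the distance to every leaf, descend from the root and always step into the child whose mean is closer to the query instance, which costs one comparison per level instead of one per leaf.

```java
public class GreedyDescent {
    /** Hypothetical node: leaves hold instances, branches hold mean features. */
    static class Node {
        double[] mean;
        Node left, right;             // both null for a leaf
        Node(double[] mean) { this.mean = mean; }
        boolean isLeaf() { return left == null; }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Greedy walk toward the instance: O(depth) comparisons instead of
     *  scanning every leaf. Note this is a heuristic; it can miss the true
     *  nearest leaf when that leaf hides under the farther-mean child. */
    static Node closestLeaf(Node root, double[] instance) {
        Node node = root;
        while (!node.isLeaf()) {
            double dLeft = squaredDistance(node.left.mean, instance);
            double dRight = squaredDistance(node.right.mean, instance);
            node = dLeft <= dRight ? node.left : node.right;
        }
        return node;
    }
}
```

For a reasonably balanced tree that turns a linear scan over millions of leaves into a few dozen comparisons, which is where the rough factor-10 estimate on text data comes from.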

Whether 5-10 million instances will take a long time to train, I do not know. 
But once trained, it will only take a few milliseconds to extract a cluster for 
a given trained instance.

There is a bug in this code: the tree will only grow in one child of the root. 
It is not a big problem and should be rather easy to fix; I just need to find 
some time to trace it out.

Another problem is tree persistence. Currently it uses PersistentHashMap, 
something I hacked up for this patch. There is also a contrib module that uses 
BerkeleyDB JE, which is much nicer. PersistentHashMap has actually been 
factored out, improved, and posted as a brand-new project called BananaDB, 
available in Apache Labs: 

http://svn.apache.org/repos/asf/labs/bananadb/

The BerkeleyDB dependency (read: the Sleepycat License) might no longer be a 
problem now that we are moving towards Maven. We should probably discuss that 
on the dev list.

> Hierarchical clusterer
> ----------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, 
> MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png
>
>
> In a hierarchical clusterer, the instances are the leaf nodes in a tree whose 
> branch nodes contain the mean features of, and the distance between, their 
> children.
> For performance reasons I always trained trees from the top->down. I have 
> been told that it can cause various effects that I never encountered. And I 
> believe Huffman solved his problem by training bottom->up? The thing is, I 
> don't think it is possible to train the tree top->down using MapReduce. I do, 
> however, think it is possible to train it bottom->up. I would very much 
> appreciate any thoughts on this.
> Once this tree is trained, one can extract clusters in various ways. The mean 
> distance between all instances is usually a good maximum distance to allow 
> between nodes when navigating the tree in search of a cluster. 
> Navigating the tree and gathering nodes that are not too far away from each 
> other is usually instant if the tree is available in memory or persisted in a 
> smart way. In my experience there is not much to gain from extracting all 
> clusters from the start. Also, it usually makes sense to allow the user to 
> modify the cluster boundary variables in real time using a slider, or perhaps 
> to present a named summary of neighbouring clusters, blacklist paths in the 
> tree, etc. It is also not too bad to use secondary classification on the 
> instances to create wormholes in the tree. I always thought it would be cool 
> to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" 
> feature in search engines, using Tanimoto similarity on the vector spaces to 
> calculate the distance.
> See LUCENE-1025 for a single-threaded, all-in-memory proof of concept of a 
> hierarchical clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
