Re: [jira] Updated: (MAHOUT-19) Hierarchial clusterer

Karl Wettin Fri, 25 Apr 2008 04:54:23 -0700

Goel,

the persistence is local only, on the machine executing the job. It "needs" to spawn a new job for each instance to be inserted in the tree. Where to insert a new instance is what the spawned jobs figure out; they never touch the tree, however. It is only touched by the Driver class on the local machine. Never DFS.

The reason I do not use MapFile is that it loads the complete hashtable into memory (not the values, only the entries table), and that means it does not scale indefinitely. A tree with 10 million instances will probably contain ten times as many tree nodes, and each one contains a mean instance of its children. So 10 million training instances equals some 200 million object instances to keep track of.


If my maths is right, that means something like 700MB RAM.

That is why I use local object storage and not DFS-based storage.

     karl


Goel, Ankur wrote:

Karl,
      Did you try using Hadoop MapFile? Currently HBase (Hadoop Simple
database) uses them for its indexing requirements for data lying in
HDFS. I think it would make a better choice for the Hadoop version of
the algorithm.

JDBM would work fine for the non-Hadoop version, but for the Hadoop
version the JDBM source code would require modification to talk to the
underlying HDFS for persistence. This would include changing the
serialization code of JDBM to fit the Hadoop Writable model. This would
be quite involved, I think.


-Ankur

-----Original Message-----

From: Karl Wettin (JIRA) [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 23, 2008 3:37 AM

To: [email protected]
Subject: [jira] Updated: (MAHOUT-19) Hierarchial clusterer

     [ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: MAHOUT-19.txt

This works now.  Just needs a bit better tests and the tree->dot
graphviz code needs to be fixed again before I want to commit anything.
Also want to try out those training optimization strategies I've written
about earlier (find closest cluster or leaf node first and then the
closest instance) and have a few small todos.

It uses my quick and dirty PersistentMap, a Map<Writable, Writable> that
keeps all data on disk at all times using RandomAccessFile (local
storage, not dfs) but will probably be replaced by BSDed
jdbm.sourceforge.net that Andrzej pointed out.
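
To illustrate the general idea of a map whose entries and index both stay on local disk, here is a minimal sketch built on RandomAccessFile. It is not the PersistentMap from the attached patch: the class name DiskMapSketch, the fixed bucket count, the record layout, and the use of byte[] in place of Writable are all assumptions made to keep the example dependency-free.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

/** Hypothetical disk-backed hash map; NOT the PersistentMap from the patch. */
final class DiskMapSketch implements AutoCloseable {
    private static final int BUCKETS = 1024;            // fixed bucket count (assumption)
    private static final long TABLE_BYTES = BUCKETS * 8L;
    private final RandomAccessFile file;

    DiskMapSketch(File path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        if (file.length() == 0) {
            // New file: write an all-zero bucket table. Offset 0 means "empty bucket".
            file.write(new byte[(int) TABLE_BYTES]);
        }
    }

    /** File position of the bucket slot that holds the chain head for this key. */
    private static long bucketPos(byte[] key) {
        return Math.floorMod(Arrays.hashCode(key), BUCKETS) * 8L;
    }

    /** Appends a record and makes it the new head of its bucket chain. */
    void put(byte[] key, byte[] value) throws IOException {
        long slot = bucketPos(key);
        file.seek(slot);
        long oldHead = file.readLong();
        long recordPos = file.length();
        file.seek(recordPos);
        file.writeLong(oldHead);          // link to the previous head of the chain
        file.writeInt(key.length);
        file.write(key);
        file.writeInt(value.length);
        file.write(value);
        file.seek(slot);
        file.writeLong(recordPos);
    }

    /** Walks the chain on disk; the newest record for a key is found first. */
    byte[] get(byte[] key) throws IOException {
        file.seek(bucketPos(key));
        long pos = file.readLong();
        while (pos != 0) {
            file.seek(pos);
            long next = file.readLong();
            byte[] storedKey = new byte[file.readInt()];
            file.readFully(storedKey);
            if (Arrays.equals(storedKey, key)) {
                byte[] value = new byte[file.readInt()];
                file.readFully(value);
                return value;
            }
            pos = next;
        }
        return null;                      // key not present
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Nothing but the RandomAccessFile handle is held in memory, at the cost of one or more seeks per lookup. Overwriting a key leaves the old record behind in the file, so a real implementation would need compaction, deletion, and a growable table.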

Hierarchial clusterer
---------------------

                Key: MAHOUT-19
                URL: https://issues.apache.org/jira/browse/MAHOUT-19
            Project: Mahout
         Issue Type: New Feature
         Components: Clustering
           Reporter: Karl Wettin
           Assignee: Karl Wettin
           Priority: Minor

        Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt, TestBottomFeed.test.png, TestTopFeed.test.png



In a hierarchical clusterer the instances are the leaf nodes in a tree
where branch nodes contain the mean features of, and the distance
between, their children.
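
The node layout described above could be sketched roughly as follows. This is an illustration only, not code from the patch: the class name TreeNodeSketch, the binary (two-child) shape, the unweighted mean of the two children, and the pluggable distance function are all assumptions.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical tree node: leaves hold instances, branches hold child statistics. */
final class TreeNodeSketch {
    final double[] mean;          // leaf: the instance features; branch: mean of the children
    final double childDistance;   // distance between the two children; 0 for a leaf
    final TreeNodeSketch left;
    final TreeNodeSketch right;

    /** Leaf node wrapping a single instance. */
    TreeNodeSketch(double[] features) {
        this.mean = features.clone();
        this.childDistance = 0.0;
        this.left = null;
        this.right = null;
    }

    /** Branch node joining two subtrees (assumes both have the same dimensionality). */
    TreeNodeSketch(TreeNodeSketch left, TreeNodeSketch right,
                   ToDoubleBiFunction<double[], double[]> distance) {
        double[] m = new double[left.mean.length];
        for (int i = 0; i < m.length; i++) {
            // Unweighted mean of the two children (assumption; weighting by leaf count is also possible).
            m[i] = (left.mean[i] + right.mean[i]) / 2.0;
        }
        this.mean = m;
        this.childDistance = distance.applyAsDouble(left.mean, right.mean);
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null;
    }

    /** Euclidean distance, used here only as a stand-in distance function. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```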

For performance reasons I have always trained trees top->down. I have
been told that this can cause various effects that I have never
encountered. And I believe Huffman solved his problem by training
bottom->up? The thing is, I don't think it is possible to train the tree
top->down using map reduce. I do however think it is possible to train
it bottom->up. I would very much appreciate any thoughts on this.

Once this tree is trained one can extract clusters in various ways. The
mean distance between all instances is usually a good maximum distance
to allow between nodes when navigating the tree in search of a cluster.
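
As a rough illustration of that threshold, the mean distance over all pairs of instances can be computed as below. The class and method names are hypothetical, and the brute-force loop over every pair is an assumption made for clarity; it is quadratic in the number of instances, so a large collection would more likely be sampled.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper for picking a maximum distance; not part of the patch. */
final class MeanDistanceSketch {
    /** Mean of distance(a, b) over all unordered pairs of instances. */
    static double meanPairwiseDistance(double[][] instances,
                                       ToDoubleBiFunction<double[], double[]> distance) {
        int n = instances.length;
        if (n < 2) {
            return 0.0;                           // no pairs to average
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sum += distance.applyAsDouble(instances[i], instances[j]);
            }
        }
        double pairs = n * (n - 1.0) / 2.0;       // number of unordered pairs
        return sum / pairs;
    }
}
```

For the three one-dimensional instances 0, 1 and 3 with absolute difference as the distance, the pairwise distances are 1, 3 and 2, so the suggested maximum distance would be 2.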

Navigating the tree and gathering nodes that are not too far away from
each other is usually instant if the tree is available in memory or
persisted in a smart way. In my experience there is not much to win from
extracting all clusters from the start. Also, it usually makes sense to
allow the user to modify the cluster boundary variables in real time
using a slider, or perhaps to present the named summary of neighbouring
clusters, blacklist paths in the tree, etc. It is also not too bad to
use secondary classification on the instances to create worm holes in
the tree. I always thought it would be cool to visualize it using
Touchgraph.

My focus is on clustering text documents for an instant "more like
this" feature in search engines, using Tanimoto similarity on the
vector spaces to calculate the distance.
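
For reference, Tanimoto similarity between two real-valued vectors is commonly defined as a.b / (|a|^2 + |b|^2 - a.b), and a distance can be taken as one minus that value. The sketch below assumes dense double[] term-weight vectors of equal length; the class name and the handling of two all-zero vectors are illustrative choices, not taken from the patch.

```java
/** Hypothetical Tanimoto helper for dense term-weight vectors. */
final class TanimotoSketch {
    /** Tanimoto similarity: a.b / (|a|^2 + |b|^2 - a.b). */
    static double similarity(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        if (denominator == 0.0) {
            return 1.0;   // both vectors are all zeros; treated as identical (assumption)
        }
        return dot / denominator;
    }

    /** Distance derived from the similarity. */
    static double distance(double[] a, double[] b) {
        return 1.0 - similarity(a, b);
    }
}
```

For the vectors (1, 1, 0) and (1, 0, 1), the dot product is 1 and both squared norms are 2, so the similarity is 1/3 and the distance is 2/3.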

See LUCENE-1025 for a single-threaded, all-in-memory proof of concept
of a hierarchical clusterer.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
