Goel,
the persistence is local only, on the machine executing the job. It
"needs" to spawn a new job for each instance to be inserted in the tree.
The spawned jobs figure out where to insert a new instance, but they
never touch the tree. Only the Driver class on the local machine
touches it. Never DFS.
The reason I do not use MapFile is that it loads the complete
hashtable into memory (not the values, only the entries table), and that
means it does not scale indefinitely. A tree with 10 million instances
will probably contain ten times as many tree nodes, and each one contains
the mean instance of its children. So 10 million training instances
equals some 200 million object instances to keep track of.
If my maths is right, that means something like 700 MB of RAM.
That is why I use local object storage and not DFS-based storage.
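
As a rough check of the arithmetic above, the sketch below multiplies an
entry count by a per-entry cost. The 3.5 bytes per entry is not a
measured number: it is only what the two figures in this mail (200
million entries, about 700 MB) imply when divided. The function name and
the 24-byte alternative are made up for illustration.

```python
def index_memory_mb(entries: int, bytes_per_entry: float) -> float:
    """Estimated size in MB (10**6 bytes) of an in-memory index."""
    return entries * bytes_per_entry / 1_000_000


# 200 million tracked objects, the estimate given in the mail.
ENTRIES = 200_000_000

# Assumption: 3.5 bytes/entry is simply 700 MB divided by 200 million.
print(index_memory_mb(ENTRIES, 3.5))

# Assumption: a heavier entry, e.g. a 16-byte key plus an 8-byte offset.
print(index_memory_mb(ENTRIES, 24))
```

The point is only that the estimate grows linearly with the number of
tracked objects, so the real per-entry cost decides how soon an
in-memory index becomes a problem.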
karl
Goel, Ankur wrote:
Karl,
Did you try using Hadoop MapFile? Currently HBase (Hadoop Simple
database) uses them for its indexing requirements for data lying in
HDFS. I think it would make a better choice for the Hadoop version of
the algorithm.
JDBM would work fine for the non-Hadoop version, but for the Hadoop
version the JDBM source code would require modification to talk to
the underlying HDFS for persistence. This would include changing the
serialization code of JDBM to fit the Hadoop Writable model.
This would be quite involved, I think.
-Ankur
-----Original Message-----
From: Karl Wettin (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 23, 2008 3:37 AM
To: [email protected]
Subject: [jira] Updated: (MAHOUT-19) Hierarchial clusterer
[ https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated MAHOUT-19:
------------------------------
Attachment: MAHOUT-19.txt
This works now. It just needs somewhat better tests, and the tree->dot
graphviz code needs to be fixed again before I want to commit anything.
I also want to try out the training optimization strategies I've written
about earlier (find the closest cluster or leaf node first and then the
closest instance), and I have a few small todos.
It uses my quick and dirty PersistentMap, a Map<Writable, Writable> that
keeps all data on disk at all times using RandomAccessFile (local
storage, not DFS), but it will probably be replaced by the BSD-licensed
jdbm.sourceforge.net that Andrzej pointed out.
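
To illustrate the idea of a map that keeps everything on disk, here is a
minimal sketch in Python. It is not the PersistentMap from the patch:
the class name DiskMap, the file layout (a fixed bucket table followed
by appended, chained records) and the bucket count are all assumptions
chosen to keep the example short.

```python
import os
import struct
import tempfile
import zlib


class DiskMap:
    """Hypothetical bytes->bytes map; index and data both live in one file."""

    SLOT = struct.Struct(">q")    # bucket slot: offset of newest record, 0 = empty
    HEAD = struct.Struct(">qii")  # record header: next offset, key len, value len

    def __init__(self, path, buckets=1024):
        self.buckets = buckets  # must match the value used to create the file
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        self.f = open(path, "w+b" if is_new else "r+b")
        if is_new:
            # Pre-allocate the bucket table; every slot starts as 0 (empty).
            self.f.write(b"\x00" * (self.SLOT.size * buckets))
            self.f.flush()

    def _slot_pos(self, key):
        # crc32 is stable across runs, unlike Python's built-in hash().
        return (zlib.crc32(key) % self.buckets) * self.SLOT.size

    def _read_slot(self, key):
        self.f.seek(self._slot_pos(key))
        return self.SLOT.unpack(self.f.read(self.SLOT.size))[0]

    def put(self, key, value):
        head = self._read_slot(key)
        # Append the record and chain it to the previous head of its bucket.
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(self.HEAD.pack(head, len(key), len(value)) + key + value)
        # Point the bucket at the new record, so the newest value wins.
        self.f.seek(self._slot_pos(key))
        self.f.write(self.SLOT.pack(offset))

    def get(self, key, default=None):
        offset = self._read_slot(key)
        while offset:
            self.f.seek(offset)
            nxt, klen, vlen = self.HEAD.unpack(self.f.read(self.HEAD.size))
            if self.f.read(klen) == key:
                return self.f.read(vlen)
            offset = nxt
        return default

    def close(self):
        self.f.close()


# Usage: values survive closing and reopening the file.
path = os.path.join(tempfile.mkdtemp(), "tree.map")
m = DiskMap(path, buckets=4)
m.put(b"node-1", b"old")
m.put(b"node-1", b"new")
m.put(b"node-2", b"other")
m.close()
reopened = DiskMap(path, buckets=4)
print(reopened.get(b"node-1"), reopened.get(b"node-2"))
reopened.close()
```

Nothing but the open file handle is held in memory, at the cost of a
seek per lookup step. Overwritten values stay in the file as dead
records, which a real implementation such as JDBM would reclaim.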
Hierarchial clusterer
---------------------
Key: MAHOUT-19
URL: https://issues.apache.org/jira/browse/MAHOUT-19
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor
Attachments: MAHOUT-19.txt, MAHOUT-19.txt, MAHOUT-19.txt,
TestBottomFeed.test.png, TestTopFeed.test.png
In a hierarchical clusterer the instances are the leaf nodes in a tree
where each branch node contains the mean features of, and the distance
between, its children.
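
The node layout described above can be sketched as follows. This is an
illustration only: it assumes a binary tree, sparse feature vectors
stored as dicts, an unweighted mean of the two children, and Euclidean
distance as a stand-in metric. The names Node, branch, mean and
euclidean are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Vector = Dict[str, float]  # sparse feature vector: feature name -> weight


@dataclass
class Node:
    features: Vector                # a leaf's own features, or the children's mean
    distance: float = 0.0           # distance between the two children (0 for a leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def mean(a: Vector, b: Vector) -> Vector:
    """Unweighted mean of two sparse vectors (an assumption, see above)."""
    return {k: (a.get(k, 0.0) + b.get(k, 0.0)) / 2 for k in a.keys() | b.keys()}


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2
                         for k in a.keys() | b.keys()))


def branch(left: Node, right: Node,
           dist: Callable[[Vector, Vector], float] = euclidean) -> Node:
    """Build a branch holding the children's mean features and their distance."""
    return Node(mean(left.features, right.features),
                dist(left.features, right.features), left, right)


a = Node({"x": 0.0, "y": 0.0})
b = Node({"x": 3.0, "y": 4.0})
parent = branch(a, b)
print(parent.features, parent.distance)
```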
For performance reasons I have always trained trees top->down. I
have been told that this can cause various effects I never encountered.
And I believe Huffman solved his problem by training bottom->up? The
thing is, I don't think it is possible to train the tree top->down using
map reduce. I do however think it is possible to train it bottom->up. I
would very much appreciate any thoughts on this.
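
For reference, bottom->up training in its most naive, single-machine
form looks like the sketch below: repeatedly merge the two closest
nodes until one root remains. It says nothing about how to split the
work into map reduce jobs. The 1-D instances, the unweighted mean and
the function name train_bottom_up are assumptions made to keep it short.

```python
from itertools import combinations


def train_bottom_up(points):
    """points: name -> 1-D value (at least one entry).

    Returns the tree as nested (distance, left, right) tuples; a leaf is
    just the instance name.
    """
    # Each active node is (subtree, mean); start with one leaf per instance.
    nodes = [(name, value) for name, value in points.items()]
    while len(nodes) > 1:
        # Find the pair of active nodes whose means are closest.
        i, j = min(combinations(range(len(nodes)), 2),
                   key=lambda p: abs(nodes[p[0]][1] - nodes[p[1]][1]))
        (left, left_mean), (right, right_mean) = nodes[i], nodes[j]
        # The new branch stores the distance between its children and
        # carries their (unweighted) mean for later comparisons.
        merged = ((abs(left_mean - right_mean), left, right),
                  (left_mean + right_mean) / 2)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


tree = train_bottom_up({"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0})
print(tree)
```

The closest-pair search dominates the cost here, since every round
compares all remaining pairs.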
Once this tree is trained one can extract clusters in various ways.
The mean distance between all instances is usually a good maximum
distance to allow between nodes when navigating the tree in search of a
cluster.
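
One possible reading of this, as a sketch: walk down from the root and
stop at the first branch whose children are no further apart than the
maximum distance; the leaves below it form a cluster. The tree format
(nested (distance, left, right) tuples with instance names as leaves),
the hand-built example tree and the 1-D instances are assumptions for
illustration, not the structure used in the patch.

```python
from itertools import combinations
from statistics import mean


def leaves(node):
    """All instance names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def clusters(node, max_distance):
    """Cut the tree wherever the children are further apart than max_distance."""
    if isinstance(node, str):
        return [[node]]
    distance, left, right = node
    if distance <= max_distance:
        return [leaves(node)]
    return clusters(left, max_distance) + clusters(right, max_distance)


# Hypothetical 1-D instances and a hand-built tree over them.
points = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}
tree = (10.0, (1.0, "a", "b"), (1.0, "c", "d"))

# Mean distance between all pairs of instances, used as the maximum distance.
threshold = mean(abs(points[x] - points[y]) for x, y in combinations(points, 2))
print(threshold, clusters(tree, threshold))
```

Since only the threshold changes between calls, the same tree can be
cut again with another value without retraining.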
Navigating the tree and gathering nodes that are not too far away from
each other is usually instant if the tree is available in memory or
persisted in a smart way. In my experience there is not much to win from
extracting all clusters from the start. Also, it usually makes sense to
allow the user to modify the cluster boundary variables in real time
using a slider, or perhaps to present the named summary of neighbouring
clusters, blacklist paths in the tree, etc. It is also not too bad to use
secondary classification on the instances to create worm holes in the
tree. I always thought it would be cool to visualize it using
Touchgraph.
My focus is on clustering text documents for an instant "more like
this" feature in search engines, using Tanimoto similarity on the
vector spaces to calculate the distance.
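
For readers unfamiliar with it, the Tanimoto similarity of two vectors
is a.b / (|a|^2 + |b|^2 - a.b), and one minus that is commonly used as
the distance. A minimal sketch over sparse term vectors follows; the
dict representation, the function names and the handling of two empty
vectors are assumptions.

```python
def tanimoto_similarity(a, b):
    """a, b: sparse vectors as dicts mapping term -> weight."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sum(w * w for w in a.values())
    norm_b = sum(w * w for w in b.values())
    denominator = norm_a + norm_b - dot
    # Assumption: two all-zero vectors are treated as identical.
    return dot / denominator if denominator else 1.0


def tanimoto_distance(a, b):
    return 1.0 - tanimoto_similarity(a, b)


doc1 = {"hadoop": 1.0, "tree": 1.0}
doc2 = {"hadoop": 1.0}
doc3 = {"lucene": 1.0}
print(tanimoto_distance(doc1, doc2))  # shares one of two terms
print(tanimoto_distance(doc1, doc3))  # no shared terms
```

With binary term weights this reduces to the Jaccard coefficient:
shared terms divided by the total number of distinct terms.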
See LUCENE-1025 for a single-threaded, all-in-memory proof of concept
of a hierarchical clusterer.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.