[ 
https://issues.apache.org/jira/browse/MAHOUT-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated MAHOUT-19:
------------------------------

    Attachment: TestBottomFeed.test.png
                TestTopFeed.test.png
                MAHOUT-19.txt

Implementation with non-distributed top feed and bottom feed. The latter is 
used by the map-reduced tree feed when the tree is too young for distribution 
to make sense. Top feed is mostly a proof of concept to show the difference.

Patch also contains MAHOUT-20 and MAHOUT-36.

Attached is also the graphviz output from the generated test data I've placed 
in test/resources.

There are a couple of changes and additions to the core that should be 
extracted into their own patches:

 * build.xml creates a job file like nutch does.
 * Mht - reads ARFFish text files into a Matrix and maps ordinal values to 
doubles. 
 * Vector#fill(double)
 * Vector#fill(double, int, int)
 * VectorStringUtils - converts a vector to a string and vice versa
 * DoubleWritable - see HADOOP-3243
 * SparseVectorWritable - accepts any vector but resolves as SparseVector
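To illustrate the intended semantics of the Vector#fill additions listed above, here is a minimal sketch modeled on a plain double[]; the class and method names mirror the description but are hypothetical, not the patch's actual code:

```java
import java.util.Arrays;

// Hypothetical sketch of Vector#fill(double) and Vector#fill(double, int, int),
// assuming the obvious "set all / set a range" semantics.
public class VectorFillSketch {
    private final double[] values;

    public VectorFillSketch(int cardinality) {
        this.values = new double[cardinality];
    }

    /** Set every element to the given value. */
    public void fill(double value) {
        Arrays.fill(values, value);
    }

    /** Set elements in the range [from, to) to the given value. */
    public void fill(double value, int from, int to) {
        Arrays.fill(values, from, to, value);
    }

    public double get(int index) {
        return values[index];
    }
}
```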


I simulate the Hadoop framework to test DistributedBottomFeed, as all driver 
code is inside the same class and will not run for obvious reasons. I'll 
refactor and get it running in Hadoop for real soon.

The tree is an abstract relation storage that needs lots of refactoring.

From Package.html:

h3. Feeding a tree with instances

Every node in the tree is either a branch with two children, a leaf (instance), 
or the root.

{noformat}
      root
      2.75
     /   \
  leaf1  branch1
   4.0     1.5
          /   \
       leaf2  leaf3
        1.0    2.0
{noformat}

A new instance is inserted in the tree as the child of a new branch next to the 
closest existing node.

Insert new instance 1.2:

{noformat}
      root
      2.775
     /   \
  leaf1  branch1
   4.0     1.55
          /   \
      branch2 leaf3
         1.1   2.0
        /   \
     leaf2 leaf4
      1.0   1.2
{noformat}

<p>
  Adding an instance requires sequentially recalculating the mean value (vector 
space) of all the nodes from the new leaf
  to the root. This is a <b>non-distributable operation</b>. Finding where to 
insert the new leaf, however, is distributable.
  Traditionally this is done by top or bottom feeding: find the closest leaf 
node by navigating from the root
  towards the closest child until a leaf is reached, or by comparing against 
all leaves and navigating from there to the
  closest node.
</p>
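The sequential recalculation step can be sketched as a walk from the new branch up to the root. This is an illustrative simplification with hypothetical names (single scalar values stand in for vector-space means), matching the worked example above, not the patch's actual classes:

```java
// Sketch of the non-distributable part of an insert: after a new leaf is
// attached under a new branch, recompute the mean of every node on the
// path from that branch to the root.
public class MeanUpdateSketch {
    static class Node {
        double mean;
        Node parent, left, right;
        Node(double mean) { this.mean = mean; }
    }

    /** Walk from the given node to the root, recomputing each mean from its children. */
    static void updateMeansToRoot(Node start) {
        for (Node n = start; n != null; n = n.parent) {
            if (n.left != null && n.right != null) {
                n.mean = (n.left.mean + n.right.mean) / 2.0;
            }
        }
    }
}
```

Run against the example tree above (inserting 1.2 next to leaf2), this recomputes branch2 to 1.1, branch1 to 1.55, and the root to 2.775.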

<p>
  The first distributed implementation assumes you have an unlimited number 
of Hadoop nodes and calculates the distance
  to all nodes: root, branches and leaves, in one large job in search of the 
closest one.
</p>
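The core of that one large job reduces to a minimum-distance scan over every node's mean. A non-distributed sketch of that search, with hypothetical names and scalar means standing in for vectors (the real job would shard the node list across mappers):

```java
// Find the node whose mean is closest to the new instance by scanning
// all nodes (root, branches and leaves). In the distributed version each
// mapper would scan a shard and the reducer would keep the global minimum.
public class ClosestNodeSketch {
    /** Returns the index of the node mean closest to the instance value. */
    static int closest(double[] nodeMeans, double instance) {
        int best = 0;
        double bestDistance = Math.abs(nodeMeans[0] - instance);
        for (int i = 1; i < nodeMeans.length; i++) {
            double d = Math.abs(nodeMeans[i] - instance);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```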

<p>
  A second implementation will spawn multiple jobs before it knows where to 
insert an instance:
  <ul>
    <li>find the closest leaf node</li>
    <li>find the closest node on the path between that leaf and the root</li>
  </ul>
</p>

<p>
  Adding instances to a young tree does not make sense to distribute! The first 
couple of hundred (or so, depending on
  vector size) instances could be bottom fed non-distributed. (By the way, what 
is a good antonym for distributed?)
</p>

h3. Clustering

<p>
  Clustering means navigating the tree from a given node using some set of 
rules. Usually this means looking at the
  distance between the children of a node, or to the parent of a node. The 
default max distance computations are based on
  the mean distance between leaves that are siblings of other leaves, or the 
mean distance to all leaf node siblings. I'm
  certain there are much better solutions based on the distance to the root, 
combined in various ways with the previously
  mentioned values. Or find the value using classes in training data? What 
algorithm could that be?
</p>
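A minimal sketch of this kind of cluster extraction, assuming each branch stores the distance between its children and that a cluster is the set of leaves reachable without crossing an edge whose distance exceeds a threshold (names are hypothetical; the patch stores these relations in its own tree structure):

```java
import java.util.ArrayList;
import java.util.List;

// Gather all instances (leaves) under a starting node without descending
// through any branch whose child distance exceeds maxDistance.
public class ClusterSketch {
    static class Node {
        double distance;      // distance between this branch's children
        Node left, right;
        Integer instanceId;   // non-null only for leaves
        Node(double distance, Node left, Node right) {
            this.distance = distance;
            this.left = left;
            this.right = right;
        }
        Node(int instanceId) { this.instanceId = instanceId; }
    }

    static List<Integer> cluster(Node start, double maxDistance) {
        List<Integer> out = new ArrayList<>();
        collect(start, maxDistance, out);
        return out;
    }

    private static void collect(Node n, double max, List<Integer> out) {
        if (n.instanceId != null) {  // leaf: part of the cluster
            out.add(n.instanceId);
        } else if (n.distance <= max) {  // branch close enough to descend
            collect(n.left, max, out);
            collect(n.right, max, out);
        }
    }
}
```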

<p>
  If the rules for what to include in a cluster are set very sensitively, one 
could train a classifier for each cluster to allow
  neighbouring clusters to join with the classifying cluster, and perhaps even 
connect clusters that seem to contain
  the same class even when the clusters are far away from each other. But that 
has to wait until we have classifiers..
</p>

h3. Persistency

<p>
  The tree is an abstract relational object storage for distances, vectors, 
clusters, etc., accessed via the primary keys
  of the nodes. Only a HashMap implementation is available.
</p>
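A minimal sketch of what such a HashMap-backed, primary-key-addressed store might look like (a hypothetical interface for illustration only; the patch's actual API will differ):

```java
import java.util.HashMap;
import java.util.Map;

// Node attributes (here just vectors, but distances and cluster membership
// would follow the same pattern) looked up by a node's primary key.
public class NodeStoreSketch {
    private final Map<Long, double[]> vectors = new HashMap<>();
    private long nextKey = 0;

    /** Store a vector and return the primary key assigned to it. */
    public long put(double[] vector) {
        long key = nextKey++;
        vectors.put(key, vector);
        return key;
    }

    /** Returns the stored vector, or null if the key is unknown. */
    public double[] get(long key) {
        return vectors.get(key);
    }
}
```

Swapping the HashMap for an on-disk or distributed map behind the same key-based interface is what would make the tree persistent.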

> Hierarchial clusterer
> ---------------------
>
>                 Key: MAHOUT-19
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-19
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-19.txt, TestBottomFeed.test.png, 
> TestTopFeed.test.png
>
>
> In a hierarchical clusterer the instances are the leaf nodes in a tree where 
> branch nodes contain the mean features of, and the distance between, their 
> children.
> For performance reasons I always trained trees from the top->down. I have 
> been told that it can cause various effects I never encountered. And I 
> believe Huffman solved his problem by training bottom->up? The thing is, I 
> don't think it is possible to train the tree top->down using map reduce. I do 
> however think it is possible to train it bottom->up. I would very much 
> appreciate any thoughts on this.
> Once this tree is trained one can extract clusters in various ways. The mean 
> distance between all instances is usually a good maximum distance to allow 
> between nodes when navigating the tree in search for a cluster. 
> Navigating the tree and gathering nodes that are not too far away from each 
> other is usually instant if the tree is available in memory or persisted in a 
> smart way. In my experience there is not much to gain from extracting all 
> clusters from the start. Also, it usually makes sense to allow the user to 
> modify the cluster boundary variables in real time using a slider, or perhaps 
> present the named summary of neighbouring clusters, blacklist paths in the 
> tree, etc. It is also not too bad to use secondary classification on the 
> instances to create worm holes in the tree. I always thought it would be cool 
> to visualize it using Touchgraph.
> My focus is on clustering text documents for an instant "more like this" 
> feature in search engines, using Tanimoto similarity on the vector spaces to 
> calculate the distance.
> See LUCENE-1025 for a single threaded, all in memory proof of concept of a 
> hierarchical clusterer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
