[jira] [Issue Comment Edited] (MAHOUT-843) Top Down Clustering

Paritosh Ranjan (Issue Comment Edited) (JIRA) Thu, 03 Nov 2011 10:09:57 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143296#comment-13143296
 ]


Paritosh Ranjan edited comment on MAHOUT-843 at 11/3/11 5:08 PM:
-----------------------------------------------------------------

This patch implements top-down clustering. The class to run it is 
@TopDownClusteringDriver.

Top-level clustering can be done by any implementation of 
@TopLevelClusterConfig, and bottom-level clustering by any implementation of 
@BottomLevelClusterConfig. Both are marker interfaces.

The idea is to use different implementations of @ClusterConfig to specify the 
parameters of different clustering algorithms. These @ClusterConfig 
implementations are passed as parameters that specify the top-level and the 
bottom-level clustering configuration.
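
To make the config idea concrete, here is a minimal sketch of how the 
interfaces could fit together. It is not the code from the patch: the 
parameters() method, the t1/t2 fields, the describe() helper, and the choice 
to have the marker interfaces extend @ClusterConfig are all assumptions made 
for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shape of the config hierarchy; the patch may differ.
interface ClusterConfig {
    // Assumed accessor: algorithm parameters as name/value pairs.
    Map<String, String> parameters();
}

// Marker interfaces: no methods of their own, they only tag the level
// at which a config may be used.
interface TopLevelClusterConfig extends ClusterConfig {}

interface BottomLevelClusterConfig extends ClusterConfig {}

// A Canopy config usable at both levels. t1 and t2 stand for the two
// canopy distance thresholds.
final class CanopyClusterConfig
        implements TopLevelClusterConfig, BottomLevelClusterConfig {
    private final double t1;
    private final double t2;

    CanopyClusterConfig(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }

    @Override
    public Map<String, String> parameters() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("t1", String.valueOf(t1));
        params.put("t2", String.valueOf(t2));
        return params;
    }
}

final class TopDownClusteringSketch {
    // The marker types let the compiler reject a config that is passed
    // at a level it was not declared for.
    static String describe(TopLevelClusterConfig top,
                           BottomLevelClusterConfig bottom) {
        return "top=" + top.parameters() + ", bottom=" + bottom.parameters();
    }
}
```

With this shape, a config class that implements only one marker interface 
cannot be passed at the other level without a compile error.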

The top-level clustering output is post-processed by 
@TopLevelClusterOutputPostProcessor, which groups the vectors of similar 
clusters together. Each of these groups is then processed further by 
bottom-level clustering.
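
The grouping step can be pictured with a small in-memory sketch. The real 
post processor presumably reads and writes files on the cluster, so the 
ClusteredPoint and PostProcessorSketch classes below are hypothetical 
stand-ins that only show the "collect vectors per cluster id" idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record of one top-level result: a vector plus the id of
// the cluster it was assigned to.
final class ClusteredPoint {
    final int clusterId;
    final double[] vector;

    ClusteredPoint(int clusterId, double[] vector) {
        this.clusterId = clusterId;
        this.vector = vector;
    }
}

final class PostProcessorSketch {
    // Collect the vectors of each top-level cluster into one group.
    // Each group would then be the input of one bottom-level run.
    static Map<Integer, List<double[]>> groupByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (ClusteredPoint p : points) {
            groups.computeIfAbsent(p.clusterId, id -> new ArrayList<>())
                  .add(p.vector);
        }
        return groups;
    }
}
```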

Each implementation of @ClusterConfig has an associated implementation of 
@ClusterExecutor, which uses the config's parameters to execute the 
corresponding algorithm.
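
One way to picture the config/executor pairing is a lookup keyed by config 
class. This is only a sketch under assumptions: the generic ClusterExecutor 
signature, the ExecutorRegistry class, and the execute() return value are 
invented here, and the patch may wire the two together differently.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins so the sketch compiles on its own.
interface ClusterConfig {}

final class CanopyClusterConfig implements ClusterConfig {
    final double t1;
    final double t2;

    CanopyClusterConfig(double t1, double t2) {
        this.t1 = t1;
        this.t2 = t2;
    }
}

// Hypothetical executor contract: run one algorithm from its config.
interface ClusterExecutor<C extends ClusterConfig> {
    String execute(C config, String input, String output);
}

final class CanopyClusterExecutor implements ClusterExecutor<CanopyClusterConfig> {
    @Override
    public String execute(CanopyClusterConfig config, String input, String output) {
        // A real executor would launch the Canopy job here. This one only
        // reports what it would run.
        return "canopy(t1=" + config.t1 + ", t2=" + config.t2 + ") "
                + input + " -> " + output;
    }
}

// Maps each config class to the executor that understands it.
final class ExecutorRegistry {
    private final Map<Class<?>, ClusterExecutor<?>> executors = new HashMap<>();

    <C extends ClusterConfig> void register(Class<C> type, ClusterExecutor<C> executor) {
        executors.put(type, executor);
    }

    @SuppressWarnings("unchecked")
    <C extends ClusterConfig> String run(C config, String input, String output) {
        // Safe as long as entries are only added through register().
        ClusterExecutor<C> executor =
                (ClusterExecutor<C>) executors.get(config.getClass());
        if (executor == null) {
            throw new IllegalArgumentException(
                    "No executor registered for " + config.getClass().getName());
        }
        return executor.execute(config, input, output);
    }
}
```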

The output of top-level clustering is kept in <output path>/topLevelCluster, 
and the output of bottom-level clustering is kept in 
<output path>/bottomLevelCluster.

The post-processed output of top-level clustering is kept in 
<output path>/topLevelCluster/topLevelClusterPostProcessed/clusterId.

Both the top-level and the bottom-level clustering use the clusterId as the 
name of the clusters produced.
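
The directory layout described above can be summarised in a small helper. 
The directory names come from this comment. The TopDownOutputLayout class, 
its method names, and the use of plain strings instead of Hadoop paths are 
assumptions made to keep the sketch self-contained.

```java
// Hypothetical helper that builds the output locations described above.
final class TopDownOutputLayout {
    static final String TOP_LEVEL = "topLevelCluster";
    static final String BOTTOM_LEVEL = "bottomLevelCluster";
    static final String POST_PROCESSED = "topLevelClusterPostProcessed";

    // <output path>/topLevelCluster
    static String topLevel(String outputPath) {
        return outputPath + "/" + TOP_LEVEL;
    }

    // <output path>/bottomLevelCluster
    static String bottomLevel(String outputPath) {
        return outputPath + "/" + BOTTOM_LEVEL;
    }

    // <output path>/topLevelCluster/topLevelClusterPostProcessed/<clusterId>
    static String postProcessed(String outputPath, String clusterId) {
        return topLevel(outputPath) + "/" + POST_PROCESSED + "/" + clusterId;
    }
}
```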

I have added javadocs wherever it seemed necessary, so they should also help 
guide you through the code. I have tested using @CanopyClusterConfig as both 
the top-level and the bottom-level cluster config, and it works. The other 
configs should work out of the box.
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, Top-Down-Clustering-patch
>
>
> Top-down clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster those bigger 
> chunks into meaningful clusters. This can improve performance when 
> clustering large amounts of data. It also removes the need to provide input 
> clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of the 
> "bigger" and the "smaller/meaningful" clusters will be controlled by the 
> user.
> The user can also select which clustering algorithm to use at the top level 
> and which to use at the bottom level. Initially, this can be done for only 
> one or a few clustering algorithms; later, an option can be provided to use 
> all the algorithms that suit the case.
