[jira] [Commented] (MAHOUT-843) Top Down Clustering

Paritosh Ranjan (Commented) (JIRA) Sun, 16 Oct 2011 00:54:42 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128364#comment-13128364
 ]


Paritosh Ranjan commented on MAHOUT-843:
----------------------------------------

After the top-level clustering, the output has the form "clusterid, 
vectorid". The problem is that the bottom-level clustering needs its input 
as a directory of points, so the points belonging to different clusters should 
be in different directories.

This can be done as a post-processing step (after runClustering), or it can 
be done in the MapReduce step if it is already known that this is a top-down 
clustering.

The MapReduce approach will need some changes in all clustering algorithms, but 
it will give better performance. The post-processing approach will not touch 
any clustering algorithm; it will just be an extra step.

To start with, I am beginning with the post-processing step, as this will 
make this patch a completely clean patch that cannot cause any regression.
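
A minimal sketch of what such a post-processing step could look like, to make 
the idea concrete. It is not Mahout code: the class name ClusterOutputSplitter, 
the "cluster-" directory prefix and the "points" file name are all assumptions 
made for illustration, and it works on plain text lines on the local file 
system rather than on the Hadoop files a real job would read and write.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: splits "clusterid, vectorid"
// records into one directory per top-level cluster.
final class ClusterOutputSplitter {

    private ClusterOutputSplitter() {}

    // Groups vector ids by cluster id. Each input line is assumed to look
    // like "clusterid, vectorid"; blank lines are skipped.
    static Map<String, List<String>> groupByCluster(List<String> lines) {
        Map<String, List<String>> byCluster = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException(
                        "Expected 'clusterid, vectorid' but got: " + line);
            }
            byCluster.computeIfAbsent(parts[0].trim(), k -> new ArrayList<>())
                     .add(parts[1].trim());
        }
        return byCluster;
    }

    // Writes one directory per cluster under outputRoot. Each directory holds
    // a single "points" file (assumed name) listing that cluster's vector ids,
    // so it can serve as the input directory for a bottom-level run.
    static void writeClusterDirs(Map<String, List<String>> byCluster, Path outputRoot)
            throws IOException {
        for (Map.Entry<String, List<String>> entry : byCluster.entrySet()) {
            Path dir = Files.createDirectories(
                    outputRoot.resolve("cluster-" + entry.getKey()));
            Files.write(dir.resolve("points"), entry.getValue());
        }
    }
}
```

Calling groupByCluster on the top-level output and passing the result to 
writeClusterDirs leaves one directory per cluster, which keeps the clustering 
algorithms themselves untouched, at the cost of one extra pass over the output.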

Any ideas or suggestions on how to approach this problem?
                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster the bigger 
> chunks into meaningful clusters. This can improve performance when 
> clustering large amounts of data. It also removes the dependency on 
> providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of 
> these "bigger" and "smaller/meaningful" clusters will be controlled by the 
> user.
> Which clustering algorithm to use at the top level and which to use at 
> the bottom level can also be selected by the user. Initially, this can be 
> done for only one or a few clustering algorithms, and later an option can be 
> provided to use all the algorithms (that suit the case).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
