[ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129033#comment-13129033
 ] 

Jeff Eastman commented on MAHOUT-843:
-------------------------------------

I can't get the patch to do anything. It runs OK but does not add any of the 
files, so I'm left reading the original patch file in a browser, which is not 
ideal. 

What I get from looking at the patch is you are building a ClusterConfig for 
the top level clustering step and also for the bottom level clustering step. 
These capture the various parameters for each clustering algorithm. Then, the 
driver gets an executor from each config that knows how to invoke the top and 
bottom clustering steps. On the surface, that seems to be a workable approach.
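
To check that reading, here is a minimal sketch of the structure as I understand 
it. Only the name ClusterConfig comes from the patch; ClusterExecutor, 
TopDownDriver, and every method name are hypothetical, and the executor is a 
stub that records the step rather than launching a real clustering job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: one clustering step, input path in, output path out.
interface ClusterExecutor {
    String run(String inputPath);
}

// Captures the algorithm name and its parameters for one level (top or bottom).
class ClusterConfig {
    private final String algorithm;
    private final Map<String, String> params;

    ClusterConfig(String algorithm, Map<String, String> params) {
        this.algorithm = algorithm;
        this.params = new LinkedHashMap<>(params);
    }

    String param(String name) {
        return params.get(name);
    }

    // Each config hands back an executor that knows how to invoke its step.
    // Assumption: a real executor would start the clustering job here; this
    // stub only appends the algorithm name to the path.
    ClusterExecutor executor() {
        return inputPath -> inputPath + "/" + algorithm;
    }
}

// Hypothetical driver: runs the top step, then feeds its output to the bottom step.
class TopDownDriver {
    static String run(ClusterConfig top, ClusterConfig bottom, String input) {
        String topOutput = top.executor().run(input);
        return bottom.executor().run(topOutput);
    }
}
```

With a "canopy" top config and a "kmeans" bottom config, the stub driver turns 
"input" into "input/canopy/kmeans", which is all the sketch is meant to show.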

All of this is pure Java, though, and there is no CLI. This seems like the 
really challenging part, as each of the clustering configs will need a complete 
set of CLI arguments (e.g. /bin/mahout topdownclustering <top configs> <bottom 
configs>). For any given combination of top/bottom configs you are going to 
need a different CLI argument list. Since top and bottom may be the same 
algorithm but with different parameters (e.g. t1-top, t1-bottom), or different 
algorithms with overlapping argument names (e.g. dm-top, dm-bottom), I can't 
think of a good way to approach this, can you?
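
To make the naming problem concrete, here is a rough sketch of the suffix idea: 
every option carries a -top or -bottom suffix and is routed into a per-level 
map. LevelArgs and its option syntax are hypothetical, not part of the patch or 
of the existing mahout script, and this does nothing about each algorithm 
combination still needing its own set of valid options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits "--name-top value" / "--name-bottom value" pairs
// into one option map per clustering level.
class LevelArgs {
    final Map<String, String> top = new LinkedHashMap<>();
    final Map<String, String> bottom = new LinkedHashMap<>();

    static LevelArgs parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("options must come as name/value pairs");
        }
        LevelArgs result = new LevelArgs();
        for (int i = 0; i < args.length; i += 2) {
            String name = args[i];
            String value = args[i + 1];
            if (!name.startsWith("--")) {
                throw new IllegalArgumentException("option must start with --: " + name);
            }
            String key = name.substring(2);
            if (key.endsWith("-top")) {
                // Strip the suffix so both levels can share the plain name, e.g. "t1".
                result.top.put(key.substring(0, key.length() - "-top".length()), value);
            } else if (key.endsWith("-bottom")) {
                result.bottom.put(key.substring(0, key.length() - "-bottom".length()), value);
            } else {
                throw new IllegalArgumentException("option needs a -top or -bottom suffix: " + name);
            }
        }
        return result;
    }
}
```

Parsing "--t1-top 3.0 --t1-bottom 1.0" this way yields t1=3.0 for the top level 
and t1=1.0 for the bottom level, so the same argument name can be reused, but 
validating each map against its algorithm is still left open.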

Isn't this something that could also be done with a set of shell scripts?

                
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster the bigger 
> chunks into meaningful clusters. This can improve performance when 
> clustering large amounts of data. It also removes the dependency on 
> providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of 
> these "bigger" and "smaller/meaningful" clusters will be controlled by the 
> user.
> Which clustering algorithm to use at the top level and which to use at the 
> bottom level can also be selected by the user. Initially, this can be done 
> for only one or a few clustering algorithms; later, an option can be 
> provided to use all the algorithms (that suit the case). 


        
