[jira] [Issue Comment Edited] (MAHOUT-843) Top Down Clustering

Jeff Eastman (Issue Comment Edited) (JIRA) Wed, 09 Nov 2011 11:26:13 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147228#comment-13147228
 ]


Jeff Eastman edited comment on MAHOUT-843 at 11/9/11 7:25 PM:
--------------------------------------------------------------

After completing the above, I'd recommend creating a MapReduce version of the 
postprocessor. If you have TBs of vectors to cluster, you will be most unhappy 
with the performance of the sequential version. I have some ideas on how to do 
this:
- The mapper reads the clusteredPoints output and emits each VectorWritable 
keyed by its clusterId.
- The driver needs to set numReducers to the number of clusters present in 
clusteredPoints. You could probably compute this, but it would be easier to 
take it as an argument (-Dmapred.reduce.tasks=k).
- Each reducer will receive all the VectorWritable (VW) points for a single 
cluster and will output a part file with key=clusterId, value={VW points}.
- Subsequent driver code needs to move each part-r-xxx file into its own 
directory so the bottom clustering job can take that as input. Each of these 
will likely be a whopping large file, so make sure it is splittable (I think 
SequenceFiles already are).
- Implement the -xm option on your postprocessor driver so that it can run 
either sequentially or as MapReduce. Both should produce the same results.
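
The data flow in the list above can be sketched in memory as follows. This is 
only an illustration in plain Python, not Hadoop or Mahout code: the records, 
the function names (map_point, partition, reduce_cluster, run_mapreduce, 
run_sequential) and the modulo partitioning are all assumptions made for the 
example. In particular it assumes cluster ids are 0..k-1, so that a modulo 
partition sends exactly one cluster to each reducer; with arbitrary ids a plain 
modulo could put two clusters on the same reducer.

```python
from collections import defaultdict

# Hypothetical stand-in for the clusteredPoints output: (clusterId, vector).
CLUSTERED_POINTS = [
    (0, (1.0, 1.0)),
    (1, (9.0, 8.5)),
    (0, (1.2, 0.8)),
    (2, (5.0, 5.1)),
    (1, (8.7, 9.1)),
]


def map_point(cluster_id, vector):
    """Mapper: emit each vector keyed by its clusterId."""
    yield cluster_id, vector


def partition(cluster_id, num_reducers):
    """Pick a reducer for a key (assumes ids 0..k-1 and num_reducers == k)."""
    return cluster_id % num_reducers


def reduce_cluster(cluster_id, vectors):
    """Reducer: gather all vectors of one cluster into a single record."""
    return cluster_id, list(vectors)


def run_mapreduce(records, num_reducers):
    """Simulate map, shuffle and reduce; returns one dict per 'part file'."""
    # Shuffle: route every mapper output to its reducer and group by key.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for cluster_id, vector in records:
        for key, value in map_point(cluster_id, vector):
            reducers[partition(key, num_reducers)][key].append(value)
    # Each reducer writes its groups as key=clusterId, value={points}.
    return [
        dict(reduce_cluster(key, values) for key, values in sorted(groups.items()))
        for groups in reducers
    ]


def run_sequential(records):
    """Sequential postprocessor: group vectors by clusterId in one pass."""
    grouped = defaultdict(list)
    for cluster_id, vector in records:
        grouped[cluster_id].append(vector)
    return dict(grouped)


if __name__ == "__main__":
    parts = run_mapreduce(CLUSTERED_POINTS, num_reducers=3)
    merged = {key: values for part in parts for key, values in part.items()}
    # Both execution modes should produce the same grouping.
    print(merged == run_sequential(CLUSTERED_POINTS))
```

Merging the per-reducer outputs and comparing them with the sequential 
grouping is one simple way to check that both modes agree.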
                
      was (Author: jeastman):
    After completing the above, I'd recommend creating a mapreduce version of 
the postprocessor. If you have TBs of vectors to cluster you will be most 
unhappy with the performance of the sequential version. I have some ideas on 
how to do this:
- Mapper reads clusteredPoints output and emits each WeightedVectorWritable to 
its clusterId
- Driver needs to set numReducers to be the number of clusters present in the 
clusteredPoints. You could probably compute this but it would be easier to have 
it as an argument (-Dmapred.reduce.tasks=k).
- Each reducer will receive all the WVW points for a single cluster and will 
output a part file with key=clusterId
- Subsequent driver code needs to move each part-r-xxx file into its own 
directory so the bottom clustering job can take that as input. This will likely 
be a whopping large file so make sure it is splittable (I think sequenceFiles 
are already).
- Implement -xm option on your postprocessor driver so that it can run either 
sequentially or mapreduce. Both should produce the same results.
                  
> Top Down Clustering
> -------------------
>
>                 Key: MAHOUT-843
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-843
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>              Labels: clustering, patch
>             Fix For: 0.6
>
>         Attachments: MAHOUT-843-patch, MAHOUT-843-patch-v1, 
> Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find 
> comparatively bigger clusters. The second step is to cluster the bigger 
> chunks into meaningful clusters. This can improve performance while 
> clustering a large amount of data. It also removes the dependency on 
> providing input clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So the size of 
> the "bigger" and "smaller/meaningful" clusters will be controlled by the 
> user.
> The user can also select which clustering algorithm to use at the top level 
> and which to use at the bottom level. Initially, this can be done for only 
> one or a few clustering algorithms; later, an option can be provided to use 
> all the algorithms (that suit the case).
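
A toy sketch of the two-step idea described above, in plain Python rather than 
Mahout. The greedy threshold clustering used at both levels is a stand-in 
chosen only because it needs no cluster count up front; it is not one of the 
algorithms the issue proposes, and the names threshold_cluster, top_down, big_t 
and small_t, as well as the 1-D points, are assumptions made for the example.

```python
def threshold_cluster(points, t):
    """Greedy pass: a point joins the first center within distance t,
    otherwise it starts a new cluster. No cluster count is supplied."""
    centers, clusters = [], []
    for p in points:
        for i, c in enumerate(centers):
            if abs(p - c) <= t:
                clusters[i].append(p)
                break
        else:
            # No existing center is close enough: open a new cluster.
            centers.append(p)
            clusters.append([p])
    return clusters


def top_down(points, big_t, small_t):
    """Step 1: find big clusters with a loose threshold.
    Step 2: re-cluster each big cluster with a tight threshold."""
    return [threshold_cluster(big, small_t)
            for big in threshold_cluster(points, big_t)]


if __name__ == "__main__":
    points = [1.0, 1.2, 3.0, 3.1, 20.0, 20.3, 24.0]
    for sub_clusters in top_down(points, big_t=5.0, small_t=0.5):
        print(sub_clusters)
```

Here big_t and small_t play the role of the user-controlled notions of 
"bigger" and "smaller/meaningful" clusters.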

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
