[
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147228#comment-13147228
]
Jeff Eastman edited comment on MAHOUT-843 at 11/9/11 7:25 PM:
--------------------------------------------------------------
After completing the above, I'd recommend creating a mapreduce version of the
postprocessor. If you have TBs of vectors to cluster you will be most unhappy
with the performance of the sequential version. I have some ideas on how to do
this:
- Mapper reads clusteredPoints output and emits each VectorWritable keyed by
its clusterId
- Driver needs to set numReducers to be the number of clusters present in the
clusteredPoints. You could probably compute this but it would be easier to have
it as an argument (-Dmapred.reduce.tasks=k).
- Each reducer will receive all the VW points for a single cluster and will
output a part file with key=clusterId, value={VW points}
- Subsequent driver code needs to move each part-r-xxx file into its own
directory so the bottom clustering job can take that as input. This will likely
be a whopping large file so make sure it is splittable (I think SequenceFiles
already are).
- Implement -xm option on your postprocessor driver so that it can run either
sequentially or mapreduce. Both should produce the same results.
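
The steps above can be sketched in plain Java, with no Hadoop dependency, just to show the data flow. This is a minimal sketch under stated assumptions, not the patch's implementation: the class PostProcessorSketch, the ClusteredPoint type, the helper names and the per-cluster directory layout are all hypothetical, and a double[] stands in for the VectorWritable. It also assumes that each distinct clusterId gets its own reducer; with Hadoop's default hash partitioning, setting the number of reducers to k does not by itself guarantee that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class PostProcessorSketch {

    /** Hypothetical stand-in for one clusteredPoints record. */
    static final class ClusteredPoint {
        final int clusterId;
        final double[] vector;

        ClusteredPoint(int clusterId, double[] vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    /** Map + shuffle: key every vector by its clusterId and group by key. */
    static Map<Integer, List<double[]>> groupByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<double[]>> grouped = new TreeMap<>();
        for (ClusteredPoint p : points) {
            grouped.computeIfAbsent(p.clusterId, id -> new ArrayList<>()).add(p.vector);
        }
        return grouped;
    }

    /** Assumed partitioning: each distinct clusterId gets its own reducer index 0..k-1. */
    static Map<Integer, Integer> assignReducers(Map<Integer, List<double[]>> grouped) {
        Map<Integer, Integer> reducerOf = new LinkedHashMap<>();
        for (Integer clusterId : grouped.keySet()) {
            reducerOf.put(clusterId, reducerOf.size());
        }
        return reducerOf;
    }

    /** Post-job step: plan the move of each part file into a per-cluster directory. */
    static Map<String, String> planMoves(String outputDir, Map<Integer, Integer> reducerOf) {
        Map<String, String> moves = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : reducerOf.entrySet()) {
            String part = String.format(Locale.ROOT, "part-r-%05d", e.getValue());
            moves.put(outputDir + "/" + part, outputDir + "/" + e.getKey() + "/" + part);
        }
        return moves;
    }

    public static void main(String[] args) {
        List<ClusteredPoint> points = new ArrayList<>();
        points.add(new ClusteredPoint(7, new double[] {1.0, 2.0}));
        points.add(new ClusteredPoint(3, new double[] {0.5, 0.1}));
        points.add(new ClusteredPoint(7, new double[] {1.1, 2.1}));

        Map<Integer, List<double[]>> grouped = groupByCluster(points);
        Map<Integer, Integer> reducerOf = assignReducers(grouped);
        planMoves("output", reducerOf)
            .forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

In a real job the grouping would be done by the shuffle and the moves by the driver after the job completes; the sketch only makes explicit that the number of part files equals the number of distinct clusterIds, and that the driver needs the clusterId-to-reducer mapping to name the directories.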
was (Author: jeastman):
After completing the above, I'd recommend creating a mapreduce version of
the postprocessor. If you have TBs of vectors to cluster you will be most
unhappy with the performance of the sequential version. I have some ideas on
how to do this:
- Mapper reads clusteredPoints output and emits each WeightedVectorWritable to
its clusterId
- Driver needs to set numReducers to be the number of clusters present in the
clusteredPoints. You could probably compute this but it would be easier to have
it as an argument (-Dmapred.reduce.tasks=k).
- Each reducer will receive all the WVW points for a single cluster and will
output a part file with key=clusterId
- Subsequent driver code needs to move each part-r-xxx file into its own
directory so the bottom clustering job can take that as input. This will likely
be a whopping large file so make sure it is splittable (I think sequenceFiles
are already).
- Implement -xm option on your postprocessor driver so that it can run either
sequentially or mapreduce. Both should produce the same results.
> Top Down Clustering
> -------------------
>
> Key: MAHOUT-843
> URL: https://issues.apache.org/jira/browse/MAHOUT-843
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.6
> Reporter: Paritosh Ranjan
> Labels: clustering, patch
> Fix For: 0.6
>
> Attachments: MAHOUT-843-patch, MAHOUT-843-patch-v1,
> Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find
> comparatively bigger clusters. The second step is to cluster those bigger
> chunks into meaningful clusters. This can improve performance when clustering
> a large amount of data. It also removes the dependency on providing input
> clusters/numbers to the clustering algorithm.
> "Big" is a relative term, as is the smaller "meaningful". So, the size of
> these "bigger" and "smaller/meaningful" clusters will be controlled by the
> user.
> Which clustering algorithm to use at the top level and which to use at
> the bottom level can also be selected by the user. Initially, this can be
> done for only one or a few clustering algorithms; later, an option can be
> provided to use all the algorithms (that suit the case).