[
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150882#comment-13150882
]
Jeff Eastman commented on MAHOUT-843:
-------------------------------------
I've had some time to poke around in the code and single-step through the test
execution. It seems to work as advertised but I had a difficult time reading
the code and I'm concerned that the implementation has some performance
challenges. Much of this involves personal style so take it with a grain of
salt:
- The code is very fine grain, with many small methods. Usually I'm faced with
the opposite - huge methods that need to be decomposed - but here the smallness
works against readability. Consider distributeVectors() which calls
putVectorsInRespectiveClusters() which calls findClusterAndAddVector() which
calls addVectorToItsClusterFile() which calls writeVectorToCluster(). These
small methods are only called in a single place and could be inlined to provide
more context for how the point is finally written.
- The ClusterOutputPostProcessor has a number of private fields which are
initialized and used within this method chain rather than passing needed values
in as method arguments. If I'm stopped in the debugger as I am now, it is
challenging for me to identify where and how a particular field was initialized
because it is not visible in the stack frames of the call chain.
- WriteVectorToCluster() creates a SequenceFile.Writer for each point, then
appends to it syncs it and closes it. The path for the output file is passed in
two levels. Seems to me that will thrash the GC and I'd consider passing the
writer down instead of the path so you only open/append/sync/close one for each
cluster.
- Minor now, but the run method has one -i argument which is immediately
assigned to clusterOutputToBeProcessed. And the inputWriter is actually writing
output. Seems backwards. I'd suggest adding a -o argument so callers have full
control over where the points come from and where they get written.
- Finally, I'd also suggest adding a -ci argument so that the clusters can be
read in the beginning. This will facilitate setting numReducers later and would
allow you to create all the output writers at the beginning rather than lazily
as they are encountered.
All this aside however, it seems to work and is a good step forward. Once the
mapreduce version gets fleshed out you will likely want to refactor the whole
thing anyway.
> Top Down Clustering
> -------------------
>
> Key: MAHOUT-843
> URL: https://issues.apache.org/jira/browse/MAHOUT-843
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.6
> Reporter: Paritosh Ranjan
> Labels: clustering, patch
> Fix For: 0.6
>
> Attachments: MAHOUT-843-patch, MAHOUT-843-patch-only-postprocessor,
> MAHOUT-843-patch-only-postprocessor-v1,
> MAHOUT-843-patch-only-postprocessor-v2, MAHOUT-843-patch-v1,
> Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find
> comparative bigger clusters. The second step is to cluster the bigger chunks
> into meaningful clusters. This can performance while clustering big amount of
> data. And, it also removes the dependency of providing input clusters/numbers
> to the clustering algorithm.
> The "big" is a relative term, as well as the smaller "meaningful" terms. So,
> the control of this "bigger" and "smaller/meaningful" clusters will be
> controlled by the user.
> Which clustering algorithm to be used in the top level and which to use in
> the bottom level can also be selected by the user. Initially, it can be done
> for only one/few clustering algorithms, and later, option can be provided to
> use all the algorithms ( which suits the case ).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira