[
https://issues.apache.org/jira/browse/MAHOUT-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147884#comment-13147884
]
Paritosh Ranjan edited comment on MAHOUT-843 at 11/10/11 6:25 PM:
------------------------------------------------------------------
I have added the JUnit test you suggested. The output processor is running
properly, as the JUnit test also shows.
Regarding the bottom-level clustering through Canopy clustering: using the JUnit
test, I found that CanopyDriver reopens a SequenceFile.Writer on the clustered
files. Since SequenceFile.Writer does not support appending data after the
Writer is reopened, the existing data gets overwritten there.
This overwriting issue is present only in the sequential version of
clusterData. The clusterDataMR method is not affected; I ran it through the
Java API on a Hadoop cluster and it worked fine.
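To illustrate the overwrite pattern described above, here is a minimal sketch. It is an assumption-laden stand-in: it uses plain java.nio.file writers rather than Hadoop's SequenceFile.Writer, and the class and method names (ReopenWriterDemo, writeReopeningEachTime, writeWithSingleWriter, tempFile) are hypothetical, not taken from the Mahout patch. It only shows the general effect of reopening a non-appending writer on the same file for each batch, and one possible way around it (keeping a single writer open).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Stand-in demo: plain java.nio files, NOT Hadoop's SequenceFile.Writer.
final class ReopenWriterDemo {

  // Creates a scratch file that is deleted when the JVM exits.
  static Path tempFile() {
    try {
      Path p = Files.createTempFile("clustered-points", ".txt");
      p.toFile().deleteOnExit();
      return p;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Problem pattern: a fresh writer is opened on the same path for every batch.
  // Files.write defaults to CREATE + TRUNCATE_EXISTING + WRITE, so each call
  // discards whatever the previous call wrote.
  static List<String> writeReopeningEachTime(Path out, List<List<String>> batches) {
    try {
      for (List<String> batch : batches) {
        Files.write(out, batch, StandardCharsets.UTF_8);
      }
      return Files.readAllLines(out, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // One possible fix: open the writer once and keep it open across all batches.
  static List<String> writeWithSingleWriter(Path out, List<List<String>> batches) {
    try {
      try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
        for (List<String> batch : batches) {
          for (String line : batch) {
            writer.write(line);
            writer.newLine();
          }
        }
      }
      return Files.readAllLines(out, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<List<String>> batches = List.of(List.of("a", "b"), List.of("c"));
    // Only the last batch survives when the writer is reopened per batch.
    System.out.println(writeReopeningEachTime(tempFile(), batches)); // [c]
    // All batches survive when a single writer is kept open.
    System.out.println(writeWithSingleWriter(tempFile(), batches));  // [a, b, c]
  }
}
```

Whether the actual fix in CanopyDriver should keep one writer open or write each cluster to its own file is a design decision this sketch does not settle.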
was (Author: paritoshranjan):
I have added the JUnit test you suggested. The output processor is running
properly, as the JUnit test also shows.
Regarding the bottom-level clustering through Canopy clustering: using the JUnit
test, I found that CanopyDriver reopens a SequenceFile.Writer on the clustered
files. Since SequenceFile.Writer does not support appending data, the existing
data gets overwritten there.
This overwriting issue is present only in the sequential version of
clusterData. The clusterDataMR method is not affected; I ran it through the
Java API on a Hadoop cluster and it worked fine.
> Top Down Clustering
> -------------------
>
> Key: MAHOUT-843
> URL: https://issues.apache.org/jira/browse/MAHOUT-843
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.6
> Reporter: Paritosh Ranjan
> Labels: clustering, patch
> Fix For: 0.6
>
> Attachments: MAHOUT-843-patch, MAHOUT-843-patch-only-postprocessor,
> MAHOUT-843-patch-v1, Top-Down-Clustering-patch
>
>
> Top Down Clustering works in multiple steps. The first step is to find
> comparatively bigger clusters. The second step is to cluster those bigger
> chunks into meaningful clusters. This can improve performance when clustering
> large amounts of data. It also removes the need to provide input
> clusters/numbers to the clustering algorithm.
> "Big" is a relative term, and so is the smaller "meaningful". So the size of
> the "bigger" and the "smaller/meaningful" clusters will be controlled by the
> user.
> The user can also select which clustering algorithm to use at the top level
> and which to use at the bottom level. Initially, this can be supported for
> only one or a few clustering algorithms; later, an option can be provided to
> use all the algorithms (that suit the case).
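
The two-step flow in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions: it clusters 1-D points with a simple greedy distance threshold instead of any Mahout algorithm, and every name in it (TopDownClusteringSketch, groupByThreshold, topDown, the coarse and fine thresholds) is hypothetical. The two thresholds play the role of the user-controlled "bigger" and "smaller/meaningful" cluster sizes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a greedy 1-D threshold grouping, not Mahout's Canopy or k-means.
final class TopDownClusteringSketch {

  // Sorts the points and starts a new cluster whenever a point lies farther
  // than `threshold` from the first point of the current cluster.
  static List<List<Double>> groupByThreshold(List<Double> points, double threshold) {
    List<Double> sorted = new ArrayList<>(points);
    Collections.sort(sorted);
    List<List<Double>> clusters = new ArrayList<>();
    List<Double> current = null;
    for (double p : sorted) {
      if (current == null || p - current.get(0) > threshold) {
        current = new ArrayList<>();
        clusters.add(current);
      }
      current.add(p);
    }
    return clusters;
  }

  // Step 1: find the bigger clusters with the coarse threshold.
  // Step 2: cluster each bigger chunk again with the fine threshold.
  static List<List<List<Double>>> topDown(List<Double> points, double coarse, double fine) {
    List<List<List<Double>>> result = new ArrayList<>();
    for (List<Double> big : groupByThreshold(points, coarse)) {
      result.add(groupByThreshold(big, fine));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Double> points = List.of(1.0, 1.2, 3.0, 3.1, 20.0, 20.5, 24.0);
    // Each outer entry is one top-level cluster holding its bottom-level clusters.
    System.out.println(topDown(points, 5.0, 1.0));
  }
}
```

In a real setup, each of the two steps would be replaced by whichever clustering algorithm the user selects for that level.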