[
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122674#comment-13122674
]
Paritosh Ranjan edited comment on MAHOUT-825 at 10/7/11 10:03 AM:
------------------------------------------------------------------
I have added the patch canopy-outlier-elimination. This patch uses the
clusterStrictness parameter both for eliminating outliers while calculating
centroids and for eliminating outliers from the canopy clusters. The previous
boolean variable (clusterStrictly) has been changed to a double, as we needed a
parameter to control the quality of the clusters. Both kinds of outlier
elimination (during centroid calculation and from the cluster) are switched off
by default.
Both steps, i.e. using the radius instead of t1 and eliminating outliers while
calculating the centroid, have increased the quality of the result. Many points
which were not being clustered earlier (while using t1, and even when using
outlier points in the centroid calculation) are being clustered now. The quality
can also be controlled by tuning the value of clusterStrictness.
Using a percentile to reject 10-30% of outliers looks like a good option, but
it is a post-processing step which will impact performance. I think the
same functionality is achievable using a higher clusterStrictness value.
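
To illustrate the idea of leaving outliers out of the centroid calculation, here is a minimal sketch in plain Java. It is not the code from the patch: the class name, the method name and the maxDistance parameter are all hypothetical, and since this comment does not say how clusterStrictness is turned into a distance cutoff, the cutoff is simply passed in directly. Euclidean distance is also an assumption.

```java
import java.util.List;

// Hypothetical sketch, not the patch: average only the points that lie
// within maxDistance of the current canopy center.
final class OutlierFreeCentroidSketch {

    // Euclidean distance (an assumption; any distance measure could be used).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the mean of the points within maxDistance of center.
    // If no point qualifies, a copy of the original center is returned.
    static double[] centroidWithoutOutliers(List<double[]> points,
                                            double[] center,
                                            double maxDistance) {
        double[] sum = new double[center.length];
        int kept = 0;
        for (double[] p : points) {
            if (distance(p, center) <= maxDistance) {
                for (int i = 0; i < sum.length; i++) {
                    sum[i] += p[i];
                }
                kept++;
            }
        }
        if (kept == 0) {
            return center.clone();
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= kept;
        }
        return sum;
    }
}
```

Passing Double.POSITIVE_INFINITY as maxDistance keeps every point, which corresponds to outlier elimination being switched off.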
was (Author: paritoshranjan):
I have added the patch canopy-outlier-elimination. This patch uses the
clusterStrictness parameter both for eliminating outliers while calculating
centroids and for eliminating outliers from the canopy. The previous boolean
variable has been changed to a double, as we needed a parameter to control the
quality of the clusters. Both kinds of outlier elimination (during centroid
calculation and from the cluster) are switched off by default.
Both steps, i.e. using the radius instead of t1 and eliminating outliers while
calculating the centroid, have increased the quality of the result. Many points
which were not being clustered earlier (while using t1) are being clustered
now. The quality can also be controlled by tuning the value of clusterStrictness.
Using a percentile to reject 10-30% of outliers looks like a good option, but
it is a post-processing step which will impact performance. I think the
same functionality is achievable using a higher clusterStrictness value.
> Canopies grouping records outside t1
> ------------------------------------
>
> Key: MAHOUT-825
> URL: https://issues.apache.org/jira/browse/MAHOUT-825
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Environment: windows, linux
> Reporter: Paritosh Ranjan
> Labels: features, newbie, patch
> Fix For: 0.6
>
> Attachments: Clustering Remote Points - Two Big, Useless
> Clusters.txt, Not Clustering Remote Points - Two Meaningful Clusters.txt,
> canopy-clusterFilter-t1, canopy-outlier-elimination,
> canopy-outside-t1-points-patch-1, canopy-strict-clustering-flag
>
>
> While finding the closest canopy, there is no check to ensure that it returns
> only canopies which are within distance t1 of the point. This produces an
> incorrect result, i.e. points outside t1 are grouped into canopies.
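
The missing check described above can be sketched as follows. This is only an illustration in plain Java, not Mahout's code: the class and method names are hypothetical, canopy centers are represented as plain double arrays, and Euclidean distance is assumed. The point is that the nearest canopy is returned only when its distance to the point is at most t1.

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical sketch, not Mahout's implementation: pick the nearest canopy
// center, but only accept it when it lies within t1 of the point.
final class ClosestCanopySketch {

    // Euclidean distance (an assumption; any distance measure could be used).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the closest center, or empty if even the closest
    // center is farther than t1 from the point.
    static OptionalInt closestWithinT1(double[] point,
                                       List<double[]> centers,
                                       double t1) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        // The check the issue reports as missing.
        if (best >= 0 && bestDist <= t1) {
            return OptionalInt.of(best);
        }
        return OptionalInt.empty();
    }
}
```

With centers at (0, 0) and (10, 0) and t1 = 3, the point (5, 0) is 5 away from both centers, so it is left unassigned instead of being forced into a canopy.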