[ 
https://issues.apache.org/jira/browse/MAHOUT-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paritosh Ranjan updated MAHOUT-825:
-----------------------------------

    Attachment: canopy-outlier-elimination

I have added the patch canopy-outlier-elimination. This patch uses the flag 
clusterStrictness both for outlier elimination while calculating centroids and 
for outlier elimination from the canopy. The previous boolean variable has been 
changed to a double, as we needed a parameter to control the quality of the 
cluster. Outlier elimination (both during centroid calculation and from the 
cluster) is switched off by default.

Both steps, i.e. using the radius instead of t1 and eliminating outliers while 
calculating the centroid, have increased the quality of the result. Many points 
which were not being clustered earlier (while using t1) are being clustered 
now. The quality is also controllable by tuning the value of clusterStrictness.

Using a percentile to reject 10-30% of outliers looks like a good option, but 
it is a post-processing step which will impact performance. I think the 
same functionality is achievable using a higher clusterStrictness value.
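
For comparison, a percentile-based rejection could look roughly like the 
sketch below. It is only an illustration and not part of the patch: the class 
PercentileOutlierSketch, the method rejectFarthest and the plain double[] 
points are hypothetical stand-ins for Mahout's own vector and distance 
classes, and Euclidean distance is assumed. It needs every distance to the 
centroid before it can pick a cutoff, so it has to run as an extra pass once 
the cluster is complete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Mahout code: percentile-based outlier rejection.
final class PercentileOutlierSketch {

  /** Euclidean distance between two equal-length vectors (assumed measure). */
  public static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Drops the farthest rejectFraction of the points (0.2 means the farthest
   * 20%) and returns the rest. Points tied with the cutoff distance are kept,
   * so slightly more than the nominal share can survive.
   */
  public static List<double[]> rejectFarthest(
      List<double[]> points, double[] centroid, double rejectFraction) {
    int n = points.size();
    double[] sorted = new double[n];
    for (int i = 0; i < n; i++) {
      sorted[i] = distance(points.get(i), centroid);
    }
    // The sort over all distances is the extra post-processing cost.
    Arrays.sort(sorted);

    int keep = n - (int) Math.floor(rejectFraction * n);
    List<double[]> kept = new ArrayList<>();
    if (keep <= 0) {
      return kept;
    }
    double cutoff = sorted[keep - 1];
    for (double[] p : points) {
      if (distance(p, centroid) <= cutoff) {
        kept.add(p);
      }
    }
    return kept;
  }
}
```

Calling rejectFarthest(points, centroid, 0.25) on four points would drop the 
single farthest one, provided it is not tied with another point.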
                
> Canopies grouping records outside t1
> ------------------------------------
>
>                 Key: MAHOUT-825
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-825
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>         Environment: windows, linux
>            Reporter: Paritosh Ranjan
>              Labels: features, newbie, patch
>             Fix For: 0.6
>
>         Attachments: Clustering Remote Points - Two Big, Useless 
> Clusters.txt, Not Clustering Remote Points - Two Meaningful Clusters.txt, 
> canopy-clusterFilter-t1, canopy-outlier-elimination, 
> canopy-outside-t1-points-patch-1, canopy-strict-clustering-flag
>
>
> While finding the closest canopy, there is no check to ensure that it returns 
> canopies which are within distance t1 from the point. This results in 
> incorrect results, i.e. points outside t1 are grouped in canopies.
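
The missing check described in the issue can be sketched as follows. This is 
a minimal illustration under stated assumptions, not the code from any of the 
attached patches: ClosestCanopySketch, closestWithinT1 and NO_CANOPY are 
hypothetical names, canopy centers are plain double[] arrays rather than 
Mahout vectors, and Euclidean distance is assumed.

```java
import java.util.List;

// Hypothetical sketch, not Mahout code: nearest canopy lookup with a t1 check.
final class ClosestCanopySketch {

  /** Returned when no canopy center lies within t1 of the point. */
  public static final int NO_CANOPY = -1;

  /** Euclidean distance between two equal-length vectors (assumed measure). */
  public static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Returns the index of the nearest canopy center, or NO_CANOPY when even
   * the nearest center is farther than t1 from the point.
   */
  public static int closestWithinT1(List<double[]> centers, double[] point, double t1) {
    int best = NO_CANOPY;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < centers.size(); i++) {
      double d = distance(centers.get(i), point);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    // Without this comparison the nearest canopy is returned unconditionally,
    // which is how points outside t1 end up grouped in canopies.
    return bestDistance <= t1 ? best : NO_CANOPY;
  }
}
```

With this check, a point that is farther than t1 from every center stays 
unassigned instead of being attached to whichever canopy is nearest.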

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
