[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088746#comment-15088746 ]
Rakesh Chalasani commented on SPARK-8540:
-----------------------------------------

I see that this hasn't moved forward, so I am trying to revive it. I will pick this up.

After a quick look at the KMeans API, we have two options:

1. Add this to KMeans/KMeansModel itself (which I don't like after what [~josephkb] said above), or
2. Add KMeansOutlier and KMeansOutlierModel as separate classes. KMeansOutlier can extend KMeans itself, with additional parameters to support (a) and (b) from the issue description. KMeansOutlierModel might have to duplicate some parts of KMeansModel.

For (a), a setThreshold/getThreshold param needs to be added, and it can be implemented using a simple 'where'. For (b), a setNumOutliers/getNumOutliers param needs to be added, and it requires an orderBy followed by a take (or something better?). (a) takes precedence over (b).

Please let me know your thoughts.

> KMeans-based outlier detection
> ------------------------------
>
>                 Key: SPARK-8540
>                 URL: https://issues.apache.org/jira/browse/SPARK-8540
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Proposal for K-Means-based outlier detection:
> * Cluster data using K-Means
> * Provide prediction/filtering functionality which returns outliers/anomalies
> ** This can take some threshold parameter which specifies either (a) how far off a point needs to be to be considered an outlier or (b) how many outliers should be returned.
> Note this will require a bit of API design, which should probably be posted and discussed on this JIRA before implementation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
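
The two selection modes discussed in the comment, (a) a distance threshold and (b) a fixed number of outliers, with (a) taking precedence, can be sketched in plain Python. This is only an illustration of the selection semantics, not Spark code and not a proposed API: the function names `nearest_center_distance` and `select_outliers`, the parameters `threshold` and `num_outliers`, the use of Euclidean distance to the nearest cluster center as the outlier score, and the strict `>` comparison are all assumptions made for the example.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from a point to its closest cluster center."""
    return min(
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
        for center in centers
    )


def select_outliers(points, centers, threshold=None, num_outliers=None):
    """Return outliers by threshold (a) or by count (b); (a) wins if both are set.

    Hypothetical helper: it mirrors the 'where' vs. 'orderBy followed by take'
    idea from the comment on an in-memory list.
    """
    scored = [(nearest_center_distance(p, centers), p) for p in points]

    if threshold is not None:
        # (a): a simple filter, analogous to a 'where' on the distance column.
        return [p for d, p in scored if d > threshold]

    if num_outliers is not None:
        # (b): sort by distance, descending, then take the first N.
        ranked = sorted(scored, key=lambda s: s[0], reverse=True)
        return [p for _, p in ranked[:num_outliers]]

    # Neither parameter set: nothing is flagged in this sketch.
    return []


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0), (0.0, 4.0)]

by_threshold = select_outliers(points, centers, threshold=3.0)
by_count = select_outliers(points, centers, num_outliers=1)
both_set = select_outliers(points, centers, threshold=5.0, num_outliers=3)
```

In this sketch the threshold branch returns points in input order, while the count branch returns them ordered from farthest to nearest. Whether a real implementation should guarantee either ordering is one of the API questions the issue leaves open.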