[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098516#comment-15098516 ]

Rakesh Chalasani commented on SPARK-8540:
-----------------------------------------

[~josephkb] is this JIRA still of interest?

> KMeans-based outlier detection
> ------------------------------
>
> Key: SPARK-8540
> URL: https://issues.apache.org/jira/browse/SPARK-8540
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Proposal for K-Means-based outlier detection:
> * Cluster data using K-Means
> * Provide prediction/filtering functionality which returns outliers/anomalies
> ** This can take some threshold parameter which specifies either (a) how far off a point needs to be to be considered an outlier or (b) how many outliers should be returned.
>
> Note this will require a bit of API design, which should probably be posted and discussed on this JIRA before implementation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088746#comment-15088746 ]

Rakesh Chalasani commented on SPARK-8540:
-----------------------------------------

I see that this hasn't moved forward, so I am trying to revive it and will pick it up. After a quick look at the KMeans API, I see two options:

1. Add this to KMeans/KMeansModel itself (which I don't like, given what [~josephkb] said earlier), or
2. Add KMeansOutlier and KMeansOutlierModel as separate classes. KMeansOutlier can extend KMeans itself, with additional parameters to support (a) and (b) from the description. KMeansOutlierModel might have to duplicate some parts of KMeansModel.

For (a), a setThreshold/getThreshold param needs to be added, and it can be implemented with a simple 'where'. For (b), a setNumOutliers/getNumOutliers param needs to be added, and it requires an orderBy followed by a take (or something better?). If both are set, (a) takes precedence over (b).

Please let me know your thoughts.
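The selection rule proposed in the comment above can be sketched in plain Python. This is only an illustration of the semantics, not Spark code: `select_outliers`, its arguments, and the `(id, distance)` row layout are hypothetical names chosen for this example, and the list comprehension and `sorted(...)[:n]` stand in for the 'where' and the orderBy-plus-take steps.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout: (point id, distance to its assigned cluster center).
Scored = Tuple[str, float]


def select_outliers(
    scored: List[Scored],
    threshold: Optional[float] = None,
    num_outliers: Optional[int] = None,
) -> List[Scored]:
    """Pick outliers by distance threshold (a) or by count (b)."""
    if threshold is not None:
        # Option (a): a plain filter, the analogue of a 'where' clause.
        # It is checked first, so it takes precedence when both are set.
        return [row for row in scored if row[1] > threshold]
    if num_outliers is not None:
        # Option (b): sort by distance, descending, then keep the first n,
        # the analogue of orderBy followed by take.
        ranked = sorted(scored, key=lambda row: row[1], reverse=True)
        return ranked[:num_outliers]
    raise ValueError("set either threshold or num_outliers")


rows = [("a", 0.2), ("b", 4.0), ("c", 1.5), ("d", 9.1)]
by_threshold = select_outliers(rows, threshold=3.0)
by_count = select_outliers(rows, num_outliers=1)
both_set = select_outliers(rows, threshold=3.0, num_outliers=1)
```

With both parameters set, the threshold branch decides the result, which is one way to read "(a) takes precedence over (b)".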
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633601#comment-14633601 ]

Rakesh Chalasani commented on SPARK-8540:
-----------------------------------------

I think coupling an algorithm to a specific use case, in this case KMeans to anomaly detection, might not be a good idea. Why not just return the distances to the KMeans centers, so that the user can write a simple operation over that column to get the anomalies? If we return distances, finding anomalies takes just one more line of code, and we can have an example showing that.
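A minimal sketch of the "return the distances" idea from the comment above, in plain Python rather than Spark. The centers, points, and cutoff are made-up values, `nearest_center_distance` is a hypothetical helper, and Euclidean distance is assumed.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest_center_distance(point: Point, centers: Sequence[Point]) -> float:
    """Euclidean distance from `point` to its closest cluster center."""
    return min(math.dist(point, center) for center in centers)


# Made-up cluster centers, standing in for those of a fitted KMeans model.
centers: List[Point] = [(0.0, 0.0), (10.0, 10.0)]
points: List[Point] = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0), (0.0, 1.0)]

# The proposed output: one distance per input point.
distances = [nearest_center_distance(p, centers) for p in points]

# The "one more line" a user would add to pick out anomalies.
threshold = 3.0  # assumed cutoff, chosen only for this example
anomalies = [p for p, d in zip(points, distances) if d > threshold]
```

Here only the point midway between the two centers exceeds the cutoff, so it is the only one flagged.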
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633883#comment-14633883 ]

Joseph K. Bradley commented on SPARK-8540:
------------------------------------------

On the one hand, I agree this could potentially be solved with a good code example. On the other hand, it is another cognitive step for users looking to do outlier detection. Also, I suspect we will eventually want complex algorithms specialized for outlier/anomaly detection. If we only put complex outlier detection algorithms under the name outlier detection, then users may use those unnecessarily complex algorithms by default. E.g., I suspect this happens a lot in sklearn, where the only one explicitly under outlier detection is 1-class SVM, which is surely overkill for many use cases.
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615596#comment-14615596 ]

Venkata Vineel commented on SPARK-8540:
---------------------------------------

[~josephkb] Yes, I looked there, but it is not clear which issues can be worked on. (Some are open and unassigned but still have people working on them. Please consider helping me pick something up.)
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615497#comment-14615497 ]

Joseph K. Bradley commented on SPARK-8540:
------------------------------------------

If this is your first Spark contribution, I'd recommend starting with a smaller patch, rather than a new feature. Can you please look at [https://issues.apache.org/jira/browse/SPARK-8445] for details and instructions? Thanks!
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614630#comment-14614630 ]

Venkata Vineel commented on SPARK-8540:
---------------------------------------

[~josephkb] Can I please work on this (if you can mentor me on the design, etc.)?
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599188#comment-14599188 ]

Gurjot Singh commented on SPARK-8540:
-------------------------------------

Can you please elaborate on what (b) does? Will it simply return the specified number of data points that are farthest from their cluster mean, even if they are not outliers in statistical terms?
[jira] [Commented] (SPARK-8540) KMeans-based outlier detection
[ https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600076#comment-14600076 ]

Joseph K. Bradley commented on SPARK-8540:
------------------------------------------

That's correct: For (b), the user would specify wanting the K most anomalous data points (or perhaps some fraction). (a) seems more reasonable statistically, but (b) would let users collect the results without fear of blowing up the master node.
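The point about (b) keeping the collected result small can be illustrated with a bounded top-K selection. This is a plain-Python sketch, not Spark code: `top_k_anomalies` and `top_fraction_anomalies` are hypothetical helpers, and rounding the fraction up is an assumption made for this example.

```python
import heapq
import math
from typing import Iterable, List, Tuple

# Hypothetical row layout: (point id, distance to its assigned cluster center).
Scored = Tuple[str, float]


def top_k_anomalies(scored: Iterable[Scored], k: int) -> List[Scored]:
    """Return the k rows with the largest distance, most anomalous first.

    heapq.nlargest keeps at most k candidates while it scans, so the
    result size is bounded by k no matter how many rows are scanned.
    """
    return heapq.nlargest(k, scored, key=lambda row: row[1])


def top_fraction_anomalies(scored: List[Scored], fraction: float) -> List[Scored]:
    """Variant for 'some fraction': convert the fraction to a count first."""
    k = math.ceil(fraction * len(scored))  # rounding up is an assumption
    return top_k_anomalies(scored, k)


rows = [("a", 0.2), ("b", 4.0), ("c", 1.5), ("d", 9.1), ("e", 0.7)]
top_two = top_k_anomalies(rows, 2)
top_fraction = top_fraction_anomalies(rows, 0.4)
```

Unlike a threshold, which may match an arbitrary share of the data, both helpers return at most `k` rows.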