[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2016-01-14 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098516#comment-15098516
 ] 

Rakesh Chalasani commented on SPARK-8540:
-

[~josephkb] Is this JIRA still of interest?

> KMeans-based outlier detection
> --
>
> Key: SPARK-8540
> URL: https://issues.apache.org/jira/browse/SPARK-8540
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Proposal for K-Means-based outlier detection:
> * Cluster data using K-Means
> * Provide prediction/filtering functionality which returns outliers/anomalies
> ** This can take some threshold parameter which specifies either (a) how far 
> off a point needs to be to be considered an outlier or (b) how many outliers 
> should be returned.
> Note this will require a bit of API design, which should probably be posted 
> and discussed on this JIRA before implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2016-01-07 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088746#comment-15088746
 ] 

Rakesh Chalasani commented on SPARK-8540:
-

I see that this hasn't moved forward, so I am trying to revive it. I will pick 
this up.

After a quick look at the KMeans API, we have two options:

1. Add this to KMeans/KMeansModel itself (which I don't like after what 
[~josephkb] said above)

(or)

2. Add KMeansOutlier and KMeansOutlierModel as separate classes. KMeansOutlier 
can extend KMeans itself, with additional parameters to support (a) and (b) 
mentioned above. KMeansOutlierModel might have to duplicate some parts of 
KMeansModel.

For (a), a setThreshold/getThreshold param needs to be added, and it can be 
implemented using a simple 'where'. For (b), a setNumOutliers/getNumOutliers 
param needs to be added, and it requires an orderBy followed by a take (or 
something better?). If both are set, (a) takes precedence over (b).
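
To make the precedence rule concrete, here is a minimal sketch in plain Python 
rather than the Spark API. The name select_outliers and its parameters are 
hypothetical stand-ins for the proposed threshold and numOutliers params, and 
the input is assumed to be a list of (point, distance-to-center) pairs that has 
already been computed.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


def select_outliers(
    scored: List[Tuple[Point, float]],
    threshold: Optional[float] = None,
    num_outliers: Optional[int] = None,
) -> List[Tuple[Point, float]]:
    """Hypothetical helper: pick outliers from (point, distance) pairs."""
    if threshold is not None:
        # Option (a): the equivalent of a simple 'where' on the distance.
        # Checked first, so it takes precedence when both params are set.
        return [(p, d) for p, d in scored if d > threshold]
    if num_outliers is not None:
        # Option (b): the equivalent of an orderBy (descending) plus a take.
        ranked = sorted(scored, key=lambda pd: pd[1], reverse=True)
        return ranked[:num_outliers]
    raise ValueError("set either threshold or num_outliers")


# Made-up (point, distance) pairs, for illustration only.
scored = [((0.0, 0.1), 0.1), ((5.0, 5.0), 7.0), ((0.2, 0.0), 0.2), ((3.0, 0.0), 3.0)]
by_threshold = select_outliers(scored, threshold=2.5)
by_count = select_outliers(scored, num_outliers=1)
both = select_outliers(scored, threshold=2.5, num_outliers=1)
```

In this sketch, passing both params gives the same result as passing the 
threshold alone, which is one possible reading of "(a) takes precedence".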

Please let me know your thoughts.




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-07-20 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633601#comment-14633601
 ] 

Rakesh Chalasani commented on SPARK-8540:
-

I think coupling an algorithm to a specific use case might not be a good 
idea, in this case KMeans with anomaly detection. Why not just return the 
distances to the KMeans centers, so that the user can write a simple operation 
over that column to get the anomalies? If we return distances, finding 
anomalies will take just one more line of code, and we can have an example 
showing that.
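
As a rough illustration of that idea, here is a sketch in plain Python, not the 
Spark API. The helper name distance_to_nearest_center, the sample centers and 
points, and the threshold value are all made up for the example; the centers 
are assumed to come from an already fitted KMeans model.

```python
import math
from typing import List, Sequence


def distance_to_nearest_center(
    point: Sequence[float], centers: List[Sequence[float]]
) -> float:
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, c) for c in centers)


# Assumed output of a fitted model: two cluster centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0), (0.5, 0.0)]

# The "distance column" that the model would return alongside each point.
distances = [distance_to_nearest_center(p, centers) for p in points]

# The one extra line the user writes: keep points farther than a threshold.
threshold = 3.0  # arbitrary value chosen for this example
anomalies = [p for p, d in zip(points, distances) if d > threshold]
```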




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-07-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633883#comment-14633883
 ] 

Joseph K. Bradley commented on SPARK-8540:
--

On the one hand, I agree this could potentially be solved with a good code 
example.

On the other hand, it is another cognitive step for users looking to do outlier 
detection.  Also, I suspect we will eventually want complex algorithms 
specialized for outlier/anomaly detection.  If we only put complex algorithms 
under the name "outlier detection", then users may reach for those 
unnecessarily complex algorithms by default.  E.g., I suspect this happens a 
lot in sklearn, where the only algorithm listed explicitly under outlier 
detection is 1-class SVM, which is surely overkill for many use cases.




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615596#comment-14615596
 ] 

Venkata Vineel commented on SPARK-8540:
---

[~josephkb] Yes, I looked there, but it is not clear which issues can be 
worked on. (Some are open and unassigned but still have people working on 
them. Please consider helping me pick something up.)




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-07-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615497#comment-14615497
 ] 

Joseph K. Bradley commented on SPARK-8540:
--

If this is your first Spark contribution, I'd recommend starting with a smaller 
patch, rather than a new feature.  Can you please look at 
[https://issues.apache.org/jira/browse/SPARK-8445] for details and 
instructions?  Thanks!




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-07-06 Thread Venkata Vineel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614630#comment-14614630
 ] 

Venkata Vineel commented on SPARK-8540:
---

[~josephkb] Can I please work on this (if you can mentor me with the design 
etc.)?




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-06-24 Thread Gurjot Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599188#comment-14599188
 ] 

Gurjot Singh commented on SPARK-8540:
-

Can you please elaborate on what (b) does? Will it simply return the specified 
number of data points that are farthest from their cluster mean, even if they 
are not outliers in statistical terms?




[jira] [Commented] (SPARK-8540) KMeans-based outlier detection

2015-06-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600076#comment-14600076
 ] 

Joseph K. Bradley commented on SPARK-8540:
--

That's correct: For (b), the user would specify wanting the K most anomalous 
data points (or perhaps some fraction).

(a) seems more reasonable statistically, but (b) would let users collect the 
results without fear of blowing up the master node.
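
A small sketch of why (b) bounds the result size, written in plain Python 
rather than against the Spark API. The function names top_k_outliers and 
top_fraction_outliers and the sample scores are hypothetical; the input is 
assumed to be (id, distance-to-center) pairs.

```python
import heapq
import math
from typing import List, Tuple

Scored = Tuple[str, float]


def top_k_outliers(scored: List[Scored], k: int) -> List[Scored]:
    """Return at most k pairs with the largest distances, largest first."""
    return heapq.nlargest(k, scored, key=lambda pd: pd[1])


def top_fraction_outliers(scored: List[Scored], fraction: float) -> List[Scored]:
    """Variant of (b): keep a fraction of the data instead of a fixed count."""
    k = math.ceil(fraction * len(scored))
    return top_k_outliers(scored, k)


# Made-up (id, distance) pairs, for illustration only.
scored = [("a", 0.3), ("b", 9.0), ("c", 0.1), ("d", 4.0), ("e", 0.2)]

top2 = top_k_outliers(scored, 2)
top_fifth = top_fraction_outliers(scored, 0.2)

# By contrast, a badly chosen threshold for (a) can match every point,
# so the size of the collected result is not bounded in advance.
over_threshold = [pd for pd in scored if pd[1] > 0.05]
```

With a fixed k, the caller knows the maximum number of rows that will be 
collected, whatever the data looks like.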
