[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597166#comment-14597166
 ] 

Joseph K. Bradley edited comment on SPARK-4038 at 6/23/15 5:00 AM:
-------------------------------------------------------------------

K-Means seemed like the best first choice: it is the easiest to implement and is generally useful.

For AVF and LOF, it'd be good to get feedback about use cases since I'm not 
that familiar with those.  (Are they among the most commonly used methods?  In 
what applications?)
* I noticed someone wrote AVF for Spark, though I have not looked at the code 
yet: [https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark]

KNN sounds expensive in a distributed setting.  That should probably come later.

For my records, linking some papers here:
* [AVF | 
http://enriquegortiz.com/wordpress/enriquegortiz/research/undergraduate/outlier-detection/]
* [LOF: Identifying Density-Based Local Outliers | 
http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf]
* [about distributed outlier detection | 
http://etd.fcla.edu/CF/CFE0002734/Koufakou_Anna_200908_PhD.pdf]

(If others have references, please link them too!)
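
A minimal sketch of how K-Means could serve as an outlier detector. It assumes the common approach of scoring each point by its distance to the nearest centroid; the comment above does not specify a scoring rule, so that choice is an assumption. This is plain Python rather than MLlib, and the names `kmeans`, `outlier_scores`, and the caller-supplied initial centroids are hypothetical.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=lambda i: dists[i])
    return idx, dists[idx]


def kmeans(points, initial_centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


def outlier_scores(points, centroids):
    """Score each point by its distance to the nearest centroid (higher = more unusual)."""
    return [nearest(p, centroids)[1] for p in points]
```

For example, `outlier_scores(points, kmeans(points, seeds))` returns one score per point, and the highest-scoring points are the outlier candidates. In a distributed setting only the small list of centroids would need to be shared with each worker, which is part of why this route looks cheaper than KNN.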



> Outlier Detection Algorithm for MLlib
> -------------------------------------
>
>                 Key: SPARK-4038
>                 URL: https://issues.apache.org/jira/browse/SPARK-4038
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ashutosh Trivedi
>            Priority: Minor
>
> The aim of this JIRA is to discuss which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance-based and is well suited to categorical 
> data. The original paper also gives a parallel version, which is not 
> complicated to implement. I am working on the implementation and will soon 
> submit the initial code for review.
> Here is the link to the paper:
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in the discussion at
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> there are other algorithms as well. Let's discuss which would be more 
> general and more easily parallelized.
>    
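
The AVF idea in the quoted description can be sketched as follows. This assumes the usual AVF definition: a row's score is the mean frequency of its attribute values, and the rows with the lowest scores are the outlier candidates. It is plain Python on an in-memory list, with one counting pass followed by a scoring pass. The names `avf_scores` and `lowest_avf` are hypothetical; this is not the reporter's implementation or an MLlib API.

```python
from collections import Counter


def avf_scores(rows):
    """Return one AVF score per row: the mean frequency of the row's attribute values."""
    if not rows:
        return []
    m = len(rows[0])  # number of categorical attributes per row
    # Counting pass: how often does each value occur in each attribute column?
    counts = [Counter() for _ in range(m)]
    for row in rows:
        for i, value in enumerate(row):
            counts[i][value] += 1
    # Scoring pass: rows made of rare values get low scores.
    return [sum(counts[i][v] for i, v in enumerate(row)) / m for row in rows]


def lowest_avf(rows, k):
    """Return the indices of the k rows with the lowest AVF scores."""
    scores = avf_scores(rows)
    return sorted(range(len(rows)), key=lambda idx: scores[idx])[:k]
```

The per-column counts are simple sums, so in a parallel version each partition could count locally and the counts could then be merged. That fits the claim that the method parallelizes easily, though the merge strategy here is only an assumption about how it might be done.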


