[ 
https://issues.apache.org/jira/browse/SPARK-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300919#comment-16300919
 ] 

Fangzhou Yang commented on SPARK-22867:
---------------------------------------


We have designed and implemented a distributed iForest on Spark. It is trained 
with model-wise parallelism and predicts on a new Dataset with data-wise 
parallelism. 
The implementation has three steps:
  1. Sample data from a Dataset. Data instances are sampled and grouped for 
each iTree. 
  As indicated in the paper, the number of samples used to construct each tree 
is usually not very large (the default value is 256). 
  We can therefore construct a sampled pair RDD in which each row key is a tree 
index and each row value is the group of data instances sampled for that tree.
  2. Train and construct each iTree in parallel via a map operation, then 
collect all iTrees to construct an iForest model.
  3. Predict on a new Dataset in parallel via a map operation using the 
collected iForest model.
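
The three steps above can be sketched in plain, single-process Python. This is 
only an illustration under stated assumptions, not the code in the repository 
and not a Spark API: the list of (tree index, rows) pairs stands in for the 
paired RDD, list comprehensions stand in for the map operations, and every 
function name here is hypothetical. The score is the negated score from the 
iForest paper, an assumption made so that lower values mean more abnormal, as 
in the issue description. The input is assumed to hold at least two rows.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """c(n) from the iForest papers: average path length of an
    unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def sample_groups(data, num_trees, sample_size, rng):
    """Step 1: one (tree_index, sampled_rows) pair per tree.
    This list is a local stand-in for the paired RDD."""
    k = min(sample_size, len(data))
    return [(t, rng.sample(data, k)) for t in range(num_trees)]


def build_tree(rows, depth, max_depth, rng):
    """Step 2 (per tree): grow an iTree with random axis-aligned splits."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        # Simplification: stop if the chosen feature cannot split the rows.
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def train_forest(data, num_trees=100, sample_size=256, seed=0):
    """Steps 1 and 2: sample per-tree groups, build each tree, collect."""
    rng = random.Random(seed)
    k = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(max(k, 2)))
    groups = sample_groups(data, num_trees, k, rng)
    # On Spark this would be a map over the pairs followed by a collect.
    trees = [build_tree(rows, 0, max_depth, rng) for _, rows in groups]
    return {"trees": trees, "sample_size": k}


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus c(size) for the unsplit leaf."""
    if "size" in tree:
        return depth + avg_path_length(tree["size"])
    child = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)


def score(forest, x):
    """Negated paper score: lower means more abnormal (assumed convention)."""
    trees = forest["trees"]
    mean = sum(path_length(t, x) for t in trees) / len(trees)
    return -(2.0 ** (-mean / avg_path_length(forest["sample_size"])))


def predict(forest, data):
    """Step 3: score every row with the collected forest (a map on Spark)."""
    return [score(forest, x) for x in data]
```

Because each sampled group is small, a whole group fits in one task, which is 
what makes building one tree per key practical in this scheme.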

More details about Spark iForest can be found in my GitHub repository:
https://github.com/titicaca/spark-iforest

> Add Isolation Forest algorithm to MLlib
> ---------------------------------------
>
>                 Key: SPARK-22867
>                 URL: https://issues.apache.org/jira/browse/SPARK-22867
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 2.2.1
>            Reporter: Fangzhou Yang
>
> Isolation Forest (iForest) is an effective model that focuses on anomaly 
> isolation. 
> iForest models data with tree structures: an iTree isolates anomalies 
> closer to the root of the tree than normal points. 
> The iForest model calculates an anomaly score to measure the abnormality of 
> the data instances. The lower the score, the more abnormal the instance.
> More details about iForest can be found in the following papers: 
> Isolation Forest [1] (https://dl.acm.org/citation.cfm?id=1511387) 
> and Isolation-Based Anomaly Detection [2] 
> (https://dl.acm.org/citation.cfm?id=2133363).



