[ https://issues.apache.org/jira/browse/SPARK-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300919#comment-16300919 ]
Fangzhou Yang commented on SPARK-22867:
---------------------------------------

We designed and implemented a distributed iForest on Spark, which is trained via model-wise parallelism and predicts on a new Dataset via data-wise parallelism. It is implemented in the following steps:
1. Sampling data from a Dataset. Data instances are sampled and grouped for each iTree. As indicated in the paper, the number of samples used to construct each tree is usually not very large (default value 256). Thus we can construct a sampled pair RDD, where each row key is a tree index and each row value is the group of sampled data instances for that tree.
2. Training and constructing each iTree in parallel via a map operation, then collecting all iTrees to construct an iForest model.
3. Predicting on a new Dataset in parallel via a map operation with the collected iForest model.

More details about Spark IForest can be found in my github repository:
https://github.com/titicaca/spark-iforest

> Add Isolation Forest algorithm to MLlib
> ---------------------------------------
>
>                 Key: SPARK-22867
>                 URL: https://issues.apache.org/jira/browse/SPARK-22867
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 2.2.1
>            Reporter: Fangzhou Yang
>
> Isolation Forest (iForest) is an effective model that focuses on anomaly
> isolation.
> iForest uses a tree structure to model data; an iTree isolates anomalies
> closer to the root of the tree than normal points.
> An anomaly score is calculated by the iForest model to measure the abnormality
> of the data instances. The lower the score, the more abnormal the instance.
> More details about iForest can be found in the following papers:
> Isolation Forest [1] (https://dl.acm.org/citation.cfm?id=1511387)
> and Isolation-Based Anomaly Detection [2]
> (https://dl.acm.org/citation.cfm?id=2133363).
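
The three steps in the comment (sample a group of rows per tree, build each tree with a map, score new rows with a map over the collected forest) can be sketched in plain Python. This is a single-machine illustration, not the code from the linked repository: the built-in map stands in for a Spark map operation, and every name here (sample_groups, build_tree, path_length, anomaly_score) is hypothetical. The score follows the formula from the Isolation Forest paper, s(x) = 2^(-E[h(x)] / c(psi)); under that formula a higher value means more anomalous, so a "lower is more abnormal" convention would need the score negated or otherwise transformed.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def sample_groups(data, num_trees, sample_size, rng):
    """Step 1: one (tree_index, sampled_rows) pair per tree."""
    k = min(sample_size, len(data))
    return [(i, rng.sample(data, k)) for i in range(num_trees)]


def build_tree(rows, depth, max_depth, rng):
    """Step 2, per tree: split on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop instead of trying another feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, x, depth=0):
    """Depth at which x reaches a leaf, plus c(size) for the unsplit leaf."""
    if "feature" not in node:
        return depth + c(node["size"])
    go_left = x[node["feature"]] < node["split"]
    child = node["left"] if go_left else node["right"]
    return path_length(child, x, depth + 1)


def anomaly_score(forest, x, sample_size):
    """Paper-style score in (0, 1]; higher means more anomalous."""
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(sample_size))


rng = random.Random(42)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]
num_trees, sample_size = 100, 256  # 256 is the default quoted in the comment
max_depth = math.ceil(math.log2(sample_size))

# Step 1: keyed sample groups (a pair RDD in the Spark design).
groups = sample_groups(train, num_trees, sample_size, rng)
# Step 2: build every tree independently, then collect them into a forest.
forest = list(map(
    lambda kv: build_tree(kv[1], 0, max_depth, random.Random(kv[0])),
    groups,
))
# Step 3: score new rows independently against the collected forest.
new_data = [(0.0, 0.0), (10.0, 10.0)]
scores = list(map(lambda x: anomaly_score(forest, x, sample_size), new_data))
print(scores)
```

Because each tree depends only on its own small sample group, step 2 has no shared state between trees, which is what makes the model-wise map possible. Step 3 needs only read access to the finished forest, so rows can be scored independently. In this toy run the far-away point (10, 10) should receive a higher score than the point at the centre of the training data.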