shub-kris commented on code in PR #23497:
URL: https://github.com/apache/beam/pull/23497#discussion_r991009346
##########
website/www/site/content/en/documentation/ml/anomaly-detection.md:
##########
@@ -0,0 +1,230 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Anomaly Detection Example
+
+The AnomalyDetection example demonstrates how to set up an anomaly detection pipeline that reads text from PubSub in real time and then detects anomalies using a trained HDBSCAN clustering model.
+
+
+### Dataset for Anomaly Detection
+For the example, we use a dataset called [emotion](https://huggingface.co/datasets/emotion). It comprises 20,000 English Twitter messages labeled with 6 basic emotions: anger, fear, joy, love, sadness, and surprise. The dataset has three splits: train (for training), validation, and test (for performance evaluation). It is a supervised dataset, as it contains both the text and the category (class) of each message. This dataset can easily be accessed using [HuggingFace Datasets](https://huggingface.co/docs/datasets/index).
+
+To better understand the dataset, here are some examples from its train split:
+
+
+| Text | Type of emotion |
+| :--- | :----: |
+| im grabbing a minute to post i feel greedy wrong | Anger |
+| i am ever feeling nostalgic about the fireplace i will know that it is still on the property | Love |
+| ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny | Fear |
+| on a boat trip to denmark | Joy |
+| i feel you know basically like a fake in the realm of science fiction | Sadness |
+| i began having them several times a week feeling tortured by the hallucinations moving people and figures sounds and vibrations | Fear |
+
+### Anomaly Detection Algorithm
+[HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) is a clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of the clusters. Once trained, the model predicts a cluster for each new data point: it outputs -1 if the point is an outlier, and otherwise the label of one of the existing clusters.
+
+
+## Ingestion to PubSub
+We first ingest the data into [PubSub](https://cloud.google.com/pubsub/docs/overview) so that, while clustering, we can read the tweets from PubSub. PubSub is a messaging service for exchanging event data among applications and services. It is used for streaming analytics and data integration pipelines to ingest and distribute data.
+
+The full example code for ingesting data to PubSub can be found [here](sdks/python/apache_beam/examples/inference/anomaly_detection/write_data_to_pubsub_pipeline/)
+
+The file structure for the ingestion pipeline is:
+
+    write_data_to_pubsub_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── utils.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+- `pipeline/utils.py` contains the code for loading the emotion dataset and two `beam.DoFn` classes that are used for data transformation
+- `pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline
+- `config.py` defines variables such as the GCP PROJECT_ID and NUM_WORKERS that are used multiple times
+- `setup.py` defines the packages/requirements for the pipeline to run
+- `main.py` contains the pipeline code and some additional functions used for running the pipeline
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`

Review Comment:
   Okay, adding that to the documentation.
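
For reference, the outlier convention described in the quoted "Anomaly Detection Algorithm" section (a label of -1 means "outlier") can be sketched as follows. This is a minimal, standard-library-only sketch and not code from the PR: the helper name `split_anomalies`, the constant `OUTLIER_LABEL`, and the sample labels are hypothetical. In a real pipeline the labels would presumably come from the trained HDBSCAN model (for instance through the `hdbscan` library's `approximate_predict`, which is an assumption here since the quoted hunk does not show the prediction code).

```python
from typing import List, Sequence, Tuple

# Label that, per the quoted doc, the clustering model assigns to outliers.
OUTLIER_LABEL = -1


def split_anomalies(
    texts: Sequence[str], labels: Sequence[int]
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Partition texts into anomalies (label -1) and clustered points.

    Hypothetical helper for illustration; not part of the PR.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")

    anomalies: List[str] = []
    clustered: List[Tuple[str, int]] = []
    for text, label in zip(texts, labels):
        if label == OUTLIER_LABEL:
            anomalies.append(text)
        else:
            clustered.append((text, label))
    return anomalies, clustered


# Made-up labels standing in for model predictions.
texts = ["on a boat trip to denmark", "asdf qwer zxcv"]
labels = [3, -1]
anomalies, clustered = split_anomalies(texts, labels)
```

Keeping the -1 check in one small function means a downstream step only needs the `anomalies` list and does not have to know the labeling convention.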
