shub-kris commented on code in PR #23497:
URL: https://github.com/apache/beam/pull/23497#discussion_r988664664


##########
website/www/site/content/en/documentation/ml/anomaly-detection.md:
##########
@@ -0,0 +1,230 @@
+---
+title: "Anomaly Detection"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Anomaly Detection Example
+
+The AnomalyDetection example demonstrates how to set up an anomaly detection pipeline that reads text from PubSub in real time and then detects anomalies using a trained HDBSCAN clustering model.
+
+
+### Dataset for Anomaly Detection
+For this example, we use a dataset called [emotion](https://huggingface.co/datasets/emotion). It comprises 20,000 English Twitter messages labeled with 6 basic emotions: anger, fear, joy, love, sadness, and surprise. The dataset has three splits: train (for training), validation, and test (for performance evaluation). It is a supervised dataset, as it contains both the text and the category (class) of each message. This dataset can easily be accessed using [HuggingFace Datasets](https://huggingface.co/docs/datasets/index).
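+As a quick sketch of working with it (the `load_dataset` call is commented out because it needs the `datasets` package and network access; the label ordering below is taken from the dataset card and is worth verifying before relying on it):

```python
# Sketch: load the emotion dataset with HuggingFace Datasets.
# from datasets import load_dataset
# dataset = load_dataset("emotion")
# print(dataset["train"][0])  # e.g. {"text": "...", "label": 0}

# The integer labels map to the six emotions; this ordering follows the
# dataset card and should be double-checked against the loaded dataset.
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def label_name(label_id: int) -> str:
    """Translate a numeric label into its emotion name."""
    return EMOTIONS[label_id]

print(label_name(2))  # prints "love"
```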
+
+To get a better understanding of the dataset, here are some examples from its train split:
+
+
+| Text        | Type of emotion |
+| :---        |    :----:   |
+| im grabbing a minute to post i feel greedy wrong      | Anger       |
+| i am ever feeling nostalgic about the fireplace i will know that it is still 
on the property   | Love        |
+| ive been taking or milligrams or times recommended amount and ive fallen 
asleep a lot faster but i also feel like so funny | Fear |
+| on a boat trip to denmark | Joy |
+| i feel you know basically like a fake in the realm of science fiction | 
Sadness |
+| i began having them several times a week feeling tortured by the 
hallucinations moving people and figures sounds and vibrations | Fear |
+
+### Anomaly Detection Algorithm
+[HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) is a clustering algorithm that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. Once trained, when predicting the cluster for a new data point, the model outputs -1 if the point is an outlier; otherwise, it predicts one of the existing clusters.
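+The -1 outlier convention can be illustrated with a small sketch. It uses scikit-learn's DBSCAN (the algorithm HDBSCAN extends, which labels outliers the same way) on made-up 2-D points, since the actual model in this example operates on tweet embeddings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight clusters plus one far-away point.
points = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.0, 0.1],   # cluster A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # cluster B
    [20.0, 20.0],                          # isolated outlier
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # points in a cluster get 0, 1, ...; the isolated point gets -1
```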
+
+
+## Ingestion to PubSub
+We first ingest the data into [PubSub](https://cloud.google.com/pubsub/docs/overview) so that, while clustering, we can read the tweets from PubSub. PubSub is a messaging service for exchanging event data among applications and services. Streaming analytics and data integration pipelines use it to ingest and distribute data.
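+As a rough sketch, publishing the tweets could look like the following. The serialization helper is hypothetical (the actual pipeline may encode messages differently), and the publish step is commented out because it needs the `google-cloud-pubsub` package, a real topic, and GCP credentials:

```python
import json

def to_pubsub_payload(row: dict) -> bytes:
    """Hypothetical helper: serialize one dataset row to a bytes payload,
    since PubSub message data must be bytes."""
    return json.dumps({"text": row["text"]}).encode("utf-8")

rows = [{"text": "on a boat trip to denmark", "label": 1}]
payloads = [to_pubsub_payload(r) for r in rows]

# from google.cloud import pubsub_v1
# publisher = pubsub_v1.PublisherClient()
# topic_path = publisher.topic_path("your-project-id", "your-topic-id")
# for payload in payloads:
#     publisher.publish(topic_path, payload)
```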
+
+The full example code for ingesting data into PubSub can be found [here](sdks/python/apache_beam/examples/inference/anomaly_detection/write_data_to_pubsub_pipeline/).
+
+The file structure of the ingestion pipeline is:
+
+    write_data_to_pubsub_pipeline/
+    ├── pipeline/
+    │   ├── __init__.py
+    │   ├── options.py
+    │   └── utils.py
+    ├── __init__.py
+    ├── config.py
+    ├── main.py
+    └── setup.py
+
+- `pipeline/utils.py` contains the code for loading the emotion dataset and two `beam.DoFn`s that are used for data transformation.
+- `pipeline/options.py` contains the pipeline options to configure the Dataflow pipeline.
+- `config.py` defines variables, such as the GCP PROJECT_ID and NUM_WORKERS, that are used multiple times.
+- `setup.py` defines the packages/requirements needed for the pipeline to run.
+- `main.py` contains the pipeline code and some additional functions used for running the pipeline.
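+To give a flavor of the data transformations in `pipeline/utils.py`, a decoding step could look like the following. The helper below is hypothetical, not the actual code; in a Beam pipeline it could be applied with `beam.Map` or wrapped in a `beam.DoFn`:

```python
import json

def parse_message(payload: bytes) -> str:
    """Hypothetical transform: decode a PubSub payload (bytes) back into
    the tweet text, assuming it was published as a JSON object."""
    return json.loads(payload.decode("utf-8"))["text"]

print(parse_message(b'{"text": "on a boat trip to denmark"}'))  # prints the tweet text
```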
+
+### How to Run the Pipeline?
+First, make sure you have installed the required packages.
+
+1. Locally on your machine: `python main.py`
+2. On GCP for Dataflow: `python main.py --mode cloud`

Review Comment:
   The variables in `config.py` are ones that, in my opinion, one won't change frequently. They are usually configured once at the start and then rarely updated, so I am not sure it really helps.


