damccorm commented on code in PR #30421:
URL: https://github.com/apache/beam/pull/30421#discussion_r1511720034
##########
learning/prompts/documentation-lookup-nolinks/34_ml_data_preprocessing.md:
##########
@@ -0,0 +1,87 @@
+Prompt:
+How can I preprocess data for ML models in Apache Beam?
+
+Response:
+Apache Beam provides a robust framework for creating data processing pipelines
for machine learning (ML) applications, offering various capabilities for
preprocessing and analyzing data. Alongside powerful transforms, Apache Beam
provides a rich set of I/O connectors, facilitating seamless integration with
existing file systems, databases, or messaging queues.
+
+In AI/ML projects, the following stages of data processing are essential:
+* **Data exploration**: analyzing and understanding the characteristics,
patterns, and distributions within a dataset to gain insight and understand the
relationship between different variables.
+* **Data preprocessing**: cleaning, transforming, and preparing raw data to
make it suitable for machine learning algorithms.
+* **Data postprocessing**: applying additional transformations to the output
of a machine learning model after inference for interpretation and readability.
+* **Data validation**: assessing the quality, consistency, and correctness of
the data to ensure that the data meets certain standards or criteria and is
suitable for the intended analysis or application.
+
+You can implement all these data processing stages in Apache Beam pipelines.
+
+A typical data preprocessing pipeline involves several steps:
+
+* **Reading and writing data**: reading from and writing to various data
sources and sinks.
+* **Data cleaning**: filtering and cleaning data, removing duplicates,
correcting errors, handling missing values, or filtering outliers.
+* **Data transformations**: scaling, encoding, or vectorizing data to prepare
it for model input.
+* **Data enrichment**: incorporating external data sources to enhance the
dataset's richness and context.
+* **Data validation and metrics**: validating data quality and calculating
metrics such as class distributions.
+
+The following example demonstrates an Apache Beam pipeline that implements all
these steps:
+
+```python
+import apache_beam as beam
+from apache_beam.metrics import Metrics
+
+with beam.Pipeline() as pipeline:
+  # Create data
+  input_data = (
+      pipeline
+      | beam.Create([
+          {'age': 25, 'height': 176, 'weight': 60, 'city': 'London'},
+          {'age': 61, 'height': 192, 'weight': 95, 'city': 'Brussels'},
+          {'age': 48, 'height': 163, 'weight': None, 'city': 'Berlin'}]))
+
+  # Clean data
+  def filter_missing_data(row):
+    return row['weight'] is not None
+
+  cleaned_data = input_data | beam.Filter(filter_missing_data)
+
+  # Transform data
+  def scale_min_max_data(row):
+    row['age'] = (row['age']/100)
+    row['height'] = (row['height']-150)/50
+    row['weight'] = (row['weight']-50)/50
+    yield row
+
+  transformed_data = cleaned_data | beam.FlatMap(scale_min_max_data)
+
+  # Enrich data
+  # AsDict needs (key, value) pairs, so each text line is parsed first.
+  # Assumption: each line of coordinates.csv is "city,latitude,longitude".
+  def parse_coordinates(line):
+    city, latitude, longitude = line.split(',')
+    return city, (float(latitude), float(longitude))
+
+  side_input = (
+      pipeline
+      | beam.io.ReadFromText('coordinates.csv')
+      | beam.Map(parse_coordinates))
+
+  def coordinates_lookup(row, coordinates):
+    row['coordinates'] = coordinates.get(row['city'], (0, 0))
+    del row['city']
+    yield row
+
+  enriched_data = (
+      transformed_data
+      | beam.FlatMap(coordinates_lookup,
+                     coordinates=beam.pvalue.AsDict(side_input)))
+
+  # Metrics
+  counter = Metrics.counter('main', 'counter')
+
+  def count_data(row):
+    counter.inc()
+    yield row
+
+  output_data = enriched_data | beam.FlatMap(count_data)
+
+  # Write data
+  output_data | beam.io.WriteToText('output.csv')
+```
+
+In this example, the Apache Beam pipeline performs the following steps:
+* Creates data.
+* Cleans data by filtering missing values.
+* Transforms data by scaling it.
+* Enriches data by adding coordinates from an external source.
+* Collects metrics and counts data instances.
+* Writes the processed data to an output file.
+
+In addition to standard data processing transforms, Apache Beam also provides
a set of specialized transforms for preprocessing and transforming data,
consolidating them into the `MLTransform` class. This class simplifies your
workflow and ensures data consistency by enabling the use of the same steps for
training and inference. You can use `MLTransform` to generate text embeddings
and apply specialized processing modules provided by the TensorFlow
Transform (TFT) library for machine learning tasks.
Review Comment:
```suggestion
In addition to standard data processing transforms, Apache Beam also
provides a set of specialized transforms for preprocessing and transforming
data, consolidating them into the `MLTransform` class. This class simplifies
your workflow and ensures data consistency by enabling the use of the same
steps for training and inference. You can use `MLTransform` to generate text
embeddings and apply specialized processing modules provided by the
TensorFlow Transform (TFT) library for machine learning tasks, such as
computing and applying a vocabulary, scaling your data using z-scores, and
bucketizing your data.
```
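To make the three operations named in the suggestion concrete, here is a minimal plain-Python sketch of what each one computes. It deliberately does not use `MLTransform` or TFT: the function names `compute_vocabulary`, `z_scores`, and `bucketize` are hypothetical, and the equal-width bucketing is a simplifying assumption (a library implementation may choose boundaries differently, for example from quantiles).

```python
import statistics


def compute_vocabulary(tokens):
    """Map each distinct token to an integer index, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def z_scores(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so every z-score is defined as 0 here.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


def bucketize(values, num_buckets):
    """Assign each value to one of num_buckets equal-width buckets.

    Assumption: equal-width buckets spanning [min, max] of the input.
    """
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    width = (high - low) / num_buckets
    # The maximum value would land in bucket num_buckets, so clamp it.
    return [min(int((value - low) / width), num_buckets - 1)
            for value in values]


# Usage on values resembling the pipeline example above.
cities = ['London', 'Brussels', 'London', 'Berlin']
vocabulary = compute_vocabulary(cities)
city_ids = [vocabulary[city] for city in cities]
scaled_heights = z_scores([176.0, 192.0, 163.0])
age_buckets = bucketize([25, 48, 61], num_buckets=3)
```

The point of wrapping such steps in a single class is that the statistics (vocabulary, mean, standard deviation, bucket boundaries) are computed once over the training data and then reused unchanged at inference time.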
--
This is an automated message from the Apache Git Service.