damccorm commented on code in PR #24437:
URL: https://github.com/apache/beam/pull/24437#discussion_r1036443160
##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -38,29 +38,23 @@
"\n",
"For rapid execution, Pandas loads all of the data into memory on a
single machine (one node). This configuration works well when dealing with
small-scale datasets. However, many projects involve datasets that are too big
to fit in memory. These use cases generally require parallel data processing
frameworks, such as Apache Beam.\n",
"\n",
- "\n",
- "## Apache Beam DataFrames\n",
- "\n",
- "\n",
- "Beam DataFrames provide a pandas-like\n",
+ "Beam DataFrames provide a Pandas-like\n",
"API to declare and define Beam processing pipelines. It provides a
familiar interface for machine learning practitioners to build complex
data-processing pipelines by only invoking standard pandas commands.\n",
"\n",
"To learn more about Apache Beam DataFrames, see the\n",
"[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
"\n",
- "## Goal\n",
- "The goal of this notebook is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
+ "## Overview\n",
+ "The goal of this example is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
"\n",
- "\n",
- "## Tutorial outline\n",
- "\n",
- "This notebook demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. These steps include the following: \n",
+ "This example demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. This example includes the following steps: \n",
"\n",
"* Removing unwanted columns.\n",
"* One-hot encoding categorical columns.\n",
"* Normalizing numerical columns.\n",
"\n",
- "\n"
+ "In this example, the first section demonstrates how to build and
execute a pipeline locally using the interactive runner.\n",
+ "The second section uses a distributed runner to demonstrate how to
run the pipeline on the full dataset.\n",
Review Comment:
```suggestion
"The second section uses a distributed runner to demonstrate how to
run the pipeline on the full dataset.\n"
```
Looks like this notebook shows up as invalid -
https://github.com/apache/beam/blob/d29825911841635ba03c7d2d80943d0326d149de/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
- this may be the problem?
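
As context for the three preprocessing steps this hunk lists (removing unwanted columns, one-hot encoding, normalizing), here is a rough sketch in plain pandas, the API that Beam DataFrames mirrors. The toy frame and column names are invented for illustration, and not every call shown (e.g. `pd.get_dummies`) is guaranteed to behave identically in the deferred Beam DataFrame API:

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],                   # unwanted column
    "color": ["red", "blue", "red"],   # categorical column
    "size": [10.0, 20.0, 30.0],        # numerical column
})

# 1. Remove unwanted columns.
df = df.drop(columns=["id"])

# 2. One-hot encode categorical columns.
df = pd.get_dummies(df, columns=["color"])

# 3. Normalize numerical columns (min-max scaling shown here).
df["size"] = (df["size"] - df["size"].min()) / (df["size"].max() - df["size"].min())
```

With the Beam DataFrame API the same pandas-style calls build a deferred pipeline instead of executing eagerly.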
##########
examples/notebooks/beam-ml/run_inference_pytorch.ipynb:
##########
@@ -84,6 +84,13 @@
"metadata": {
"id": "loxD-rOVchRn"
},
+ "outputs": [],
+ "source": [
+ "!pip install apache_beam[gcp,dataframe] --quiet"
+ ],
+ "metadata": {
+ "id": "loxD-rOVchRn"
+ },
Review Comment:
Was this change intentional? It looks like it no-ops.
##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
Review Comment:
Until we have tests, we should probably hand-validate these notebooks every time before merging.
##########
examples/notebooks/beam-ml/run_inference_multi_model.ipynb:
##########
@@ -520,7 +520,6 @@
"\n",
" def process(self, element):\n",
" image_url, image = element \n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863\n",
Review Comment:
If not, we should at least comment on the issue itself with a note to update
the notebook when it is fixed
##########
examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb:
##########
@@ -41,25 +36,23 @@
"# KIND, either express or implied. See the License for the\n",
"# specific language governing permissions and limitations\n",
"# under the License"
- ]
- },
- {
- "cell_type": "markdown",
+ ],
"metadata": {
+ "cellView": "form",
"id": "faayYQYrQzY3"
- },
- "source": [
- "## Use RunInference in Apache Beam"
- ]
+ },
+ "execution_count": null,
+ "outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "JjAt1GesQ9sg"
},
"source": [
- "Starting with Apache Beam 2.40.0, you can use Apache Beam with the
RunInference API to use machine learning (ML) models for local and remote
inference with batch and streaming pipelines.\n",
- "The RunInference API leverages Apache Beam concepts, such as the
BatchElements transform and the Shared class, to support models in your
pipelines that create transforms optimized for machine learning inferences.\n",
+ "# Use RunInference in Apache Beam\n",
+ "You can use Apache Beam versions 2.40.0 and later with the
[RunInference
API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
to use machine learning (ML) models for local and remote inference with batch
and streaming pipelines.\n",
Review Comment:
"You can use Apache Beam versions 2.40.0 and later with the RunInference API
to use machine learning (ML) models for local and remote inference with batch
and streaming pipelines." - this reads a little awkwardly to me, specifically
"you can use X to use Y"
Maybe "You can use Apache Beam versions 2.40.0 and later with the
RunInference API for local and remote inference with batch and streaming
pipelines" instead? The "to use machine learning (ML) models" bit is probably
implied
##########
examples/notebooks/beam-ml/custom_remote_inference.ipynb:
##########
@@ -234,20 +238,20 @@
"id": "HLy7VKJhLrmT"
},
"source": [
- "### Custom DoFn\n",
+ "### Create a custom DoFn\n",
"\n",
"In order to implement remote inference, create a DoFn class. This
class sends a batch of images to the Cloud vision API.\n",
"\n",
"The custom DoFn makes it possible to initialize the API. In case of a
custom model, a model can also be loaded in the `setup` function. \n",
"\n",
- "The `process` function is the most interesting part. In this function
we implement the model call and return its results.\n",
+ "The `process` function is the most interesting part. In this
function, we implement the model call and return its results.\n",
"\n",
- "**Caution:** When running remote inference, prepare to encounter,
identify, and handle failure as gracefully as possible. We recommend using the
following techniques: \n",
+ "When running remote inference, prepare to encounter, identify, and
handle failure as gracefully as possible. We recommend using the following
techniques: \n",
"\n",
"* **Exponential backoff:** Retry failed remote calls with
exponentially growing pauses between retries. Using exponential backoff ensures
that failures don't lead to an overwhelming number of retries in quick
succession. \n",
"\n",
- "* **Dead letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
- "You can then run custom pipeline logic on the deadletter queue to log
the failure, alert, and push the failed message to temporary storage so that it
can eventually be reprocessed. "
+ "* **Dead-letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
+ "You can then run custom pipeline logic on the dead-letter queue
(unprocessed messages queue) to log the failure, alert, and push the failed
message to temporary storage so that it can eventually be reprocessed. "
Review Comment:
```suggestion
"You can then run custom pipeline logic on the dead-letter queue
(unprocessed messages queue) to log the failure, alert, and push the failed
message to temporary storage so that it can eventually be reprocessed."
```
Nit
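
The two techniques named in this hunk (exponential backoff and dead-letter queues) can be illustrated with a small pure-Python sketch. The helper names below are invented for illustration; in a real Beam pipeline the dead-letter route would typically be a tagged output of the DoFn rather than a Python list:

```python
import random
import time

def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn with exponentially growing pauses between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller route to the dead-letter side
            # Exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

failed = []  # stand-in for a dead-letter PCollection

def infer(element, fn, **retry_kwargs):
    """Run a remote call; on persistent failure, divert the element instead of crashing."""
    try:
        return call_with_backoff(fn, **retry_kwargs)
    except Exception:
        failed.append(element)  # later: log, alert, or push to temporary storage
        return None
```

This keeps transient API errors from killing the job while still capturing permanently failing elements for later reprocessing.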
##########
examples/notebooks/beam-ml/run_inference_multi_model.ipynb:
##########
Review Comment:
Hm, is there a better way to communicate that these kinds of workarounds are
temporary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]