damccorm commented on code in PR #24437:
URL: https://github.com/apache/beam/pull/24437#discussion_r1036443160
##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
@@ -38,29 +38,23 @@
"\n",
"For rapid execution, Pandas loads all of the data into memory on a
single machine (one node). This configuration works well when dealing with
small-scale datasets. However, many projects involve datasets that are too big
to fit in memory. These use cases generally require parallel data processing
frameworks, such as Apache Beam.\n",
"\n",
- "\n",
- "## Apache Beam DataFrames\n",
- "\n",
- "\n",
- "Beam DataFrames provide a pandas-like\n",
+ "Beam DataFrames provide a Pandas-like\n",
"API to declare and define Beam processing pipelines. It provides a
familiar interface for machine learning practitioners to build complex
data-processing pipelines by only invoking standard pandas commands.\n",
"\n",
"To learn more about Apache Beam DataFrames, see the\n",
"[Beam DataFrames
overview](https://beam.apache.org/documentation/dsls/dataframes/overview)
page.\n",
"\n",
- "## Goal\n",
- "The goal of this notebook is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
+ "## Overview\n",
+ "The goal of this example is to explore a dataset preprocessed with
the Beam DataFrame API for machine learning model training.\n",
"\n",
- "\n",
- "## Tutorial outline\n",
- "\n",
- "This notebook demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. These steps include the following: \n",
+ "This example demonstrates the use of the Apache Beam DataFrames API
to perform common data exploration as well as the preprocessing steps that are
necessary to prepare your dataset for machine learning model training and
inference. This example includes the following steps: \n",
"\n",
"* Removing unwanted columns.\n",
"* One-hot encoding categorical columns.\n",
"* Normalizing numerical columns.\n",
"\n",
- "\n"
+ "In this example, the first section demonstrates how to build and
execute a pipeline locally using the interactive runner.\n",
+ "The second section uses a distributed runner to demonstrate how to
run the pipeline on the full dataset.\n",
Review Comment:
```suggestion
"The second section uses a distributed runner to demonstrate how to
run the pipeline on the full dataset.\n"
```
Looks like this notebook shows up as invalid -
https://github.com/apache/beam/blob/d29825911841635ba03c7d2d80943d0326d149de/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb
- this may be the problem?
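
As context for the three preprocessing steps this hunk lists (removing unwanted columns, one-hot encoding, normalizing), here is a rough sketch in plain pandas, the API that Beam DataFrames mirrors. The toy frame and column names are invented for illustration, and not every call shown (e.g. `pd.get_dummies`) is guaranteed to behave identically in the deferred Beam DataFrame API:

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],                   # unwanted column
    "color": ["red", "blue", "red"],   # categorical column
    "size": [10.0, 20.0, 30.0],        # numerical column
})

# 1. Remove unwanted columns.
df = df.drop(columns=["id"])

# 2. One-hot encode categorical columns.
df = pd.get_dummies(df, columns=["color"])

# 3. Normalize numerical columns (min-max scaling shown here).
df["size"] = (df["size"] - df["size"].min()) / (df["size"].max() - df["size"].min())
```

With the Beam DataFrame API the same pandas-style calls build a deferred pipeline instead of executing eagerly.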
##########
examples/notebooks/beam-ml/run_inference_pytorch.ipynb:
##########
@@ -84,6 +84,13 @@
"metadata": {
"id": "loxD-rOVchRn"
},
+ "outputs": [],
+ "source": [
+ "!pip install apache_beam[gcp,dataframe] --quiet"
+ ],
+ "metadata": {
+ "id": "loxD-rOVchRn"
+ },
Review Comment:
Was this change intentional? It looks like it no-ops.
##########
examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb:
##########
Review Comment:
Until we have tests, we should probably hand-validate these notebooks every time before merging.
##########
examples/notebooks/beam-ml/run_inference_multi_model.ipynb:
##########
@@ -520,7 +520,6 @@
"\n",
" def process(self, element):\n",
" image_url, image = element \n",
- " # Update this step when this ticket is resolved:
https://github.com/apache/beam/issues/21863\n",
Review Comment:
If not, we should at least comment on the issue itself with a note to update
the notebook when it is fixed
##########
examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb:
##########
@@ -41,25 +36,23 @@
"# KIND, either express or implied. See the License for the\n",
"# specific language governing permissions and limitations\n",
"# under the License"
- ]
- },
- {
- "cell_type": "markdown",
+ ],
"metadata": {
+ "cellView": "form",
"id": "faayYQYrQzY3"
- },
- "source": [
- "## Use RunInference in Apache Beam"
- ]
+ },
+ "execution_count": null,
+ "outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "JjAt1GesQ9sg"
},
"source": [
- "Starting with Apache Beam 2.40.0, you can use Apache Beam with the
RunInference API to use machine learning (ML) models for local and remote
inference with batch and streaming pipelines.\n",
- "The RunInference API leverages Apache Beam concepts, such as the
BatchElements transform and the Shared class, to support models in your
pipelines that create transforms optimized for machine learning inferences.\n",
+ "# Use RunInference in Apache Beam\n",
+ "You can use Apache Beam versions 2.40.0 and later with the
[RunInference
API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference)
to use machine learning (ML) models for local and remote inference with batch
and streaming pipelines.\n",
Review Comment:
"You can use Apache Beam versions 2.40.0 and later with the RunInference API
to use machine learning (ML) models for local and remote inference with batch
and streaming pipelines." - this reads a little awkwardly to me, specifically
"you can use X to use Y"
Maybe "You can use Apache Beam versions 2.40.0 and later with the
RunInference API for local and remote inference with batch and streaming
pipelines" instead? The "to use machine learning (ML) models" bit is probably
implied
##########
examples/notebooks/beam-ml/custom_remote_inference.ipynb:
##########
@@ -234,20 +238,20 @@
"id": "HLy7VKJhLrmT"
},
"source": [
- "### Custom DoFn\n",
+ "### Create a custom DoFn\n",
"\n",
"In order to implement remote inference, create a DoFn class. This
class sends a batch of images to the Cloud vision API.\n",
"\n",
"The custom DoFn makes it possible to initialize the API. In case of a
custom model, a model can also be loaded in the `setup` function. \n",
"\n",
- "The `process` function is the most interesting part. In this function
we implement the model call and return its results.\n",
+ "The `process` function is the most interesting part. In this
function, we implement the model call and return its results.\n",
"\n",
- "**Caution:** When running remote inference, prepare to encounter,
identify, and handle failure as gracefully as possible. We recommend using the
following techniques: \n",
+ "When running remote inference, prepare to encounter, identify, and
handle failure as gracefully as possible. We recommend using the following
techniques: \n",
"\n",
"* **Exponential backoff:** Retry failed remote calls with
exponentially growing pauses between retries. Using exponential backoff ensures
that failures don't lead to an overwhelming number of retries in quick
succession. \n",
"\n",
- "* **Dead letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
- "You can then run custom pipeline logic on the deadletter queue to log
the failure, alert, and push the failed message to temporary storage so that it
can eventually be reprocessed. "
+ "* **Dead-letter queues:** Route failed inferences to a separate
`PCollection` without failing the whole transform. You can continue execution
without failing the job (batch jobs' default behavior) or retrying indefinitely
(streaming jobs' default behavior).\n",
+ "You can then run custom pipeline logic on the dead-letter queue
(unprocessed messages queue) to log the failure, alert, and push the failed
message to temporary storage so that it can eventually be reprocessed. "
Review Comment:
```suggestion
"You can then run custom pipeline logic on the dead-letter queue
(unprocessed messages queue) to log the failure, alert, and push the failed
message to temporary storage so that it can eventually be reprocessed."
```
Nit
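
The two techniques named in this hunk (exponential backoff and dead-letter queues) can be illustrated with a small pure-Python sketch. The helper names below are invented for illustration; in a real Beam pipeline the dead-letter route would typically be a tagged output of the DoFn rather than a Python list:

```python
import random
import time

def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn with exponentially growing pauses between attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller route to the dead-letter side
            # Exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

failed = []  # stand-in for a dead-letter PCollection

def infer(element, fn, **retry_kwargs):
    """Run a remote call; on persistent failure, divert the element instead of crashing."""
    try:
        return call_with_backoff(fn, **retry_kwargs)
    except Exception:
        failed.append(element)  # later: log, alert, or push to temporary storage
        return None
```

This keeps transient API errors from killing the job while still capturing permanently failing elements for later reprocessing.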
##########
examples/notebooks/beam-ml/run_inference_multi_model.ipynb:
##########
Review Comment:
Hm, is there a better way to communicate that these kinds of workarounds are
temporary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]