derrickaw commented on code in PR #35568:
URL: https://github.com/apache/beam/pull/35568#discussion_r2208306255
##########
sdks/python/apache_beam/yaml/examples/transforms/ml/taxi-fare/custom_nyc_taxifare_model_deployment.ipynb:
##########
@@ -0,0 +1,809 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "source": [
+    "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "ZpDmaAwXuRnG"
+ },
+ "id": "ZpDmaAwXuRnG",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# NYC Taxi Fare Prediction - Model Training and Deployment\n",
+ "\n",
+ "<table><tbody><tr>\n",
+ " <td style=\"text-align: center\">\n",
+    "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fapache%2Fbeam%2Fblob%2Fmaster%2Fsdks%2Fpython%2Fapache_beam%2Fyaml%2Fexamples%2Ftransforms%2Fml%2Ftaxi-fare%2Fcustom_nyc_taxifare_model_deployment.ipynb\">\n",
+    "      <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
+ " </a>\n",
+ " </td>\n",
+ " <td style=\"text-align: center\">\n",
+    "    <a href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi-fare/custom_nyc_taxifare_model_deployment.ipynb\">\n",
+    "      <img alt=\"GitHub logo\" src=\"https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png\" width=\"32px\"><br> View on GitHub\n",
+ " </a>\n",
+ " </td>\n",
+ "</tr></tbody></table>\n"
+ ],
+ "metadata": {
+ "id": "m916RPCn0NSS"
+ },
+ "id": "m916RPCn0NSS"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Overview\n",
+ "\n",
+    "This notebook demonstrates the training and deployment of a custom tabular regression model for online prediction.\n",
+ "\n",
+    "We'll train a [gradient-boosted decision tree (GBDT) model](https://en.wikipedia.org/wiki/Gradient_boosting) using [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) to predict the fare of a taxi trip in New York City, given information such as the pick-up date and time, pick-up location, drop-off location, and passenger count. The dataset is from the [Kaggle competition](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) organized by Google Cloud.\n",
+ "\n",
+    "After model training and evaluation, we'll use the Vertex AI Python SDK to upload this custom model to the Vertex AI Model Registry and deploy it to perform remote inferences at scale. The preferred way to run this notebook is within Colab Enterprise.\n",
+ "\n",
+ "## Outline\n",
+ "1. Dataset\n",
+ "\n",
+ "2. Training\n",
+ "\n",
+ "3. Evaluation\n",
+ "\n",
+ "4. Deployment\n",
+ "\n",
+    "5. References"
+ ],
+ "metadata": {
+ "id": "jGbLxUoraooN"
+ },
+ "id": "jGbLxUoraooN"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We first install and import the necessary libraries to run this notebook."
+ ],
+ "metadata": {
+ "id": "e6zO5wWaMhaX"
+ },
+ "id": "e6zO5wWaMhaX"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip3 install --quiet --upgrade \\\n",
+ " opendatasets \\\n",
+ " google-cloud-storage \\\n",
+ " google-cloud-aiplatform \\\n",
+ " scikit-learn \\\n",
+ " xgboost \\\n",
+ " pandas"
+ ],
+ "metadata": {
+ "id": "weUpgu9Y1OoF"
+ },
+ "id": "weUpgu9Y1OoF",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import opendatasets as od\n",
+ "import pandas as pd\n",
+ "import random\n",
+ "import time\n",
+ "import os\n",
+ "\n",
+ "from xgboost import XGBRegressor\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import root_mean_squared_error\n",
+ "\n",
+ "import google.cloud.storage as storage\n",
+ "import google.cloud.aiplatform as vertex"
+ ],
+ "metadata": {
+ "id": "KJTsSdQKSN_m"
+ },
+ "id": "KJTsSdQKSN_m",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Dataset\n",
+ "\n",
+    "We use the `opendatasets` library to programmatically download the dataset from Kaggle.\n",
+ "\n",
+    "We'll first need a Kaggle account and to register for this competition. We'll also need the API key, which is stored in the `kaggle.json` file downloaded automatically when you create an API token. To create one, go to your *Profile* picture -> *Settings* -> *API* -> *Create New Token*.\n",
+ "\n",
+    "The dataset download will prompt you to enter your Kaggle username and key. Copy this information from `kaggle.json`.\n"
+ ],
+ "metadata": {
+ "id": "DVWcleCz1AVl"
+ },
+ "id": "DVWcleCz1AVl"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+    "dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'\n",
+ "od.download(dataset_url)"
+ ],
+ "metadata": {
+ "id": "8D-KUYKD1lg4"
+ },
+ "id": "8D-KUYKD1lg4",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "Among the downloaded files, we'll primarily make use of the `train.csv` training dataset for training and evaluating our model, along with the `test.csv` testing dataset."
+ ],
+ "metadata": {
+ "id": "NMCdiinpTF0W"
+ },
+ "id": "NMCdiinpTF0W"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "data_dir = 'new-york-city-taxi-fare-prediction'\n",
+    "!ls -l {data_dir}"
+ ],
+ "metadata": {
+ "id": "rmlERXShR457"
+ },
+ "id": "rmlERXShR457",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "The training dataset contains approximately 55M rows. Reading the entire dataset into a pandas DataFrame (i.e. loading the entire dataset into memory) is slow and memory-consuming, and can affect operations in later parts of the notebook. For the purpose of experimenting with our model, it is also unnecessary.\n",
+    "\n",
+    "A good practice is to sample a small percentage of the training dataset."
+ ],
+ "metadata": {
+ "id": "Yv9Kq0v2T_1g"
+ },
+ "id": "Yv9Kq0v2T_1g"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "p = 0.01\n",
+ "# keep the header, then take only 1% of rows\n",
+    "# a row is skipped when a random draw from the [0, 1] interval is greater than p\n",
+ "df_train_val = pd.read_csv(\n",
+ " data_dir + \"/train.csv\",\n",
+ " header=0,\n",
+ " parse_dates = ['pickup_datetime'],\n",
+ " skiprows=lambda i: i > 0 and random.random() > p\n",
+ ")\n",
+ "df_train_val.shape"
+ ],
+ "metadata": {
+ "id": "epJNJkp1W7P_"
+ },
+ "id": "epJNJkp1W7P_",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The training dataset, now as a DataFrame table, can be further inspected."
+ ],
+ "metadata": {
+ "id": "bzRYmrc-YDdd"
+ },
+ "id": "bzRYmrc-YDdd"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val.columns"
+ ],
+ "metadata": {
+ "id": "AkJ2-w3BW7dD"
+ },
+ "id": "AkJ2-w3BW7dD",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val.info()"
+ ],
+ "metadata": {
+ "id": "AxAMXNTiKe2D"
+ },
+ "id": "AxAMXNTiKe2D",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val"
+ ],
+ "metadata": {
+ "id": "4LFxT3Zec8tX"
+ },
+ "id": "4LFxT3Zec8tX",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "The testing dataset is much smaller and doesn't have the `fare_amount` column. Likewise, we can read it into a DataFrame and inspect the data."
+ ],
+ "metadata": {
+ "id": "xxPAnGR1ZDwf"
+ },
+ "id": "xxPAnGR1ZDwf"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+    "df_test = pd.read_csv(data_dir + \"/test.csv\", parse_dates=['pickup_datetime'])\n",
+ "df_test.columns"
+ ],
+ "metadata": {
+ "id": "cWBexIlsW_4u"
+ },
+ "id": "cWBexIlsW_4u",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_test"
+ ],
+ "metadata": {
+ "id": "bch6SYLxL51_"
+ },
+ "id": "bch6SYLxL51_",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "We'll set aside 20% of the training data as the validation set, to evaluate the model on previously unseen data."
+ ],
+ "metadata": {
+ "id": "SOeJDlsCZcY0"
+ },
+ "id": "SOeJDlsCZcY0"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train, df_val = train_test_split(\n",
+ " df_train_val,\n",
+ " test_size=0.2,\n",
+    "    random_state=42  # set random_state to some constant so we always have the same training and validation data\n",
+ ")\n",
+ "\n",
+ "print(\"Training dataset's shape: \", df_train.shape)\n",
+ "print(\"Validation dataset's shape: \", df_val.shape)"
+ ],
+ "metadata": {
+ "id": "qVN1ygVGOH33"
+ },
+ "id": "qVN1ygVGOH33",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Training\n",
+ "\n",
+    "For a quick '0-to-1' model serving on Vertex AI, the model training process below is kept straightforward, using the simple yet very effective [tree-based, gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm. We start off with a simple feature engineering idea, before moving on to the actual training of the model using the [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) library.\n"
+ ],
+ "metadata": {
+ "id": "4Ov7Efuy1Gyj"
+ },
+ "id": "4Ov7Efuy1Gyj"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Simple Feature Engineering\n",
+ "\n",
+    "One of the columns in the dataset is the `pickup_datetime` column, which is of [datetimelike](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) type. This makes it incredibly easy for performing data analysis on time-series data such as this. However, ML models don't accept feature columns with such a custom data type that is not a number. Some sort of conversion is needed, and here we'll choose to break this datetime column into multiple feature columns.\n"
Review Comment:
```suggestion
    "One of the columns in the dataset is the `pickup_datetime` column, which is of [datetime like](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) type. This makes it incredibly easy for performing data analysis on time-series data such as this. However, ML models don't accept feature columns with such a custom data type that is not a number. Some sort of conversion is needed, and here we'll choose to break this datetime column into multiple feature columns.\n"
```
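For readers following the quoted cell, the "break this datetime column into multiple feature columns" idea can be sketched with the pandas `.dt` accessor. The tiny frame below is invented for illustration, and the derived column names are one plausible choice, not necessarily the ones the notebook ends up using:

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle taxi data; only the
# pickup_datetime column matters here.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2013-07-02 19:54:00", "2014-03-15 08:10:00"]
    )
})

# Break the single datetime column into several numeric feature columns
# using the pandas .dt accessor.
df["year"] = df["pickup_datetime"].dt.year
df["month"] = df["pickup_datetime"].dt.month
df["day"] = df["pickup_datetime"].dt.day
df["weekday"] = df["pickup_datetime"].dt.weekday  # Monday == 0
df["hour"] = df["pickup_datetime"].dt.hour

print(df[["year", "month", "day", "weekday", "hour"]])
```

Every derived column is an integer, so a model such as `XGBRegressor` can consume them directly.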
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]