derrickaw commented on code in PR #35568:
URL: https://github.com/apache/beam/pull/35568#discussion_r2208306255
##########
sdks/python/apache_beam/yaml/examples/transforms/ml/taxi-fare/custom_nyc_taxifare_model_deployment.ipynb:
##########
@@ -0,0 +1,809 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "source": [
+    "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License"
+ ],
+ "metadata": {
+ "id": "ZpDmaAwXuRnG"
+ },
+ "id": "ZpDmaAwXuRnG",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# NYC Taxi Fare Prediction - Model Training and Deployment\n",
+ "\n",
+ "<table><tbody><tr>\n",
+ " <td style=\"text-align: center\">\n",
+    "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fapache%2Fbeam%2Fblob%2Fmaster%2Fsdks%2Fpython%2Fapache_beam%2Fyaml%2Fexamples%2Ftransforms%2Fml%2Ftaxi-fare%2Fcustom_nyc_taxifare_model_deployment.ipynb\">\n",
+    "      <img alt=\"Google Cloud Colab Enterprise logo\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" width=\"32px\"><br> Run in Colab Enterprise\n",
+ " </a>\n",
+ " </td>\n",
+ " <td style=\"text-align: center\">\n",
+    "    <a href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/taxi-fare/custom_nyc_taxifare_model_deployment.ipynb\">\n",
+    "      <img alt=\"GitHub logo\" src=\"https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png\" width=\"32px\"><br> View on GitHub\n",
+ " </a>\n",
+ " </td>\n",
+ "</tr></tbody></table>\n"
+ ],
+ "metadata": {
+ "id": "m916RPCn0NSS"
+ },
+ "id": "m916RPCn0NSS"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Overview\n",
+ "\n",
+    "This notebook demonstrates the training and deployment of a custom tabular regression model for online prediction.\n",
+ "\n",
+    "We'll train a [gradient-boosted decision tree (GBDT) model](https://en.wikipedia.org/wiki/Gradient_boosting) using [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) to predict the fare of a taxi trip in New York City, given information such as the pick-up date and time, pick-up location, drop-off location, and passenger count. The dataset is from the [Kaggle competition](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) organized by Google Cloud.\n",
+ "\n",
+    "After model training and evaluation, we'll use the Vertex AI Python SDK to upload this custom model to the Vertex AI Model Registry and deploy it to perform remote inferences at scale. The preferred way to run this notebook is within Colab Enterprise.\n",
+ "\n",
+ "## Outline\n",
+ "1. Dataset\n",
+ "\n",
+ "2. Training\n",
+ "\n",
+ "3. Evaluation\n",
+ "\n",
+ "4. Deployment\n",
+ "\n",
+    "5. References"
+ ],
+ "metadata": {
+ "id": "jGbLxUoraooN"
+ },
+ "id": "jGbLxUoraooN"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We first install and import the necessary libraries to run this notebook."
+ ],
+ "metadata": {
+ "id": "e6zO5wWaMhaX"
+ },
+ "id": "e6zO5wWaMhaX"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!pip3 install --quiet --upgrade \\\n",
+ " opendatasets \\\n",
+ " google-cloud-storage \\\n",
+ " google-cloud-aiplatform \\\n",
+ " scikit-learn \\\n",
+ " xgboost \\\n",
+ " pandas"
+ ],
+ "metadata": {
+ "id": "weUpgu9Y1OoF"
+ },
+ "id": "weUpgu9Y1OoF",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import opendatasets as od\n",
+ "import pandas as pd\n",
+ "import random\n",
+ "import time\n",
+ "import os\n",
+ "\n",
+ "from xgboost import XGBRegressor\n",
+ "from sklearn.model_selection import train_test_split\n",
+ "from sklearn.metrics import root_mean_squared_error\n",
+ "\n",
+ "import google.cloud.storage as storage\n",
+ "import google.cloud.aiplatform as vertex"
+ ],
+ "metadata": {
+ "id": "KJTsSdQKSN_m"
+ },
+ "id": "KJTsSdQKSN_m",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Dataset\n",
+ "\n",
+    "We use the `opendatasets` library to programmatically download the dataset from Kaggle.\n",
+ "\n",
+    "We'll first need a Kaggle account and to register for this competition. We'll also need the API key, which is stored in the `kaggle.json` file downloaded automatically when you create an API token. To create one, go to your *Profile* picture -> *Settings* -> *API* -> *Create New Token*.\n",
+ "\n",
+    "The dataset download will prompt you to enter your Kaggle username and key. Copy this information from `kaggle.json`.\n"
+ ],
+ "metadata": {
+ "id": "DVWcleCz1AVl"
+ },
+ "id": "DVWcleCz1AVl"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+    "dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'\n",
+ "od.download(dataset_url)"
+ ],
+ "metadata": {
+ "id": "8D-KUYKD1lg4"
+ },
+ "id": "8D-KUYKD1lg4",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "Among the downloaded files, we'll primarily make use of the `train.csv` training dataset for training and evaluating our model, along with the `test.csv` testing dataset."
+ ],
+ "metadata": {
+ "id": "NMCdiinpTF0W"
+ },
+ "id": "NMCdiinpTF0W"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "data_dir = 'new-york-city-taxi-fare-prediction'\n",
+    "!ls -l {data_dir}"
+ ],
+ "metadata": {
+ "id": "rmlERXShR457"
+ },
+ "id": "rmlERXShR457",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "The training dataset contains approximately 55M rows. Reading the entire dataset into a pandas DataFrame (i.e. loading the entire dataset into memory) is slow and memory-consuming, and can affect operations in later parts of the notebook. For the purpose of experimenting with our model, it is also unnecessary.\n",
+    "\n",
+    "A good practice is to sample a small percentage of the training dataset."
+ ],
+ "metadata": {
+ "id": "Yv9Kq0v2T_1g"
+ },
+ "id": "Yv9Kq0v2T_1g"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "p = 0.01\n",
+ "# keep the header, then take only 1% of rows\n",
+    "# a row is skipped when a random draw from the [0, 1] interval is greater than p\n",
+ "df_train_val = pd.read_csv(\n",
+ " data_dir + \"/train.csv\",\n",
+ " header=0,\n",
+ " parse_dates = ['pickup_datetime'],\n",
+ " skiprows=lambda i: i > 0 and random.random() > p\n",
+ ")\n",
+ "df_train_val.shape"
+ ],
+ "metadata": {
+ "id": "epJNJkp1W7P_"
+ },
+ "id": "epJNJkp1W7P_",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The training dataset, now as a DataFrame table, can be further inspected."
+ ],
+ "metadata": {
+ "id": "bzRYmrc-YDdd"
+ },
+ "id": "bzRYmrc-YDdd"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val.columns"
+ ],
+ "metadata": {
+ "id": "AkJ2-w3BW7dD"
+ },
+ "id": "AkJ2-w3BW7dD",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val.info()"
+ ],
+ "metadata": {
+ "id": "AxAMXNTiKe2D"
+ },
+ "id": "AxAMXNTiKe2D",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train_val"
+ ],
+ "metadata": {
+ "id": "4LFxT3Zec8tX"
+ },
+ "id": "4LFxT3Zec8tX",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "The testing dataset is much smaller and doesn't have the `fare_amount` column. Likewise, we can read it into a DataFrame and inspect the data."
+ ],
+ "metadata": {
+ "id": "xxPAnGR1ZDwf"
+ },
+ "id": "xxPAnGR1ZDwf"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+    "df_test = pd.read_csv(data_dir + \"/test.csv\", parse_dates=['pickup_datetime'])\n",
+ "df_test.columns"
+ ],
+ "metadata": {
+ "id": "cWBexIlsW_4u"
+ },
+ "id": "cWBexIlsW_4u",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_test"
+ ],
+ "metadata": {
+ "id": "bch6SYLxL51_"
+ },
+ "id": "bch6SYLxL51_",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+    "We'll set aside 20% of the training data as the validation set, to evaluate the model on previously unseen data."
+ ],
+ "metadata": {
+ "id": "SOeJDlsCZcY0"
+ },
+ "id": "SOeJDlsCZcY0"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train, df_val = train_test_split(\n",
+ " df_train_val,\n",
+ " test_size=0.2,\n",
+    "    random_state=42  # set random_state to some constant so we always have the same training and validation data\n",
+ ")\n",
+ "\n",
+ "print(\"Training dataset's shape: \", df_train.shape)\n",
+ "print(\"Validation dataset's shape: \", df_val.shape)"
+ ],
+ "metadata": {
+ "id": "qVN1ygVGOH33"
+ },
+ "id": "qVN1ygVGOH33",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Training\n",
+ "\n",
+    "For a quick '0-to-1' model serving on Vertex AI, the model training process below is kept straightforward, using the simple yet very effective [tree-based, gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm. We start off with a simple feature engineering idea, before moving on to the actual training of the model using the [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) library.\n"
+ ],
+ "metadata": {
+ "id": "4Ov7Efuy1Gyj"
+ },
+ "id": "4Ov7Efuy1Gyj"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Simple Feature Engineering\n",
+ "\n",
+    "One of the columns in the dataset is the `pickup_datetime` column, which is of [datetimelike](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) type. This makes it incredibly easy for performing data analysis on time-series data such as this. However, ML models don't accept feature columns with such a custom data type that is not a number. Some sort of conversion is needed, and here we'll choose to break this datetime column into multiple feature columns.\n"
Review Comment:
```suggestion
    "One of the columns in the dataset is the `pickup_datetime` column, which is of [datetime like](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.html) type. This makes it incredibly easy for performing data analysis on time-series data such as this. However, ML models don't accept feature columns with such a custom data type that is not a number. Some sort of conversion is needed, and here we'll choose to break this datetime column into multiple feature columns.\n"
```
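For readers following the quoted cell, the "break this datetime column into multiple feature columns" idea can be sketched with the pandas `.dt` accessor. The tiny frame below is invented for illustration, and the derived column names are one plausible choice, not necessarily the ones the notebook ends up using:

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle taxi data; only the
# pickup_datetime column matters here.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2013-07-02 19:54:00", "2014-03-15 08:10:00"]
    )
})

# Break the single datetime column into several numeric feature columns
# using the pandas .dt accessor.
df["year"] = df["pickup_datetime"].dt.year
df["month"] = df["pickup_datetime"].dt.month
df["day"] = df["pickup_datetime"].dt.day
df["weekday"] = df["pickup_datetime"].dt.weekday  # Monday == 0
df["hour"] = df["pickup_datetime"].dt.hour

print(df[["year", "month", "day", "weekday", "hour"]])
```

Every derived column is an integer, so a model such as `XGBRegressor` can consume them directly.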
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]