bzablocki commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1329917400
##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,582 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Try Apache Beam - Python",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python2",
+ "display_name": "Python 2"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ "<a
href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\"
target=\"_parent\"><img
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In
Colab\"/></a>\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "source": [
+ "#@title ###### Licensed to the Apache Software Foundation (ASF), Version
2.0 (the \"License\")\n",
+ "\n",
+ "# Licensed to the Apache Software Foundation (ASF) under one\n",
+ "# or more contributor license agreements. See the NOTICE file\n",
+ "# distributed with this work for additional information\n",
+ "# regarding copyright ownership. The ASF licenses this file\n",
+ "# to you under the Apache License, Version 2.0 (the\n",
+ "# \"License\"); you may not use this file except in compliance\n",
+ "# with the License. You may obtain a copy of the License at\n",
+ "#\n",
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing,\n",
+ "# software distributed under the License is distributed on an\n",
+ "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ "# KIND, either express or implied. See the License for the\n",
+ "# specific language governing permissions and limitations\n",
+ "# under the License."
+ ],
+ "outputs": [],
+ "metadata": {
+ "cellView": "form"
+ }
+ },
+ {
+ "metadata": {
+ "id": "lNKIMlEDZ_Vw",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Try Apache Beam - YAML\n",
+ "\n",
+ "While Beam provides powerful APIs for authoring sophisticated data
processing pipelines, it still has a high barrier for getting started and
authoring simple pipelines. Even setting up the environment, installing the
dependencies, and setting up the project can be a challenge.\n",
+ "\n",
+ "Here we provide a simple YAML syntax for describing pipelines that does
not require coding experience or learning how to use an SDK. You can use any
text editor.\n",
+ "\n",
+ "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+ "\n",
+ "In this notebook, you set up your development environment and write a
simple pipeline using YAML. Then you run it locally, using the
[DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can
explore other runners with the [Beam Capability
Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+ "\n",
+ "To navigate through different sections, use the table of contents. From
**View** drop-down list, select **Table of contents**.\n",
+ "\n",
+ "To run a code cell, click the **Run cell** button at the top left of the
cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and
re-running it to see what happens.\n",
+ "\n",
+ "To learn more about Colab, see [Welcome to
Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Fz6KSQ13_3Rr",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Setup\n",
+ "\n",
+ "First, you need to set up your environment. The following code installs
`apache-beam` and downloads some text files from Cloud Storage to your local
file system. We'll use these text files as input to the pipelines."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "GOOk81Jj_yUy",
+ "colab_type": "code",
+ "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 170
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "# Run and print a shell command.\n",
+ "def run(cmd):\n",
+ " print('>> {}'.format(cmd))\n",
+ " !{cmd}\n",
+ " print('')\n",
+ "\n",
+ "def save_to_file(content, file_name):\n",
+ " with open(file_name, 'w') as f:\n",
+ " f.write(content)\n",
+ "\n",
+ "# Install apache-beam.\n",
+ "run('pip install --quiet apache-beam')\n",
+ "\n",
+ "# Copy the input files into the local file system.\n",
+ "run('mkdir -p data')\n",
+ "run('wget -O data/kinglear.txt
https://storage.googleapis.com/dataflow-samples/shakespeare/kinglear.txt')\n",
+ "run('wget -O data/SMSSpamCollection.csv
https://storage.googleapis.com/apache-beam-samples/SMSSpamCollection/SMSSpamCollection')"
+ ],
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Inspect the data\n",
+ "We'll be working with two datasets. We'll use `kinglear.txt` for the
first example, and `SMSSpamCollection.csv` for the second and third
examples.\n",
+ "Let's first take a look at the `kinglear.txt` dataset."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "run('head data/kinglear.txt')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This is just a `txt` file that contains lines of text.\n",
+ "Let's take a look at the other dataset."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "run('head data/SMSSpamCollection.csv')\n",
+ "run('wc -l data/SMSSpamCollection.csv')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This dataset is a `csv` file that contains 5,574 rows of SMS messages
labeled either spam or not-spam (\"ham\"). Each row contains two columns
separated by a tab character:\n",
+ "1. `Column 1`: The label, either `ham` or `spam`\n",
+ "2. `Column 2`: The SMS message as raw text (type `string`)"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
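+ {
+ "cell_type": "markdown",
+ "source": [
+ "As an optional sanity check (a small sketch added here; nothing later in the notebook depends on it), the next cell splits the first few rows on the tab character to show the label/message structure described above. It only uses the file downloaded in the Setup step."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "# Optional sanity check: each row should have the form '<label>\\t<message>'.\n",
+ "with open('data/SMSSpamCollection.csv') as f:\n",
+ "  for line in list(f)[:3]:\n",
+ "    label, message = line.rstrip('\\n').split('\\t', 1)\n",
+ "    print(label, '|', message[:60])"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },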
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Example 1: Word count\n",
+ "This example is a version of the
[WordCount](https://beam.apache.org/get-started/wordcount-example/)). It reads
lines of text from the input dataset `kinglear.txt` and counts the number of
times each word appears in the text.\n",
+ "To start, we'll create a `.yaml` file specifying our pipeline."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "pipeline = '''\n",
+ "pipeline:\n",
+ " # Read input data. Each line from the txt file is a String.\n",
+ " - type: ReadFromText\n",
+ " name: InputText\n",
+ " config:\n",
+ " file_pattern: data/kinglear.txt\n",
+ "\n",
+ " # Using a regex, we'll split the content of the message (one long
string) into words (list of strings).\n",
+ " # The 'fn' parameter accepts functions written in Python\n",
+ " - type: PyFlatMap\n",
+ " name: FindWords\n",
+ " input: InputText\n",
+ " config:\n",
+ " fn: |\n",
+ " import re\n",
+ " lambda line: re.findall(r\"[a-zA-Z]+\", line)\n",
+ "\n",
+ " # Transforming each word to lower case and combining it with a '1'.
Result of this step are pairs (word: 1).\n",
+ " - type: PyMap\n",
+ " name: PairWordsWith1\n",
+ " input: FindWords\n",
+ " config:\n",
+ " fn: 'lambda word: (word, 1)'\n",
+ "\n",
+ " # Using CombinePerKey transform with the 'sum' function as a combine
function,\n",
+ " # we'll calculate the occurrence of each word.\n",
+ " - type: CombinePerKey\n",
+ " config:\n",
+ " combine_fn: sum\n",
+ " name: GroupAndSum\n",
+ " input: PairWordsWith1\n",
+ "\n",
+ " # Format results - each record should be represented as 'word:
count'.\n",
+ " # The 'fn' parameter accepts functions written in Python\n",
+ " - type: PyMap\n",
+ " name: FormatResults\n",
+ " input: GroupAndSum\n",
+ " config:\n",
+ " fn: \"lambda word_count_tuple: f'{word_count_tuple[0]}:
{word_count_tuple[1]}'\"\n",
+ "\n",
+ " # Save results to a text file.\n",
+ " - type: WriteToText\n",
+ " name: SaveToText\n",
+ " input: FormatResults\n",
+ " config:\n",
+ " file_path_prefix: \"data/result-pipeline-01\"\n",
+ " file_name_suffix: \".txt\"\n",
+ "'''\n",
+ "save_to_file(pipeline, 'pipeline-01.yaml')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
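+ {
+ "cell_type": "markdown",
+ "source": [
+ "Before walking through the structure of this file, here is a sketch of how such a spec can be run locally. It assumes an `apache-beam` release that ships the YAML entry point (`python -m apache_beam.yaml.main`); the `--yaml_pipeline_file` flag name may differ between versions, and the pipeline runs on the DirectRunner by default."
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "outputs": [],
+ "source": [
+ "# Run the YAML spec saved above on the local DirectRunner (the default runner).\n",
+ "# Assumes the installed apache-beam release includes the YAML module.\n",
+ "run('python -m apache_beam.yaml.main --yaml_pipeline_file=pipeline-01.yaml')\n",
+ "\n",
+ "# Inspect the output shard(s) written by the WriteToText transform.\n",
+ "run('head data/result-pipeline-01*.txt')"
+ ],
+ "metadata": {
+ "collapsed": false
+ }
+ },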
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Each pipeline specification must start with a `pipeline` key on the first
line.\n",
+ "The `pipeline` keyh is followed by a list of transforms. For example, the
first transform reads the input file:\n",
+ "```\n",
+ " # Read input data. Each line from the csv file is a String.\n",
+ " - type: ReadFromText\n",
+ " name: InputText\n",
+ " config:\n",
+ " file_pattern: data/kinglear.txt\n",
+ "```\n",
+ "Note: The indentation is important, because it specifies object
hierarchy.\n",
+ "YAML supports comments. Everything after the `#` is always treated as a
comment. Use them to improve readability.\n",
+ "\n",
+ "Each operation must specify the `type` descriptor and other fields, such
as `name` and other transform-specific parameters.\n",
+ "For a list of available transforms and their parameters, see the YAML API
documentation. # todo(yaml) add link\n",
Review Comment:
Hi @amotley, this comment is also still open.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]