This is an automated email from the ASF dual-hosted git repository. git-site-role pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/asf-site by this push: new bb4f46e Publishing website 2020/12/18 00:01:29 at commit 1542971 bb4f46e is described below commit bb4f46e76b191ce55dd619f2bd079a1a3f05bffb Author: jenkins <bui...@apache.org> AuthorDate: Fri Dec 18 00:01:29 2020 +0000 Publishing website 2020/12/18 00:01:29 at commit 1542971 --- website/generated-content/documentation/index.xml | 265 ++++++++++++++------- .../io/built-in/google-bigquery/index.html | 4 +- .../documentation/runtime/environments/index.html | 149 ++++++++---- website/generated-content/sitemap.xml | 2 +- 4 files changed, 285 insertions(+), 135 deletions(-) diff --git a/website/generated-content/documentation/index.xml b/website/generated-content/documentation/index.xml index e45b820..a8a66a1 100644 --- a/website/generated-content/documentation/index.xml +++ b/website/generated-content/documentation/index.xml @@ -10024,114 +10024,201 @@ See the License for the specific language governing permissions and limitations under the License. --> <h1 id="container-environments">Container environments</h1> -<p>The Beam SDK runtime environment is isolated from other runtime systems because the SDK runtime environment is <a href="https://s.apache.org/beam-fn-api-container-contract">containerized</a> with <a href="https://www.docker.com/">Docker</a>. 
This means that any execution engine can run the Beam SDK.</p> -<p>This page describes how to customize, build, and push Beam SDK container images.</p> -<p>Before you begin, install <a href="https://www.docker.com/">Docker</a> on your workstation.</p> -<h2 id="customizing-container-images">Customizing container images</h2> -<p>You can add extra dependencies to container images so that you don&rsquo;t have to supply the dependencies to execution engines.</p> -<p>To customize a container image, either:</p> -<ul> -<li><a href="#writing-new-dockerfiles">Write a new</a> <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile</a> on top of the original.</li> -<li><a href="#modifying-dockerfiles">Modify</a> the <a href="https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile">original Dockerfile</a> and reimage the container.</li> -</ul> -<p>It&rsquo;s often easier to write a new Dockerfile. However, by modifying the original Dockerfile, you can customize anything (including the base OS).</p> -<h3 id="writing-new-dockerfiles">Writing new Dockerfiles on top of the original</h3> +<p>The Beam SDK runtime environment can be <a href="https://www.docker.com/resources/what-container">containerized</a> with <a href="https://www.docker.com/">Docker</a> to isolate it from other runtime systems. 
To learn more about the container environment, read the Beam <a href="https://s.apache.org/beam-fn-api-container-contract">SDK Harness container contract</a>.</p> +<p>Prebuilt SDK container images are released per supported language during Beam releases and pushed to <a href="https://hub.docker.com/search?q=apache%2Fbeam&amp;type=image">Docker Hub</a>.</p> +<h2 id="custom-containers">Custom containers</h2> +<p>You may want to customize container images for many reasons, including:</p> +<ul> +<li>Pre-installing additional dependencies</li> +<li>Launching third-party software in the worker environment</li> +<li>Further customizing the execution environment</li> +</ul> +<p>This guide describes how to create and use customized containers for the Beam SDK.</p> +<h3 id="prerequisites">Prerequisites</h3> +<ul> +<li>This guide requires building images using Docker. <a href="https://docs.docker.com/get-docker/">Install Docker locally</a>. Some CI/CD platforms like <a href="https://cloud.google.com/cloud-build/docs/building/build-containers">Google Cloud Build</a> also provide the ability to build images using Docker.</li> +<li>For remote execution engines/runners, have a container registry to host your custom container image. Options include <a href="https://hub.docker.com/">Docker Hub</a> or a &ldquo;self-hosted&rdquo; repository, including cloud-specific container registries like <a href="https://cloud.google.com/container-registry">Google Container Registry</a> (GCR) or <a href="https://aws.amazon.com/ecr/">Amazon Elastic Container Registry</a> (ECR). Make sure your registry [...] 
+</ul> +<blockquote> +<p><strong>NOTE</strong>: On Nov 20, 2020, Docker Hub put <a href="https://www.docker.com/increase-rate-limits">rate limits</a> into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times.</p> +</blockquote> +<p>For optimal user experience, we also recommend you use the latest released version of Beam.</p> +<h3 id="building-and-pushing-custom-containers">Building and pushing custom containers</h3> +<p>Beam <a href="https://hub.docker.com/search?q=apache%2Fbeam&amp;type=image">SDK container images</a> are built from Dockerfiles checked into the <a href="https://github.com/apache/beam">Github</a> repository and published to Docker Hub for every release. You can build customized containers in one of two ways:</p> <ol> -<li>Pull a <a href="https://hub.docker.com/search?q=apache%2Fbeam&amp;type=image">prebuilt SDK container image</a> for your <a href="https://docs.docker.com/docker-hub/repos/#searching-for-repositories">target</a> language and version. The following example pulls the latest Python SDK:</li> +<li><strong><a href="#writing-new-dockerfiles">Writing a new</a> Dockerfile based on a released container image</strong>. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.</li> +<li><strong><a href="#modifying-dockerfiles">Modifying</a> a source Dockerfile in <a href="https://github.com/apache/beam">Beam</a></strong>. 
This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).</li> </ol> -<pre><code>docker pull apache/beam_python3.7_sdk -</code></pre><ol start="2"> -<li><a href="https://docs.docker.com/develop/develop-images/dockerfile_best-practices/">Write a new Dockerfile</a> that <a href="https://docs.docker.com/engine/reference/builder/#from">designates</a> the original as its <a href="https://docs.docker.com/glossary/?term=parent%20image">parent</a>.</li> -<li><a href="#building-container-images">Build</a> a child image.</li> +<h4 id="writing-new-dockerfiles">Writing a new Dockerfile based on an existing published container image</h4> +<ol> +<li>Create a new Dockerfile that designates a base image using the <a href="https://docs.docker.com/engine/reference/builder/#from">FROM instruction</a>.</li> </ol> -<h3 id="modifying-dockerfiles">Modifying the original Dockerfile</h3> +<pre><code>FROM apache/beam_python3.7_sdk:2.25.0 +ENV FOO=bar +COPY /src/path/to/file /dest/path/to/file/ +</code></pre><p>This <code>Dockerfile</code> uses the prebuilt Python 3.7 SDK container image <a href="https://hub.docker.com/r/apache/beam_python3.7_sdk"><code>beam_python3.7_sdk</code></a> tagged at (SDK version) <code>2.25.0</code>, and adds an additional environment variable and file to the image.</p> +<ol start="2"> +<li><a href="https://docs.docker.com/engine/reference/commandline/build/">Build</a> and <a href="https://docs.docker.com/engine/reference/commandline/push/">push</a> the image using Docker.</li> +</ol> +<pre><code>export BASE_IMAGE=&quot;apache/beam_python3.7_sdk:2.25.0&quot; +export IMAGE_NAME=&quot;myremoterepo/mybeamsdk&quot; +export TAG=&quot;latest&quot; +# Optional - pull the base image into your local Docker daemon to ensure +# you have the most up-to-date version of the base image locally. 
+docker pull &quot;${BASE_IMAGE}&quot; +docker build -f Dockerfile -t &quot;${IMAGE_NAME}:${TAG}&quot; . +</code></pre><ol start="3"> +<li>If your runner is running remotely, retag and <a href="https://docs.docker.com/engine/reference/commandline/push/">push</a> the image to the appropriate repository.</li> +</ol> +<pre><code>docker push &quot;${IMAGE_NAME}:${TAG}&quot; +</code></pre><ol start="4"> +<li>After pushing a container image, verify the remote image ID and digest matches the local image ID and digest, output from <code>docker build</code> or <code>docker images</code>.</li> +</ol> +<h4 id="modifying-dockerfiles">Modifying a source Dockerfile in Beam</h4> +<p>This method requires building image artifacts from Beam source. For additional instructions on setting up your development environment, see the <a href="/contribute/#development-setup">Contribution guide</a>.</p> +<blockquote> +<p><strong>NOTE</strong>: It is recommended that you start from a stable release branch (<code>release-X.XX.X</code>) corresponding to the same version of the SDK to run your pipeline. Differences in SDK version may result in unexpected errors.</p> +</blockquote> <ol> -<li>Clone the <code>beam</code> repository:</li> +<li>Clone the <code>beam</code> repository.</li> </ol> -<pre><code>git clone https://github.com/apache/beam.git +<pre><code>export BEAM_SDK_VERSION=&quot;2.26.0&quot; +git clone https://github.com/apache/beam.git +cd beam +# Save current directory as working directory +export BEAM_WORKDIR=$PWD +git checkout origin/release-$BEAM_SDK_VERSION </code></pre><ol start="2"> -<li>Customize the <a href="https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile">Dockerfile</a>. 
If you&rsquo;re adding dependencies from <a href="https://pypi.org/">PyPI</a>, use <a href="https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt"><code>base_image_requirements.txt</code></a> instead.</li> -<li><a href="#building-container-images">Reimage</a> the container.</li> +<li> +<p>Customize the <code>Dockerfile</code> for a given language, typically <code>sdks/&lt;language&gt;/container/Dockerfile</code> directory (e.g. the <a href="https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile">Dockerfile for Python</a>. If you&rsquo;re adding dependencies from <a href="https://pypi.org/">PyPI</a>, use <a href="https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt">&l [...] +</li> +<li> +<p>Return to the root Beam directory and run the Gradle <code>docker</code> target for your image.</p> +</li> +</ol> +<pre><code>cd $BEAM_WORKDIR +# The default repository of each SDK +./gradlew :sdks:java:container:java8:docker +./gradlew :sdks:java:container:java11:docker +./gradlew :sdks:go:container:docker +./gradlew :sdks:python:container:py36:docker +./gradlew :sdks:python:container:py37:docker +./gradlew :sdks:python:container:py38:docker +# Shortcut for building all Python SDKs +./gradlew :sdks:python:container buildAll +</code></pre><ol start="4"> +<li>Verify the images you built were created by running <code>docker images</code>.</li> +</ol> +<pre><code>$&gt; docker images --digests +REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE +apache/beam_java8_sdk latest sha256:... ... 1 min ago ... +apache/beam_java11_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.6_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.7_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.8_sdk latest sha256:... ... 1 min ago ... +apache/beam_go_sdk latest sha256:... ... 1 min ago ... 
+</code></pre><ol start="5"> +<li>If your runner is running remotely, retag the image and <a href="https://docs.docker.com/engine/reference/commandline/push/">push</a> the image to your repository. You can skip this step if you provide a custom repo/tag as <a href="#additional-build-parameters">additional parameters</a>.</li> </ol> -<h3 id="testing-customized-images">Testing customized images</h3> -<p>To test a customized image locally, run a pipeline with PortableRunner and set the <code>--environment_config</code> flag to the image path:</p> +<pre><code>export BEAM_SDK_VERSION=&quot;2.26.0&quot; +export IMAGE_NAME=&quot;gcr.io/my-gcp-project/beam_python3.7_sdk&quot; +export TAG=&quot;${BEAM_SDK_VERSION}-custom&quot; +docker tag apache/beam_python3.7_sdk &quot;${IMAGE_NAME}:${TAG}&quot; +docker push &quot;${IMAGE_NAME}:${TAG}&quot; +</code></pre><ol start="6"> +<li>After pushing a container image, verify that the remote image ID and digest match the local image ID and digest shown by <code>docker images --digests</code>.</li> +</ol> +<h4 id="additional-build-parameters">Additional build parameters</h4> +<p>The docker Gradle task defines a default image repository and <a href="https://docs.docker.com/engine/reference/commandline/tag/">tag</a>; the default tag is the SDK version defined at <a href="https://github.com/apache/beam/blob/master/gradle.properties">gradle.properties</a>. The default repository is the Docker Hub <code>apache</code> namespace, and the default tag is the <a href="https://github.com/apache/beam/blob/master/gradle.properties">SDK version</a> defined at gra [...] +<p>You can specify a different repository or tag for built images by providing parameters to the build task. 
For example:</p> +<pre><code>./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=&quot;example-repo&quot; -Pdocker-tag=&quot;2.26.0-custom&quot; +</code></pre><p>builds the Python 3.6 container and tags it as <code>example-repo/beam_python3.6_sdk:2.26.0-custom</code>.</p> +<p>Beam 2.21.0 and later provide a <code>docker-pull-licenses</code> flag that adds licenses/notices for third-party dependencies to the Docker images. For example:</p> +<pre><code>./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses +</code></pre><p>creates a Java 8 SDK image with appropriate licenses in <code>/opt/apache/beam/third_party_licenses/</code>.</p> +<p>By default, no licenses/notices are added to the Docker images.</p> +<h2 id="running-pipelines">Running pipelines with custom container images</h2> +<p>The common method for providing a container image is the +PortableRunner flag <code>--environment_config</code>, which is supported by the Portable +Runner and by runners that support PortableRunner flags. 
+Other runners, such as Dataflow, support specifying containers with different flags.</p> <div class=runner-direct> -<pre><code>python -m apache_beam.examples.wordcount \ +<pre><code>export IMAGE=&#34;my-repo/beam_python_sdk_custom&#34; +export TAG=&#34;X.Y.Z&#34; +export IMAGE_URL=&#34;${IMAGE}:${TAG}&#34; +python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output /path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=embed \ ---environment_config=path/to/container/image</code></pre> +--environment_type=&#34;DOCKER&#34; \ +--environment_config=&#34;${IMAGE_URL}&#34;</code></pre> </div> <div class=runner-flink-local> -<pre><code># Start a Flink job server on localhost:8099 -./gradlew :runners:flink:1.8:job-server:runShadow -# Run a pipeline on the Flink job server +<pre><code>export IMAGE=&#34;my-repo/beam_python_sdk_custom&#34; +export TAG=&#34;X.Y.Z&#34; +export IMAGE_URL=&#34;${IMAGE}:${TAG}&#34; +# Run a pipeline using the FlinkRunner which starts a Flink job server. 
python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ ---output=/path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image</code></pre> +--output=path/to/write/counts \ +--runner=FlinkRunner \ +--environment_type=&#34;DOCKER&#34; \ +--environment_config=&#34;${IMAGE_URL}&#34;</code></pre> </div> <div class=runner-spark-local> -<pre><code># Start a Spark job server on localhost:8099 -./gradlew :runners:spark:job-server:runShadow -# Run a pipeline on the Spark job server +<pre><code>export IMAGE=&#34;my-repo/beam_python_sdk_custom&#34; +export TAG=&#34;X.Y.Z&#34; +export IMAGE_URL=&#34;${IMAGE}:${TAG}&#34; +# Run a pipeline using the SparkRunner which starts the Spark job server python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output=path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image</code></pre> -</div> -<h2 id="building-container-images">Building container images</h2> -<p>To build Beam SDK container images:</p> -<ol> -<li>Navigate to the root directory of the local copy of your Apache Beam.</li> -<li>Run Gradle with the <code>docker</code> target. If you&rsquo;re <a href="#writing-new-dockerfiles">building a child image</a>, set the optional <code>--file</code> flag to the new Dockerfile. 
If you&rsquo;re <a href="#modifying-dockerfiles">building an image from an original Dockerfile</a>, ignore the <code>--file</code> flag:</li> -</ol> -<pre><code># The default repository of each SDK -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java8:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java11:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:go:container:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py2:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py35:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py36:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py37:docker -# Shortcut for building all four Python SDKs -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container buildAll -</code></pre><p>From 2.21.0, <code>docker-pull-licenses</code> tag was introduced. Licenses/notices of third party dependencies will be added to the docker images when <code>docker-pull-licenses</code> was set. -For example, <code>./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses</code>. The files are added to <code>/opt/apache/beam/third_party_licenses/</code>. -By default, no licenses/notices are added to the docker images.</p> -<p>To examine the containers that you built, run <code>docker images</code> from anywhere in the command line. If you successfully built all of the container images, the command prints a table like the following:</p> -<pre><code>REPOSITORY TAG IMAGE ID CREATED SIZE -apache/beam_java8_sdk latest ... 2 weeks ago ... -apache/beam_java11_sdk latest ... 2 weeks ago ... -apache/beam_python2.7_sdk latest ... 2 weeks ago ... -apache/beam_python3.5_sdk latest ... 2 weeks ago ... -apache/beam_python3.6_sdk latest ... 2 weeks ago ... -apache/beam_python3.7_sdk latest ... 2 weeks ago ... -apache/beam_go_sdk latest ... 2 weeks ago ... 
-</code></pre><h3 id="overriding-default-docker-targets">Overriding default Docker targets</h3> -<p>The default <a href="https://docs.docker.com/engine/reference/commandline/tag/">tag</a> is sdk_version defined at <a href="https://github.com/apache/beam/blob/master/gradle.properties">gradle.properties</a> and the default repositories are in the Docker Hub <code>apache</code> namespace. -The <code>docker</code> command-line tool implicitly <a href="#pushing-container-images">pushes container images</a> to this location.</p> -<p>To tag a local image, set the <code>docker-tag</code> option when building the container. The following command tags a Python SDK image with a date.</p> -<pre><code>./gradlew :sdks:python:container:py36:docker -Pdocker-tag=2019-10-04 -</code></pre><p>To change the repository, set the <code>docker-repository-root</code> option to a new location. The following command sets the <code>docker-repository-root</code> -to a repository named <code>example-repo</code> on Docker Hub.</p> -<pre><code>./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=example-repo -</code></pre><h2 id="pushing-container-images">Pushing container images</h2> -<p>After <a href="#building-container-images">building a container image</a>, you can store it in a remote Docker repository.</p> -<p>The following steps push a Python3.6 SDK image to the <a href="#overriding-default-docker-targets"><code>docker-root-repository</code> value</a>. 
-Please log in to the destination repository as needed.</p> -<p>Upload it to the remote repository:</p> -<pre><code>docker push example-repo/beam_python3.6_sdk -</code></pre><p>To download the image again, run <code>docker pull</code>:</p> -<pre><code>docker pull example-repo/beam_python3.6_sdk -</code></pre><blockquote> -<p><strong>Note</strong>: After pushing a container image, the remote image ID and digest match the local image ID and digest.</p> -</blockquote></description></item><item><title>Documentation: Count</title><link>/documentation/transforms/java/aggregation/count/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/documentation/transforms/java/aggregation/count/</guid><description> +--runner=SparkRunner \ +--environment_type=&#34;DOCKER&#34; \ +--environment_config=&#34;${IMAGE_URL}&#34;</code></pre> +</div> +<div class=runner-dataflow> +<pre><code>export GCS_PATH=&#34;gs://my-gcs-bucket&#34; +export GCP_PROJECT=&#34;my-gcp-project&#34; +export REGION=&#34;us-central1&#34; +# By default, the Dataflow runner has access to the GCR images +# under the same project. +export IMAGE=&#34;my-repo/beam_python_sdk_custom&#34; +export TAG=&#34;X.Y.Z&#34; +export IMAGE_URL=&#34;${IMAGE}:${TAG}&#34; +# Run a pipeline on Dataflow. 
+# This is a Python batch pipeline, so to run on Dataflow Runner V2 +# you must specify the experiment &#34;use_runner_v2&#34; +python -m apache_beam.examples.wordcount \ +--input gs://dataflow-samples/shakespeare/kinglear.txt \ +--output &#34;${GCS_PATH}/counts&#34; \ +--runner DataflowRunner \ +--project $GCP_PROJECT \ +--region $REGION \ +--temp_location &#34;${GCS_PATH}/tmp/&#34; \ +--experiment=use_runner_v2 \ +--worker_harness_container_image=$IMAGE_URL</code></pre> +</div> +<h3 id="troubleshooting">Troubleshooting</h3> +<p>The following section describes some common issues to consider +when you encounter unexpected errors running Beam pipelines with +custom containers.</p> +<ul> +<li>Differences in language and SDK version between the container SDK and +pipeline SDK may result in unexpected errors due to incompatibility. For best +results, make sure to use the same stable SDK version for your base container +and when running your pipeline.</li> +<li>If you are running into unexpected errors when using remote containers, +make sure that your container exists in the remote repository and can be +accessed by any third-party service, if needed.</li> +<li>Local runners attempt to pull remote images and default to local +images. If an image cannot be pulled locally (by the Docker daemon), +you may see a log message like: +<pre><code>Error response from daemon: manifest for remote.repo/beam_python3.7_sdk:2.25.0-custom not found: manifest unknown: ... +INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Unable to pull image... +</code></pre></li> +</ul></description></item><item><title>Documentation: Count</title><link>/documentation/transforms/java/aggregation/count/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>/documentation/transforms/java/aggregation/count/</guid><description> <!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
@@ -12470,10 +12557,10 @@ allows you to directly access tables in BigQuery storage, and supports features such as column selection and predicate filter push-down which can allow more efficient pipeline execution.</p> <p>The Beam SDK for Java supports using the BigQuery Storage API when reading from -BigQuery. SDK versions before 2.24.0 support the BigQuery Storage API as an +BigQuery. SDK versions before 2.25.0 support the BigQuery Storage API as an <a href="https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/annotations/Experimental.html">experimental feature</a> and use the pre-GA BigQuery Storage API surface. Callers should migrate -pipelines which use the BigQuery Storage API to use SDK version 2.24.0 or later.</p> +pipelines which use the BigQuery Storage API to use SDK version 2.25.0 or later.</p> <p>The Beam SDK for Python does not support the BigQuery Storage API. See <a href="https://issues.apache.org/jira/browse/BEAM-10917">BEAM-10917</a>).</p> <h4 id="updating-your-code">Updating your code</h4> diff --git a/website/generated-content/documentation/io/built-in/google-bigquery/index.html b/website/generated-content/documentation/io/built-in/google-bigquery/index.html index 7a15a54..6b6b767 100644 --- a/website/generated-content/documentation/io/built-in/google-bigquery/index.html +++ b/website/generated-content/documentation/io/built-in/google-bigquery/index.html @@ -247,10 +247,10 @@ in the following example:</p><div class=language-java><div class=highlight><pre allows you to directly access tables in BigQuery storage, and supports features such as column selection and predicate filter push-down which can allow more efficient pipeline execution.</p><p>The Beam SDK for Java supports using the BigQuery Storage API when reading from -BigQuery. SDK versions before 2.24.0 support the BigQuery Storage API as an +BigQuery. 
SDK versions before 2.25.0 support the BigQuery Storage API as an <a href=https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/annotations/Experimental.html>experimental feature</a> and use the pre-GA BigQuery Storage API surface. Callers should migrate -pipelines which use the BigQuery Storage API to use SDK version 2.24.0 or later.</p><p>The Beam SDK for Python does not support the BigQuery Storage API. See +pipelines which use the BigQuery Storage API to use SDK version 2.25.0 or later.</p><p>The Beam SDK for Python does not support the BigQuery Storage API. See <a href=https://issues.apache.org/jira/browse/BEAM-10917>BEAM-10917</a>).</p><h4 id=updating-your-code>Updating your code</h4><p>Use the following methods when you read from a table:</p><ul><li>Required: Specify <a href=https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.TypedRead.html#withMethod-org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method->withMethod(Method.DIRECT_READ)</a> to use the BigQuery Storage API for the read opera [...] example</a>. 
When the example’s read method option is set to <code>DIRECT_READ</code>, the pipeline uses diff --git a/website/generated-content/documentation/runtime/environments/index.html b/website/generated-content/documentation/runtime/environments/index.html index 7dc909d..3f35f3f 100644 --- a/website/generated-content/documentation/runtime/environments/index.html +++ b/website/generated-content/documentation/runtime/environments/index.html @@ -1,60 +1,123 @@ <!doctype html><html lang=en class=no-js><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1"><title>Container environments</title><meta name=description content="Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain [...] <span class=sr-only>Toggle navigation</span> <span class=icon-bar></span><span class=icon-bar></span><span class=icon-bar></span></button> -<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] 
-</code></pre><ol start=2><li><a href=https://docs.docker.com/develop/develop-images/dockerfile_best-practices/>Write a new Dockerfile</a> that <a href=https://docs.docker.com/engine/reference/builder/#from>designates</a> the original as its <a href="https://docs.docker.com/glossary/?term=parent%20image">parent</a>.</li><li><a href=#building-container-images>Build</a> a child image.</li></ol><h3 id=modifying-dockerfiles>Modifying the original Dockerfile</h3><ol><li>Clone the <code>beam</c [...] -</code></pre><ol start=2><li>Customize the <a href=https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile>Dockerfile</a>. If you’re adding dependencies from <a href=https://pypi.org/>PyPI</a>, use <a href=https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt><code>base_image_requirements.txt</code></a> instead.</li><li><a href=#building-container-images>Reimage</a> the container.</li></ol><h3 id=testing-customized-images>T [...] +<a href=/ class=navbar-brand><img alt=Brand style=height:25px src=/images/beam_logo_navbar.png></a></div><div class="navbar-mask closed"></div><div id=navbar class="navbar-container closed"><ul class="nav navbar-nav"><li><a href=/get-started/beam-overview/>Get Started</a></li><li><a href=/documentation/>Documentation</a></li><li><a href=/documentation/sdks/java/>Languages</a></li><li><a href=/documentation/runners/capability-matrix/>RUNNERS</a></li><li><a href=/roadmap/>Roadmap</a></li>< [...] 
+ +ENV FOO=bar +COPY /src/path/to/file /dest/path/to/file/ +</code></pre><p>This <code>Dockerfile</code> uses the prebuilt Python 3.7 SDK container image <a href=https://hub.docker.com/r/apache/beam_python3.7_sdk><code>beam_python3.7_sdk</code></a> tagged at (SDK version) <code>2.25.0</code>, and adds an additional environment variable and file to the image.</p><ol start=2><li><a href=https://docs.docker.com/engine/reference/commandline/build/>Build</a> and <a href=https://docs.docker.com/engine/reference/commandline/push/>push</a> the image usin [...] +export IMAGE_NAME="myremoterepo/mybeamsdk" +export TAG="latest" + +# Optional - pull the base image into your local Docker daemon to ensure +# you have the most up-to-date version of the base image locally. +docker pull "${BASE_IMAGE}" + +docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . +</code></pre><ol start=3><li>If your runner is running remotely, retag and <a href=https://docs.docker.com/engine/reference/commandline/push/>push</a> the image to the appropriate repository.</li></ol><pre><code>docker push "${IMAGE_NAME}:${TAG}" +</code></pre><ol start=4><li>After pushing a container image, verify the remote image ID and digest matches the local image ID and digest, output from <code>docker build</code> or <code>docker images</code>.</li></ol><h4 id=modifying-dockerfiles>Modifying a source Dockerfile in Beam</h4><p>This method requires building image artifacts from Beam source. For additional instructions on setting up your development environment, see the <a href=/contribute/#development-setup>Contribution guide [...] +git clone https://github.com/apache/beam.git +cd beam + +# Save current directory as working directory +export BEAM_WORKDIR=$PWD + +git checkout origin/release-$BEAM_SDK_VERSION +</code></pre><ol start=2><li><p>Customize the <code>Dockerfile</code> for a given language, typically <code>sdks/<language>/container/Dockerfile</code> directory (e.g. 
the <a href=https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile>Dockerfile for Python</a>. If you’re adding dependencies from <a href=https://pypi.org/>PyPI</a>, use <a href=https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt><code>base_image_require [...] + +# The default repository of each SDK +./gradlew :sdks:java:container:java8:docker +./gradlew :sdks:java:container:java11:docker +./gradlew :sdks:go:container:docker +./gradlew :sdks:python:container:py36:docker +./gradlew :sdks:python:container:py37:docker +./gradlew :sdks:python:container:py38:docker + +# Shortcut for building all Python SDKs +./gradlew :sdks:python:container buildAll +</code></pre><ol start=4><li>Verify the images you built were created by running <code>docker images</code>.</li></ol><pre><code>$> docker images --digests +REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE +apache/beam_java8_sdk latest sha256:... ... 1 min ago ... +apache/beam_java11_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.6_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.7_sdk latest sha256:... ... 1 min ago ... +apache/beam_python3.8_sdk latest sha256:... ... 1 min ago ... +apache/beam_go_sdk latest sha256:... ... 1 min ago ... +</code></pre><ol start=5><li>If your runner is running remotely, retag the image and <a href=https://docs.docker.com/engine/reference/commandline/push/>push</a> the image to your repository. 
You can skip this step if you provide a custom repo/tag as <a href=#additional-build-parameters>additional parameters</a>.</li></ol><pre><code>export BEAM_SDK_VERSION="2.26.0" +export IMAGE_NAME="gcr.io/my-gcp-project/beam_python3.7_sdk" +export TAG="${BEAM_SDK_VERSION}-custom" + +docker tag apache/beam_python3.7_sdk "${IMAGE_NAME}:${TAG}" +docker push "${IMAGE_NAME}:${TAG}" +</code></pre><ol start=6><li>After pushing a container image, verify that the remote image ID and digest match the local image ID and digest output from <code>docker images --digests</code>.</li></ol><h4 id=additional-build-parameters>Additional build parameters</h4><p>The docker Gradle task defines a default image repository and <a href=https://docs.docker.com/engine/reference/commandline/tag/>tag</a> is the SDK version defined at <a href=https://github.com/apache/beam/blob/master/gradle.p [...] +</code></pre><p>builds the Python 3.6 container and tags it as <code>example-repo/beam_python3.6_sdk:2.26.0-custom</code>.</p><p>Beam 2.21.0 introduced a <code>docker-pull-licenses</code> flag that adds licenses/notices for third-party dependencies to the docker images. For example:</p><pre><code>./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses +</code></pre><p>creates a Java 8 SDK image with appropriate licenses in <code>/opt/apache/beam/third_party_licenses/</code>.</p><p>By default, no licenses/notices are added to the docker images.</p><h2 id=running-pipelines>Running pipelines with custom container images</h2><p>The common method for providing a container image requires using the +PortableRunner flag <code>--environment_config</code> as supported by the Portable +Runner or by runners that support PortableRunner flags. 
+Other runners, such as Dataflow, support specifying containers with different flags.</p><div class=runner-direct><pre><code>export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" +export IMAGE_URL="${IMAGE}:${TAG}" + +python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output /path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=embed \ ---environment_config=path/to/container/image</code></pre></div><div class=runner-flink-local><pre><code># Start a Flink job server on localhost:8099 -./gradlew :runners:flink:1.8:job-server:runShadow +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}"</code></pre></div><div class=runner-flink-local><pre><code>export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" +export IMAGE_URL="${IMAGE}:${TAG}" -# Run a pipeline on the Flink job server +# Run a pipeline using the FlinkRunner which starts a Flink job server. python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ ---output=/path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image</code></pre></div><div class=runner-spark-local><pre><code># Start a Spark job server on localhost:8099 -./gradlew :runners:spark:job-server:runShadow +--output=path/to/write/counts \ +--runner=FlinkRunner \ +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}"</code></pre></div><div class=runner-spark-local><pre><code>export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" +export IMAGE_URL="${IMAGE}:${TAG}" -# Run a pipeline on the Spark job server +# Run a pipeline using the SparkRunner which starts the Spark job server python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output=path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image</code></pre></div><h2 id=building-container-images>Building container images</h2><p>To 
build Beam SDK container images:</p><ol><li>Navigate to the root directory of the local copy of your Apache Beam.</li><li>Run Gradle with the <code>docker</code> target. If you’re <a href=#writing-new-dockerfiles>building a child image</a>, set the optional <code>--file</code> flag to the new Dockerfile. If you’re <a href=#modifying-dockerfiles>b [...] -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java8:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java11:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:go:container:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py2:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py35:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py36:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py37:docker - -# Shortcut for building all four Python SDKs -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container buildAll -</code></pre><p>From 2.21.0, <code>docker-pull-licenses</code> tag was introduced. Licenses/notices of third party dependencies will be added to the docker images when <code>docker-pull-licenses</code> was set. -For example, <code>./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses</code>. The files are added to <code>/opt/apache/beam/third_party_licenses/</code>. -By default, no licenses/notices are added to the docker images.</p><p>To examine the containers that you built, run <code>docker images</code> from anywhere in the command line. If you successfully built all of the container images, the command prints a table like the following:</p><pre><code>REPOSITORY TAG IMAGE ID CREATED SIZE -apache/beam_java8_sdk latest ... 2 weeks ago ... -apache/beam_java11_sdk latest ... 2 weeks ago ... -apache/beam_python2.7_sdk latest ... 2 weeks ago ... -apache/beam_python3.5_sdk latest ... 2 weeks ago ... -apache/beam_python3.6_sdk latest ... 2 weeks ago ... 
-apache/beam_python3.7_sdk latest ... 2 weeks ago ... -apache/beam_go_sdk latest ... 2 weeks ago ... -</code></pre><h3 id=overriding-default-docker-targets>Overriding default Docker targets</h3><p>The default <a href=https://docs.docker.com/engine/reference/commandline/tag/>tag</a> is sdk_version defined at <a href=https://github.com/apache/beam/blob/master/gradle.properties>gradle.properties</a> and the default repositories are in the Docker Hub <code>apache</code> namespace. -The <code>docker</code> command-line tool implicitly <a href=#pushing-container-images>pushes container images</a> to this location.</p><p>To tag a local image, set the <code>docker-tag</code> option when building the container. The following command tags a Python SDK image with a date.</p><pre><code>./gradlew :sdks:python:container:py36:docker -Pdocker-tag=2019-10-04 -</code></pre><p>To change the repository, set the <code>docker-repository-root</code> option to a new location. The following command sets the <code>docker-repository-root</code> -to a repository named <code>example-repo</code> on Docker Hub.</p><pre><code>./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=example-repo -</code></pre><h2 id=pushing-container-images>Pushing container images</h2><p>After <a href=#building-container-images>building a container image</a>, you can store it in a remote Docker repository.</p><p>The following steps push a Python3.6 SDK image to the <a href=#overriding-default-docker-targets><code>docker-root-repository</code> value</a>. 
-Please log in to the destination repository as needed.</p><p>Upload it to the remote repository:</p><pre><code>docker push example-repo/beam_python3.6_sdk -</code></pre><p>To download the image again, run <code>docker pull</code>:</p><pre><code>docker pull example-repo/beam_python3.6_sdk -</code></pre><blockquote><p><strong>Note</strong>: After pushing a container image, the remote image ID and digest match the local image ID and digest.</p></blockquote></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=footer__cols__col><div class=footer__cols__col__logo><img src=/images/beam_logo_circle.svg class=footer__logo alt="Beam logo"></div><div class=footer__cols__col__logo><img src=/images/apache_logo_circle.svg class=footer__logo a [...] +--runner=SparkRunner \ +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}"</code></pre></div><div class=runner-dataflow><pre><code>export GCS_PATH="gs://my-gcs-bucket" +export GCP_PROJECT="my-gcp-project" +export REGION="us-central1" + +# By default, the Dataflow runner has access to the GCR images +# under the same project. +export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" +export IMAGE_URL="${IMAGE}:${TAG}" + +# Run a pipeline on Dataflow. 
+# This is a Python batch pipeline, so to run on Dataflow Runner V2 +# you must specify the experiment "use_runner_v2" + +python -m apache_beam.examples.wordcount \ + --input gs://dataflow-samples/shakespeare/kinglear.txt \ + --output "${GCS_PATH}/counts" \ + --runner DataflowRunner \ + --project "${GCP_PROJECT}" \ + --region "${REGION}" \ + --temp_location "${GCS_PATH}/tmp/" \ + --experiment=use_runner_v2 \ + --worker_harness_container_image="${IMAGE_URL}"</code></pre></div><h3 id=troubleshooting>Troubleshooting</h3><p>The following section describes some common issues to consider +when you encounter unexpected errors running Beam pipelines with +custom containers.</p><ul><li>Differences in language and SDK version between the container SDK and +pipeline SDK may result in unexpected errors due to incompatibility. For best +results, make sure to use the same stable SDK version for your base container +and when running your pipeline.</li><li>If you are running into unexpected errors when using remote containers, +make sure that your container exists in the remote repository and can be +accessed by any third-party service, if needed.</li><li>Local runners attempt to pull remote images and default to local +images. If an image cannot be pulled locally (by the docker daemon), +you may see a log message like:<pre><code>Error response from daemon: manifest for remote.repo/beam_python3.7_sdk:2.25.0-custom not found: manifest unknown: ... +INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Unable to pull image... 
+</code></pre></li></ul></div></div><footer class=footer><div class=footer__contained><div class=footer__cols><div class=footer__cols__col><div class=footer__cols__col__logo><img src=/images/beam_logo_circle.svg class=footer__logo alt="Beam logo"></div><div class=footer__cols__col__logo><img src=/images/apache_logo_circle.svg class=footer__logo alt="Apache logo"></div></div><div class="footer__cols__col footer__cols__col--md"><div class=footer__cols__col__title>Start</div><div class=foote [...] <a href=http://www.apache.org>The Apache Software Foundation</a> | <a href=/privacy_policy>Privacy Policy</a> | <a href=/feed.xml>RSS Feed</a><br><br>Apache Beam, Apache, Beam, the Beam logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. All other products or name brands are trademarks of their respective holders, including The Apache Software Foundation.</div></footer></body></html> \ No newline at end of file diff --git a/website/generated-content/sitemap.xml b/website/generated-content/sitemap.xml index 2d71281..a6a4dab 100644 --- a/website/generated-content/sitemap.xml +++ b/website/generated-content/sitemap.xml @@ -1 +1 @@ -<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/categories/blog/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/blog/splittable-do-fn-is-available/</loc><lastmod>2020-12-01T17:42:26-08:00</lastmod></url [...] 
\ No newline at end of file +<?xml version="1.0" encoding="utf-8" standalone="yes"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>/categories/blog/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/blog/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/categories/</loc><lastmod>2020-12-14T07:36:58-08:00</lastmod></url><url><loc>/blog/splittable-do-fn-is-available/</loc><lastmod>2020-12-01T17:42:26-08:00</lastmod></url [...] \ No newline at end of file