While Beam provides powerful APIs for authoring sophisticated data
processing pipelines, it often still has too high a barrier to entry
for getting started and authoring simple pipelines. Even setting up
the environment, installing the dependencies, and laying out the
project can be an overwhelming amount of boilerplate for some (though
https://beam.apache.org/blog/beam-starter-projects/ has gone a long
way in making this easier). At the other extreme, the Dataflow
project has the notion of templates, which are pre-built Beam
pipelines that can be easily launched from the command line, or even
from your browser, but they are fairly restrictive, limited to
pre-assembled pipelines taking a small number of parameters.

The idea of creating a YAML-based description of pipelines has come
up several times in various contexts, and this last week I decided to
code up what it could look like. Here's a proposal.

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "wordcount.yaml"
      - type: PyMap
        fn: "str.lower"
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('[a-z]+', line)"
      - type: PyTransform
        name: Count
        constructor: "apache_beam.transforms.combiners.Count.PerElement"
      - type: PyMap
        fn: str
      - type: WriteToText
        file_path_prefix: "counts.txt"
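
For comparison, here is a rough hand-written Python SDK equivalent of
the above (my approximation for illustration; the prototype may
expand it somewhat differently):

    # Rough Python equivalent of the YAML pipeline above (hand-written
    # for illustration; not the prototype's actual expansion).
    import re

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.io.ReadFromText('wordcount.yaml')
            | beam.Map(str.lower)
            | beam.FlatMap(lambda line: re.findall('[a-z]+', line))
            | 'Count' >> beam.combiners.Count.PerElement()
            | beam.Map(str)
            | beam.io.WriteToText('counts.txt'))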

Some more examples at
https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a

A prototype (feedback welcome) can be found at
https://github.com/apache/beam/pull/24667. It can be invoked as

    python -m apache_beam.yaml.main \
        --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

    python -m apache_beam.yaml.main \
        --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this, one could do

    python -m apache_beam.yaml.main \
        --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
        --runner=apache_beam.runners.render.RenderRunner \
        --render_out=out.png

Alternatively, one can run it as a Docker container with no need to
install any SDK:

    docker run --rm \
        --entrypoint /usr/local/bin/python \
        gcr.io/apache-beam-testing/yaml_template:dev /dataflow/template/main.py \
        --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

Though, of course, one would have to set up the appropriate mount
points to do any local filesystem IO and/or to provide credentials.

This is also available as a Dataflow template and can be invoked as

    gcloud dataflow flex-template run \
        "yaml-template-job" \
        --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
        --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
        --parameters pickle_library=cloudpickle \
        --project=apache-beam-testing \
        --region us-central1

(Note the escaping required for the pipeline_spec parameter; use cat
instead of curl for a local file. The debug cycle here could also be
greatly improved, so I'd recommend trying things out locally first.)

A key point of this implementation is that it heavily uses the
expansion service and cross-language transforms, tying into the
proposal at https://s.apache.org/easy-multi-language. Though all the
examples use transforms defined in the Beam SDK, any appropriately
packaged libraries may be used.
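
To illustrate, referencing such an externally packaged transform from
the Python SDK looks roughly like the sketch below; the URN, the
parameter, and the expansion service address are all hypothetical
placeholders:

    # Hedged sketch: invoking an appropriately packaged transform via
    # the expansion service. URN, payload, and address are placeholders.
    import apache_beam as beam
    from apache_beam.transforms.external import (
        ExternalTransform, ImplicitSchemaPayloadBuilder)

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(['some', 'input'])
            | ExternalTransform(
                'beam:transform:org.example:my_transform:v1',
                ImplicitSchemaPayloadBuilder({'some_param': 'value'}),
                'localhost:8097'))  # a running expansion service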

There are many ways this could be extended. For example:

* It would be useful to be able to templatize YAML descriptions. This
could be done with $SIGIL-style notation or some other way (a sketch
follows this list). This would even allow one to define reusable,
parameterized composite PTransform types in YAML itself.

* It would be good to have a more principled way of merging
environments. Currently each set of dependencies is a unique Beam
environment, and while Beam has sophisticated cross-language
capabilities, it would be nice if environments sharing the same
language (and likely also the same Beam version) could be fused
in-process (e.g. with separate class loaders or compatibility checks
for packages).

* Publishing and discovery of transforms could be improved, possibly
via shared standards and some kind of transform catalog. An ecosystem
of easily sharable transforms (similar to what Hugging Face provides
for ML models) could provide a useful platform for making it easy to
build pipelines and open up Beam to a whole new set of users.
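
As a strawman for the templatization idea above, expansion could be
as simple as substituting $-placeholders in the YAML text before
parsing it. A hypothetical sketch (not part of the prototype):

    # Hypothetical sketch of $SIGIL-style templatization: expand
    # placeholders in the YAML text before handing it to the parser.
    import string

    import yaml

    SPEC_TEMPLATE = '''
    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "$input"
          - type: WriteToText
            file_path_prefix: "$output"
    '''

    spec = yaml.safe_load(
        string.Template(SPEC_TEMPLATE).substitute(
            input='wordcount.yaml', output='counts.txt'))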

Let me know what you think.

- Robert
