While Beam provides powerful APIs for authoring sophisticated data processing pipelines, the barrier to getting started and authoring simple pipelines is often still too high. Just setting up the environment, installing the dependencies, and creating the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but these are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.
The idea of creating a YAML-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal.

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "wordcount.yaml"
      - type: PyMap
        fn: "str.lower"
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('[a-z]+', line)"
      - type: PyTransform
        name: Count
        constructor: "apache_beam.transforms.combiners.Count.PerElement"
      - type: PyMap
        fn: str
      - type: WriteToText
        file_path_prefix: "counts.txt"

Some more examples are at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a

A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as

  python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

  python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this one could do

  python -m apache_beam.yaml.main \
    --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
    --runner=apache_beam.runners.render.RenderRunner \
    --render_out=out.png

Alternatively, one can run it as a docker container with no need to install any SDK:

  docker run --rm \
    --entrypoint /usr/local/bin/python \
    gcr.io/apache-beam-testing/yaml_template:dev /dataflow/template/main.py \
    --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

though of course one would have to set up the appropriate mount points and/or credentials to do any local filesystem IO.

This is also available as a Dataflow template and can be invoked as

  gcloud dataflow flex-template run \
    "yaml-template-job" \
    --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
    --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
    --parameters pickle_library=cloudpickle \
    --project=apache-beam-testing \
    --region us-central1

(Note the escaping required for the parameter; use cat instead of curl for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things locally first.)

A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.

There are many ways this could be extended. For example:

* It would be useful to be able to templatize YAML descriptions. This could be done with $SIGIL-style notation or some other way, and would even allow one to define reusable, parameterized composite PTransform types in YAML itself (a sketch follows below).
* It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).
* Publishing and discovery of transforms could be improved, possibly via shared standards and some kind of transform catalog.
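To make the templatization idea above a bit more concrete, here is one sketch of what a parameterized composite defined in YAML itself might look like, using a $ sigil for substitution. To be clear, this is purely hypothetical: the transform_types and parameters keys and the ReadAndLower composite below are not part of the prototype; they only illustrate the kind of reuse such a facility could enable.

# Hypothetical syntax, not part of the prototype: a reusable,
# parameterized composite transform defined in YAML itself.
transform_types:
  ReadAndLower:
    parameters:
      - file_pattern
    transforms:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: $file_pattern
          - type: PyMap
            fn: "str.lower"

pipeline:
  - type: chain
    transforms:
      # The composite would then be usable like any built-in transform.
      - type: ReadAndLower
        file_pattern: "*.txt"
      - type: WriteToText
        file_path_prefix: "lowered"

Alternatively, substitution could be left entirely to an external templating tool (e.g. running the YAML file through envsubst or jinja2 before handing it to apache_beam.yaml.main), which would keep the spec format itself unchanged.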
An ecosystem of easily shareable transforms (similar to what Hugging Face provides for ML models) could provide a useful platform for making it easy to build pipelines and open up Beam to a whole new set of users.

Let me know what you think.

- Robert