Hello Robert,

I've been waiting for something like this ever since I started learning Beam, and I'm grateful you're pushing this forward.
Best,
Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
> While Beam provides powerful APIs for authoring sophisticated data
> processing pipelines, it often still has too high a barrier for
> getting started and authoring simple pipelines. Even setting up the
> environment, installing the dependencies, and setting up the project
> can be an overwhelming amount of boilerplate for some (though
> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
> way in making this easier). At the other extreme, the Dataflow project
> has the notion of templates which are pre-built Beam pipelines that
> can be easily launched from the command line, or even from your
> browser, but they are fairly restrictive, limited to pre-assembled
> pipelines taking a small number of parameters.
>
> The idea of creating a yaml-based description of pipelines has come up
> several times in several contexts and this last week I decided to code
> up what it could look like. Here's a proposal.
>
> pipeline:
>   - type: chain
>     transforms:
>       - type: ReadFromText
>         args:
>           file_pattern: "wordcount.yaml"
>       - type: PyMap
>         fn: "str.lower"
>       - type: PyFlatMap
>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>       - type: PyTransform
>         name: Count
>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>       - type: PyMap
>         fn: str
>       - type: WriteToText
>         file_path_prefix: "counts.txt"
>
> Some more examples at
> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>
> A prototype (feedback welcome) can be found at
> https://github.com/apache/beam/pull/24667.
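[As a reader's aside: the PyMap/PyFlatMap semantics in the spec above can be mimicked with a tiny stdlib-only interpreter. This is an illustrative sketch only, not the prototype's actual code — `run_chain` and its (type, fn) step tuples are invented here to show how the `fn` strings compose; the real implementation builds genuine Beam transforms.]

```python
import re
from collections import Counter

# Miniature, illustrative interpreter for a "chain" of transforms:
# PyMap applies fn to every element; PyFlatMap expects fn to return
# an iterable, which is flattened into the output.
def run_chain(elements, steps):
    data = list(elements)
    for kind, fn_src in steps:
        fn = eval(fn_src)  # fn strings as in the YAML spec, e.g. "str.lower"
        if kind == "PyMap":
            data = [fn(x) for x in data]
        elif kind == "PyFlatMap":
            data = [y for x in data for y in fn(x)]
        else:
            raise ValueError("unknown transform type: %s" % kind)
    return data

# The word-extraction steps from the wordcount spec, applied to two lines.
words = run_chain(
    ["Hello world", "hello Beam"],
    [("PyMap", "str.lower"),
     ("PyFlatMap", "lambda line: re.findall('[a-z]+', line)")],
)
counts = Counter(words)  # stands in for Count.PerElement
```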
> It can be invoked as
>
>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>
> or
>
>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>
> For example, to play around with this one could do
>
>   python -m apache_beam.yaml.main \
>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>     --runner=apache_beam.runners.render.RenderRunner \
>     --render_out=out.png
>
> Alternatively one can run it as a docker container with no need to
> install any SDK:
>
>   docker run --rm \
>     --entrypoint /usr/local/bin/python \
>     gcr.io/apache-beam-testing/yaml_template:dev \
>     /dataflow/template/main.py \
>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>
> Though of course one would have to set up the appropriate mount points
> to do any local filesystem IO and/or credentials.
>
> This is also available as a Dataflow template and can be invoked as
>
>   gcloud dataflow flex-template run \
>     "yaml-template-job" \
>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>     --parameters pickle_library=cloudpickle \
>     --project=apache-beam-testing \
>     --region us-central1
>
> (Note the escaping required for the parameter (use cat for a local
> file), and the debug cycle here could be greatly improved, so I'd
> recommend trying things locally first.)
>
> A key point of this implementation is that it heavily uses the
> expansion service and cross-language transforms, tying into the
> proposal at https://s.apache.org/easy-multi-language.
> Though all the examples use transforms defined in the Beam SDK, any
> appropriately packaged libraries may be used.
>
> There are many ways this could be extended. For example:
>
> * It would be useful to be able to templatize yaml descriptions. This
>   could be done with $SIGIL-type notation or some other way. This would
>   even allow one to define reusable, parameterized composite PTransform
>   types in yaml itself.
>
> * It would be good to have a more principled way of merging
>   environments. Currently each set of dependencies is a unique Beam
>   environment, and while Beam has sophisticated cross-language
>   capabilities, it would be nice if environments sharing the same
>   language (and likely also the same Beam version) could be fused
>   in-process (e.g. with separate class loaders or compatibility checks
>   for packages).
>
> * Publishing and discovery of transformations could be improved,
>   possibly via shared standards and some kind of a transform catalog. An
>   ecosystem of easily sharable transforms (similar to what Hugging Face
>   provides for ML models) could provide a useful platform for making it
>   easy to build pipelines and open up Beam to a whole new set of users.
>
> Let me know what you think.
>
> - Robert
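[Another reader's aside on the templatization bullet: the $-substitution could be as thin as a pass over the yaml text before it is parsed. A minimal sketch — the `$input`/`$output` parameter names and the use of Python's stdlib `string.Template` are my assumptions here, not part of the proposal:]

```python
import string

# Hypothetical template: $input and $output are invented placeholder
# names, substituted before the yaml is handed to the pipeline builder.
TEMPLATE = """\
pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "$input"
      - type: WriteToText
        file_path_prefix: "$output"
"""

# string.Template gives exactly the $SIGIL-style substitution mentioned
# in the proposal; substitute() raises KeyError on missing parameters.
spec = string.Template(TEMPLATE).substitute(
    input="wordcount.yaml", output="counts.txt")
```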