This is great! I developed a similar template a year or two ago as a reference for a customer looking to speed up their development process, and it did just that. Here's an example of the config layout I came up with at the time:
    options:
      runner: DirectRunner

    pipeline:
      # - &messages
      #   label: PubSub XML source
      #   transform:
      #     !PTransform:apache_beam.io.ReadFromPubSub
      #     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
      - &message_source_1
        label: XML source 1
        transform: !PTransform:apache_beam.Create
          values:
            - /path/to/file.xml
      - &message_source_2
        label: XML source 2
        transform: !PTransform:apache_beam.Create
          values:
            - /path/to/another/file.xml
      - &message_xml
        label: XMLs
        inputs:
          - step: *message_source_1
          - step: *message_source_2
        transform: !PTransform:utils.transforms.ParseXmlDocument {}
      - &validated_messages
        label: Validate XMLs
        inputs:
          - step: *message_xml
            tag: success
        transform: !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
          schema: /path/to/file.xsd
      - &converted_messages
        label: Convert XMLs
        inputs:
          - step: *validated_messages
        transform: !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
          schema: /path/to/file.xsd
      - label: Print XMLs
        inputs:
          - step: *converted_messages
        transform: !PTransform:utils.transforms.Print {}

Highlights:

* Pipeline options are supplied under an "options" property.
* A pipeline is a flat list of all the transforms in the pipeline.
* Transforms are defined using a YAML tag and named properties, and can be referred to elsewhere by constructing a YAML reference.
* DAG construction is done using a simple topological sort of transforms and their dependencies (a rough sketch follows this list).
* Named side outputs can be referenced using a "tag" field.
* Multiple inputs are merged with a Flatten transform.
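For concreteness, the topological sort and Flatten-merge can be sketched in a few lines. This is a from-memory illustration rather than the code actually shipped; it assumes the YAML loader has already turned each !PTransform tag into a PTransform instance and each *reference into shared object identity, it ignores named side-output tags, and it assumes an acyclic config:

    import apache_beam as beam

    def topological_order(steps):
        # Orders steps so that every step appears after all of its inputs.
        ordered, seen = [], set()

        def visit(step):
            if id(step) in seen:
                return
            seen.add(id(step))
            for inp in step.get('inputs', []):
                visit(inp['step'])
            ordered.append(step)

        for step in steps:
            visit(step)
        return ordered

    def build_dag(p, steps):
        # Maps each step (by identity) to the PCollection it produces.
        outputs = {}
        for step in topological_order(steps):
            inputs = [outputs[id(inp['step'])] for inp in step.get('inputs', [])]
            if not inputs:
                # Source transforms hang directly off the pipeline.
                pcoll = p
            elif len(inputs) == 1:
                pcoll = inputs[0]
            else:
                # Multiple inputs are merged with a Flatten transform.
                pcoll = tuple(inputs) | (step['label'] + '/Flatten') >> beam.Flatten()
            outputs[id(step)] = pcoll | step['label'] >> step['transform']
        return outputs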
Not sure if there's any inspiration left to take from this, but I figured I'd throw it up here to share.

Cheers,
Steve

On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <dev@beam.apache.org> wrote:

> +1 for these proposals, and I agree that they will simplify and demystify Beam for many new users. I think that, combined with the x-lang/Schema-Aware transform binding, these might end up being adequate solutions for many production use-cases as well (unless users need to define custom composites, I/O connectors, etc.).
>
> Also, thanks for providing prototype implementations with examples.
>
> - Cham
>
> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <dev@beam.apache.org> wrote:
>
>> To build on Kenn's point: if we leverage existing tooling like dbt, we get access to a ready-made community that can help drive both adoption and incremental innovation by bringing more folks to Beam.
>>
>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> 1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, gives us the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.
>>>
>>> 2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I don't think we'll gain anything in terms of implementation, but at least it could fast-forward our design process, since most of the decisions would already be made for us.
>>>
>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:
>>>
>>>> And I guess also a PR, for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670
>>>>
>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:
>>>>
>>>>> Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)
>>>>>
>>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python Dataframes, both of which seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt because it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.
>>>>>
>>>>> I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>>>
>>>>> And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SqlTransforms, and then drops it out to a logger via a simple DoFn here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>>>>
>>>>> I've also heard a rumor there might be a textproto-based representation floating around too :-)
>>>>>
>>>>> Best,
>>>>> B
>>>>>
>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hello Robert,
>>>>>>
>>>>>> I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing it forward.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Damon
>>>>>>
>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>
>>>>>>> While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way toward making this easier). At the other extreme, the Dataflow project has the notion of templates: pre-built Beam pipelines that can be easily launched from the command line, or even from your browser. But templates are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.
>>>>>>>
>>>>>>> The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal.
>>>>>>>   pipeline:
>>>>>>>     - type: chain
>>>>>>>       transforms:
>>>>>>>         - type: ReadFromText
>>>>>>>           args:
>>>>>>>             file_pattern: "wordcount.yaml"
>>>>>>>         - type: PyMap
>>>>>>>           fn: "str.lower"
>>>>>>>         - type: PyFlatMap
>>>>>>>           fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>>>>         - type: PyTransform
>>>>>>>           name: Count
>>>>>>>           constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>>         - type: PyMap
>>>>>>>           fn: str
>>>>>>>         - type: WriteToText
>>>>>>>           file_path_prefix: "counts.txt"
>>>>>>>
>>>>>>> Some more examples at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>>>>
>>>>>>> A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>>>>
>>>>>>>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>>>>
>>>>>>> For example, to play around with this one could do
>>>>>>>
>>>>>>>   python -m apache_beam.yaml.main \
>>>>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>     --runner=apache_beam.runners.render.RenderRunner \
>>>>>>>     --render_out=out.png
>>>>>>>
>>>>>>> Alternatively, one can run it as a docker container with no need to install any SDK:
>>>>>>>
>>>>>>>   docker run --rm \
>>>>>>>     --entrypoint /usr/local/bin/python \
>>>>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>>>>>>     /dataflow/template/main.py \
>>>>>>>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>>>>
>>>>>>> though of course one would have to set up the appropriate mount points to do any local filesystem IO and/or to pass credentials.
>>>>>>>
>>>>>>> This is also available as a Dataflow template and can be invoked as
>>>>>>>
>>>>>>>   gcloud dataflow flex-template run \
>>>>>>>     "yaml-template-job" \
>>>>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>     --parameters pickle_library=cloudpickle \
>>>>>>>     --project=apache-beam-testing \
>>>>>>>     --region us-central1
>>>>>>>
>>>>>>> (Note the escaping required for the pipeline_spec parameter; use cat for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things out locally first.)
>>>>>>>
>>>>>>> A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
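For comparison, the chain example above corresponds roughly to the following hand-written Python SDK pipeline (this rendering is an editorial aside, not output of the prototype):

    import re

    import apache_beam as beam
    from apache_beam.transforms import combiners

    # The same wordcount chain, written directly against the Python SDK.
    with beam.Pipeline() as p:
        _ = (
            p
            | beam.io.ReadFromText('wordcount.yaml')
            | beam.Map(str.lower)
            | beam.FlatMap(lambda line: re.findall('[a-z]+', line))
            | 'Count' >> combiners.Count.PerElement()
            | beam.Map(str)
            | beam.io.WriteToText('counts.txt'))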
>>>>>>> There are many ways this could be extended. For example:
>>>>>>>
>>>>>>> * It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-type notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself. (A sketch of what this could look like follows at the end of this message.)
>>>>>>>
>>>>>>> * It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).
>>>>>>>
>>>>>>> * Publishing and discovery of transformations could be improved, possibly via shared standards and some kind of transform catalog. An ecosystem of easily sharable transforms (similar to what huggingface provides for ML models) could provide a useful platform for making it easy to build pipelines and open up Beam to a whole new set of users.
>>>>>>>
>>>>>>> Let me know what you think.
>>>>>>>
>>>>>>> - Robert
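As a strawman for the templatization bullet above, here is one possible shape, using Python's string.Template for the $SIGIL substitution. The template, parameter names, and mechanism are purely illustrative assumptions, not part of the prototype:

    import string
    import yaml

    # A hypothetical parameterized pipeline spec; $input and $output are
    # placeholders to be filled in at instantiation time.
    WORDCOUNT_TEMPLATE = """
    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "$input"
          - type: WriteToText
            file_path_prefix: "$output"
    """

    def instantiate(template_text, **params):
        # Substitute the $-parameters first, then parse the result as YAML.
        return yaml.safe_load(string.Template(template_text).substitute(**params))

    spec = instantiate(WORDCOUNT_TEMPLATE, input='wordcount.yaml', output='counts.txt')

The same substitution step could run before the spec is handed to apache_beam.yaml.main, which would keep templating entirely orthogonal to pipeline construction.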