This seems a worthwhile addition that could expand the community by making Beam accessible to more users and more use cases.
A bit of a tangent, since this comments on @Byron Ellis <byronel...@google.com>'s part, but in case some haven't seen it: Dataform (https://cloud.google.com/dataform/docs/overview, formerly https://dataform.co/) is now part of the same company as you, so there are potentially some straightforward conversations, lessons learned, etc. to be had there, in addition to collaborations with the dbt community. At times I think of the two (dbt, Dataform) as addressing similar things.

On Thu, Dec 15, 2022 at 4:17 PM Ahmet Altay via dev <dev@beam.apache.org> wrote:

> +1 to both of these proposals. In the past 12 months I have heard of at least 3 YAML implementations built on top of Beam in large production systems. Unfortunately, none of those were open sourced. Having these out of the box would be great, and there is clearly user demand. Thank you all!
>
> On Thu, Dec 15, 2022 at 10:59 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>
>> On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum <sjvanros...@google.com> wrote:
>> >
>> > This is great! I developed a similar template a year or two ago as a reference for a customer to speed up their development process, and unsurprisingly it did speed up their development.
>> > Here's an example of the config layout I came up with at the time:
>> >
>> > options:
>> >   runner: DirectRunner
>> >
>> > pipeline:
>> >   # - &messages
>> >   #   label: PubSub XML source
>> >   #   transform:
>> >   #     !PTransform:apache_beam.io.ReadFromPubSub
>> >   #     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
>> >   - &message_source_1
>> >     label: XML source 1
>> >     transform:
>> >       !PTransform:apache_beam.Create
>> >       values:
>> >         - /path/to/file.xml
>> >   - &message_source_2
>> >     label: XML source 2
>> >     transform:
>> >       !PTransform:apache_beam.Create
>> >       values:
>> >         - /path/to/another/file.xml
>> >   - &message_xml
>> >     label: XMLs
>> >     inputs:
>> >       - step: *message_source_1
>> >       - step: *message_source_2
>> >     transform:
>> >       !PTransform:utils.transforms.ParseXmlDocument {}
>> >   - &validated_messages
>> >     label: Validate XMLs
>> >     inputs:
>> >       - step: *message_xml
>> >         tag: success
>> >     transform:
>> >       !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
>> >       schema: /path/to/file.xsd
>> >   - &converted_messages
>> >     label: Convert XMLs
>> >     inputs:
>> >       - step: *validated_messages
>> >     transform:
>> >       !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
>> >       schema: /path/to/file.xsd
>> >   - label: Print XMLs
>> >     inputs:
>> >       - step: *converted_messages
>> >     transform:
>> >       !PTransform:utils.transforms.Print {}
>> >
>> > Highlights:
>> >
>> > Pipeline options are supplied under an options property.
>>
>> Yep, I was thinking exactly the same:
>>
>> https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51
>>
>> > A pipeline is a flat set of all transforms in the pipeline.
>>
>> One can certainly enumerate the transforms as a flat set, but I do think being able to define a composite structure is nice.
>> In addition, the "chain" composite allows one to automatically infer the input-output relation rather than having to spell it out (much as one can chain multiple transforms in the various SDKs rather than having to assign each result to an intermediate).
>>
>> > Transforms are defined using a YAML tag and named properties and can be used by constructing a YAML reference.
>>
>> That's an interesting idea. Can it be done inline as well?
>>
>> > DAG construction is done using a simple topological sort of transforms and their dependencies.
>>
>> Same.
>>
>> > Named side outputs can be referenced using a tag field.
>>
>> I didn't put this in any of the examples, but I do the same. If a transform Foo produces multiple outputs, one can (in fact must) reference the various outputs by Foo.output1, Foo.output2, etc.
>>
>> > Multiple inputs are merged with a Flatten transform.
>>
>> PTransforms can have named inputs as well (they're not always symmetric), so I let inputs be a map if they care to distinguish them.
>>
>> > Not sure if there's any inspiration left to take from this, but I figured I'd throw it up here to share.
>>
>> Thanks. It's neat to see others coming up with the same idea, with very similar conventions; it validates that this would be both natural and useful.
>>
>> > On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <dev@beam.apache.org> wrote:
>> >>
>> >> +1 for these proposals, and I agree that these will simplify and demystify Beam for many new users. I think when combined with the x-lang/Schema-Aware transform binding, these might end up being adequate solutions for many production use-cases as well (unless users need to define custom composites, I/O connectors, etc.).
>> >>
>> >> Also, thanks for providing prototype implementations with examples.
>> >> - Cham
>> >>
>> >> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <dev@beam.apache.org> wrote:
>> >>>
>> >>> To build on Kenn's point, if we leverage existing stuff like dbt we get access to a ready-made community, which can help drive both adoption and incremental innovation by bringing more folks to Beam.
>> >>>
>> >>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>> >>>>
>> >>>> 1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, gives the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.
>> >>>>
>> >>>> 2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I also don't think we'll gain anything in terms of implementation. But at least it could fast-forward our design process, because we simply don't have to make most of the decisions; they are made for us.
>> >>>>
>> >>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:
>> >>>>>
>> >>>>> And I guess also a PR, for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670
>> >>>>>
>> >>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:
>> >>>>>>
>> >>>>>> Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)
>> >>>>>>
>> >>>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>> >>>>>> This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex pipeline. dbt itself supports SQL and Python Dataframes, which both seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above, though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt as it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.
>> >>>>>>
>> >>>>>> I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>> >>>>>>
>> >>>>>> And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>> >>>>>>
>> >>>>>> I've also heard a rumor there might also be a textproto-based representation floating around too :-)
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> B
>> >>>>>>
>> >>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:
>> >>>>>>>
>> >>>>>>> Hello Robert,
>> >>>>>>>
>> >>>>>>> I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> Damon
>> >>>>>>>
>> >>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
>> >>>>>>>>
>> >>>>>>>> While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but they are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.
>> >>>>>>>>
>> >>>>>>>> The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal.
>> >>>>>>>> pipeline:
>> >>>>>>>>   - type: chain
>> >>>>>>>>     transforms:
>> >>>>>>>>       - type: ReadFromText
>> >>>>>>>>         args:
>> >>>>>>>>           file_pattern: "wordcount.yaml"
>> >>>>>>>>       - type: PyMap
>> >>>>>>>>         fn: "str.lower"
>> >>>>>>>>       - type: PyFlatMap
>> >>>>>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>> >>>>>>>>       - type: PyTransform
>> >>>>>>>>         name: Count
>> >>>>>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>> >>>>>>>>       - type: PyMap
>> >>>>>>>>         fn: str
>> >>>>>>>>       - type: WriteToText
>> >>>>>>>>         file_path_prefix: "counts.txt"
>> >>>>>>>>
>> >>>>>>>> Some more examples at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>> >>>>>>>>
>> >>>>>>>> A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as
>> >>>>>>>>
>> >>>>>>>>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>> >>>>>>>>
>> >>>>>>>> or
>> >>>>>>>>
>> >>>>>>>>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>> >>>>>>>>
>> >>>>>>>> For example, to play around with this one could do
>> >>>>>>>>
>> >>>>>>>>   python -m apache_beam.yaml.main \
>> >>>>>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>> >>>>>>>>     --runner=apache_beam.runners.render.RenderRunner \
>> >>>>>>>>     --render_out=out.png
>> >>>>>>>>
>> >>>>>>>> Alternatively, one can run it as a docker container with no need to install any SDK:
>> >>>>>>>>
>> >>>>>>>>   docker run --rm \
>> >>>>>>>>     --entrypoint /usr/local/bin/python \
>> >>>>>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>> >>>>>>>>     /dataflow/template/main.py \
>> >>>>>>>>     --pipeline_spec="$(curl
>> >>>>>>>>       https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>> >>>>>>>>
>> >>>>>>>> Though of course one would have to set up the appropriate mount points to do any local filesystem io and/or credentials.
>> >>>>>>>>
>> >>>>>>>> This is also available as a Dataflow template and can be invoked as
>> >>>>>>>>
>> >>>>>>>>   gcloud dataflow flex-template run \
>> >>>>>>>>     "yaml-template-job" \
>> >>>>>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>> >>>>>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>> >>>>>>>>     --parameters pickle_library=cloudpickle \
>> >>>>>>>>     --project=apache-beam-testing \
>> >>>>>>>>     --region us-central1
>> >>>>>>>>
>> >>>>>>>> (Note the escaping required for the parameter (use cat for a local file); the debug cycle here could be greatly improved, so I'd recommend trying things locally first.)
>> >>>>>>>>
>> >>>>>>>> A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
>> >>>>>>>>
>> >>>>>>>> There are many ways this could be extended. For example:
>> >>>>>>>>
>> >>>>>>>> * It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-type notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself.
>> >>>>>>>>
>> >>>>>>>> * It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).
>> >>>>>>>>
>> >>>>>>>> * Publishing and discovery of transformations could be improved, possibly via shared standards and some kind of a transform catalog. An ecosystem of easily sharable transforms (similar to what huggingface provides for ML models) could provide a useful platform for making it easy to build pipelines and open up Beam to a whole new set of users.
>> >>>>>>>>
>> >>>>>>>> Let me know what you think.
>> >>>>>>>>
>> >>>>>>>> - Robert
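[Editor's note] Both layouts in the thread describe DAG construction as "a simple topological sort of transforms and their dependencies." The sketch below illustrates that one step in plain Python using the standard library's graphlib; the spec dict and its "pipeline"/"name"/"input" field names are illustrative assumptions standing in for a parsed YAML document, not the actual schema of either prototype.

```python
# Illustrative sketch: ordering a flat list of transforms by declared inputs.
# The dict mirrors what a YAML loader might produce for a small wordcount-like
# spec; field names ("pipeline", "name", "input") are assumptions for this demo.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

spec = {
    "pipeline": [
        {"name": "Write", "type": "WriteToText", "input": "Count"},
        {"name": "Read", "type": "ReadFromText"},
        {"name": "Count", "type": "Count.PerElement", "input": "Read"},
    ]
}

def execution_order(spec):
    """Topologically sort transforms so each one runs after its inputs."""
    # Map each transform name to the set of transform names it consumes.
    graph = {
        t["name"]: {t["input"]} if "input" in t else set()
        for t in spec["pipeline"]
    }
    # TopologicalSorter takes a node -> predecessors mapping; cycles raise.
    return list(TopologicalSorter(graph).static_order())

print(execution_order(spec))  # ['Read', 'Count', 'Write']
```

A "chain" composite, as discussed above, would make the ordering implicit (list order is execution order), so this sorting step only matters for the flat, reference-based layout.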