Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)
I figured I'd also share my start on a "port" of dbt to the Beam SDK. This
would be complementary, as it doesn't really provide a way of specifying a
pipeline so much as a way of orchestrating and packaging a complex
pipeline---dbt itself supports SQL and Python Dataframes, both of which seem
like reasonable things for Beam, and it wouldn't be a stretch to include
something like the format above. Though in my head I had imagined people
would tend to write composite transforms in the SDK of their choosing that
are then exposed at this layer. I decided to go with dbt as it also provides
a number of nice "quality of life" features for its users, like
documentation, validation, environments, and so on.

I did a really quick proof-of-viability implementation here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO),
runs it through a couple of SQLTransforms, and then drops it out to a logger
via a simple DoFn here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline

I've also heard a rumor that there might be a textproto-based representation
floating around too :-)

Best,
B

On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org>
wrote:

> Hello Robert,
>
> I'm replying to say that I've been waiting for something like this ever
> since I started learning Beam, and I'm grateful you are pushing this
> forward.
>
> Best,
>
> Damon
>
> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> While Beam provides powerful APIs for authoring sophisticated data
>> processing pipelines, it often still has too high a barrier for
>> getting started and authoring simple pipelines.
Even setting up the
>> environment, installing the dependencies, and setting up the project
>> can be an overwhelming amount of boilerplate for some (though
>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>> way in making this easier). At the other extreme, the Dataflow project
>> has the notion of templates, which are pre-built Beam pipelines that
>> can be easily launched from the command line, or even from your
>> browser, but they are fairly restrictive, limited to pre-assembled
>> pipelines taking a small number of parameters.
>>
>> The idea of creating a yaml-based description of pipelines has come up
>> several times in several contexts, and this last week I decided to code
>> up what it could look like. Here's a proposal.
>>
>> pipeline:
>>   - type: chain
>>     transforms:
>>       - type: ReadFromText
>>         args:
>>           file_pattern: "wordcount.yaml"
>>       - type: PyMap
>>         fn: "str.lower"
>>       - type: PyFlatMap
>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>       - type: PyTransform
>>         name: Count
>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>       - type: PyMap
>>         fn: str
>>       - type: WriteToText
>>         file_path_prefix: "counts.txt"
>>
>> Some more examples at
>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>
>> A prototype (feedback welcome) can be found at
>> https://github.com/apache/beam/pull/24667.
It can be invoked as
>>
>>   python -m apache_beam.yaml.main --pipeline_spec_file \
>>     [path/to/file.yaml] [other_pipeline_args]
>>
>> or
>>
>>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] \
>>     [other_pipeline_args]
>>
>> For example, to play around with this one could do
>>
>>   python -m apache_beam.yaml.main \
>>     --pipeline_spec "$(curl
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>     --runner=apache_beam.runners.render.RenderRunner \
>>     --render_out=out.png
>>
>> Alternatively, one can run it as a docker container with no need to
>> install any SDK:
>>
>>   docker run --rm \
>>     --entrypoint /usr/local/bin/python \
>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>     /dataflow/template/main.py \
>>     --pipeline_spec="$(curl
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>
>> Though of course one would have to set up the appropriate mount points
>> to do any local filesystem IO and/or credentials.
>>
>> This is also available as a Dataflow template and can be invoked as
>>
>>   gcloud dataflow flex-template run \
>>     "yaml-template-job" \
>>     --template-file-gcs-location \
>>       gs://apache-beam-testing-robertwb/yaml_template.json \
>>     --parameters ^~^pipeline_spec="$(curl
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>     --parameters pickle_library=cloudpickle \
>>     --project=apache-beam-testing \
>>     --region us-central1
>>
>> (Note the escaping required for the pipeline_spec parameter (use cat for
>> a local file); the debug cycle here could be greatly improved, so I'd
>> recommend trying things locally first.)
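[Editor's note: as an aid to reading the proposal, the `chain` form above can
be understood as ordinary function composition over the elements of a
collection. Below is a minimal, hypothetical interpreter in plain Python for
a subset of the spec; the transform names and spec structure are taken from
the wordcount example in the thread, but everything else (the `run_chain`
helper, the simplified `Count` entry standing in for the `PyTransform`
constructor form) is illustrative and is not the actual apache_beam.yaml
implementation.]

```python
from collections import Counter

def py_map(fn):
    # "fn" is a Python expression naming a callable, e.g. "str.lower".
    f = eval(fn) if isinstance(fn, str) else fn
    return lambda elements: (f(e) for e in elements)

def py_flat_map(fn):
    # The proposal allows import statements on lines before the lambda,
    # e.g. "import re\nlambda line: re.findall('[a-z]+', line)".
    *imports, expr = fn.split("\n")
    namespace = {}
    for stmt in imports:
        exec(stmt, namespace)
    f = eval(expr, namespace)
    return lambda elements: (out for e in elements for out in f(e))

# Hypothetical registry; "Count" stands in for the PyTransform/constructor
# form in the real example, for brevity.
CONSTRUCTORS = {
    "PyMap": lambda spec: py_map(spec["fn"]),
    "PyFlatMap": lambda spec: py_flat_map(spec["fn"]),
    "Count": lambda spec: (lambda elements: Counter(elements).items()),
}

def run_chain(transforms, elements):
    # A chain simply feeds each transform's output into the next.
    for spec in transforms:
        elements = CONSTRUCTORS[spec["type"]](spec)(elements)
    return list(elements)

chain = [
    {"type": "PyMap", "fn": "str.lower"},
    {"type": "PyFlatMap",
     "fn": "import re\nlambda line: re.findall('[a-z]+', line)"},
    {"type": "Count"},
    {"type": "PyMap", "fn": "str"},
]

print(run_chain(chain, ["The quick brown Fox", "the lazy dog"]))
```

Running this produces stringified (word, count) pairs, mirroring what the
YAML wordcount pipeline would write out via WriteToText.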
>>
>> A key point of this implementation is that it heavily uses the
>> expansion service and cross-language transforms, tying into the
>> proposal at https://s.apache.org/easy-multi-language . Though all the
>> examples use transforms defined in the Beam SDK, any appropriately
>> packaged libraries may be used.
>>
>> There are many ways this could be extended. For example:
>>
>> * It would be useful to be able to templatize yaml descriptions. This
>> could be done with $SIGIL-type notation or some other way. This would
>> even allow one to define reusable, parameterized composite PTransform
>> types in yaml itself.
>>
>> * It would be good to have a more principled way of merging
>> environments. Currently each set of dependencies is a unique Beam
>> environment, and while Beam has sophisticated cross-language
>> capabilities, it would be nice if environments sharing the same
>> language (and likely also the same Beam version) could be fused
>> in-process (e.g. with separate class loaders or compatibility checks
>> for packages).
>>
>> * Publishing and discovery of transformations could be improved,
>> possibly via shared standards and some kind of transform catalog. An
>> ecosystem of easily sharable transforms (similar to what huggingface
>> provides for ML models) could provide a useful platform for making it
>> easy to build pipelines and open up Beam to a whole new set of users.
>>
>> Let me know what you think.
>>
>> - Robert
>>
>
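[Editor's note: one way the $SIGIL-style templatization floated above could
work is plain textual substitution over the spec before it is parsed as
YAML. A sketch using Python's standard-library string.Template, which uses
exactly the $NAME notation mentioned; the spec fragment and parameter names
(input_pattern, output_prefix) are made up for illustration, and this is not
a mechanism the thread settled on.]

```python
from string import Template

# A parameterized pipeline spec with $NAME placeholders, expanded
# before the YAML is handed to the runner.
spec_template = Template("""\
pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "$input_pattern"
      - type: WriteToText
        file_path_prefix: "$output_prefix"
""")

# substitute() raises KeyError on a missing parameter, which doubles as
# a cheap validation step; safe_substitute() would leave unknown
# placeholders untouched instead.
spec = spec_template.substitute(
    input_pattern="gs://my-bucket/input/*.txt",
    output_prefix="gs://my-bucket/output/counts",
)
print(spec)
```

The expanded string could then be passed straight to --pipeline_spec. A
fuller design would also need escaping rules (a literal $ in a spec) and a
way to declare and type the parameters, which is where the reusable
composite PTransform idea above comes in.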