And I guess also a PR, for completeness, so it's easier to find going forward instead of living in my random repo: https://github.com/apache/beam/pull/24670
On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:

Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, since it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python Dataframes, both of which seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt because it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.

I did a really quick proof-of-viability implementation here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline

I've also heard a rumor that there might be a textproto-based representation floating around too :-)

Best,
B

On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:

Hello Robert,

I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.

Best,

Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way toward making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can easily be launched from the command line, or even from your browser, but they are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.

The idea of creating a YAML-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:

    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "wordcount.yaml"
          - type: PyMap
            fn: "str.lower"
          - type: PyFlatMap
            fn: "import re\nlambda line: re.findall('[a-z]+', line)"
          - type: PyTransform
            name: Count
            constructor: "apache_beam.transforms.combiners.Count.PerElement"
          - type: PyMap
            fn: str
          - type: WriteToText
            file_path_prefix: "counts.txt"

Some more examples at
https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
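For comparison, the wordcount description above corresponds roughly to the following hand-written Python SDK pipeline (a sketch built from the standard apache_beam transforms that the YAML types name; it is not taken from the prototype itself):

    # Rough Python SDK equivalent of the YAML wordcount above (sketch only).
    import re

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.io.ReadFromText("wordcount.yaml")
            | beam.Map(str.lower)
            | beam.FlatMap(lambda line: re.findall('[a-z]+', line))
            | "Count" >> beam.combiners.Count.PerElement()
            | beam.Map(str)
            | beam.io.WriteToText("counts.txt")
        )

The YAML form trades this boilerplate (imports, pipeline construction, packaging) for a declarative description that can be launched without writing any Python.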
A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as

    python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

    python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this one could do

    python -m apache_beam.yaml.main \
      --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
      --runner=apache_beam.runners.render.RenderRunner \
      --render_out=out.png

Alternatively, one can run it as a docker container with no need to install any SDK:

    docker run --rm \
      --entrypoint /usr/local/bin/python \
      gcr.io/apache-beam-testing/yaml_template:dev \
      /dataflow/template/main.py \
      --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

though of course one would have to set up the appropriate mount points and/or credentials to do any local filesystem I/O.

This is also available as a Dataflow template and can be invoked as

    gcloud dataflow flex-template run \
      "yaml-template-job" \
      --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
      --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
      --parameters pickle_library=cloudpickle \
      --project=apache-beam-testing \
      --region us-central1

(Note the escaping required for the pipeline_spec parameter; use cat for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things locally first.)

A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language. Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.

There are many ways this could be extended. For example:

* It would be useful to be able to templatize YAML descriptions. This could be done with $SIGIL-type notation or some other way. It would even allow one to define reusable, parameterized composite PTransform types in YAML itself (a rough illustration follows at the end of this thread).

* It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).

* Publishing and discovery of transforms could be improved, possibly via shared standards and some kind of transform catalog. An ecosystem of easily sharable transforms (similar to what Hugging Face provides for ML models) could provide a useful platform for making it easy to build pipelines and open Beam up to a whole new set of users.

Let me know what you think.

- Robert
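To make the first extension point above concrete: one possible reading of "$SIGIL-type notation" is plain $name placeholders, which the Python standard library's string.Template already supports. The sketch below is only an assumption about how such templating might look (it is not part of the prototype); it substitutes parameters into a description written in the proposal's format and then hands the result to the prototype's apache_beam.yaml.main entry point exactly as in the examples above.

    # Hypothetical $name-placeholder templating of a YAML pipeline description,
    # using only the Python standard library. Not part of the prototype.
    import string
    import subprocess

    PIPELINE_TEMPLATE = """\
    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "$input"
          - type: PyMap
            fn: "str.lower"
          - type: WriteToText
            file_path_prefix: "$output"
    """


    def render_spec(**params):
        # string.Template substitutes $name placeholders, one possible sigil notation.
        return string.Template(PIPELINE_TEMPLATE).substitute(**params)


    if __name__ == "__main__":
        spec = render_spec(input="wordcount.yaml", output="counts.txt")
        # Hand the rendered description to the prototype's CLI, as shown above.
        subprocess.run(
            ["python", "-m", "apache_beam.yaml.main", "--pipeline_spec", spec],
            check=True,
        )

A built-in mechanism would presumably also cover validation and defaults, but even this much is enough to turn a single YAML description into a reusable, parameterized pipeline.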