1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, gives us the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.
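For instance, that would let a pipeline description in any language address a transform by a stable URN and pass exactly the parameters it declares. A minimal sketch (the URN, the ExternalTransform spelling, and the parameters are all made up for illustration, not an existing Beam registration):

  - type: ExternalTransform
    urn: "beam:transform:org.example:dedup:v1"   # hypothetical URN
    args:                                        # the explicitly declared parameter list
      key_fields: ["user_id"]
      expansion_service: "localhost:8097"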
2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I don't think we'll gain anything in terms of implementation, but at least it could fast-forward our design process because most of the decisions would already be made for us.

On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:

> And I guess also a PR for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670
>
> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:
>
>> Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)
>>
>> I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python Dataframes, which both seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format Robert proposed. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt because it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.
>>
>> I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>
>> And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here [a rough rendering of a pipeline of this shape in the proposed YAML format appears after the quoted thread]: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>
>> I've also heard a rumor there might be a textproto-based representation floating around too :-)
>>
>> Best,
>> B
>>
>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:
>>
>>> Hello Robert,
>>>
>>> I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.
>>>
>>> Best,
>>>
>>> Damon
>>>
>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but they are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.
>>>>
>>>> The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:
>>>>
>>>> pipeline:
>>>>   - type: chain
>>>>     transforms:
>>>>       - type: ReadFromText
>>>>         args:
>>>>           file_pattern: "wordcount.yaml"
>>>>       - type: PyMap
>>>>         fn: "str.lower"
>>>>       - type: PyFlatMap
>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>       - type: PyTransform
>>>>         name: Count
>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>       - type: PyMap
>>>>         fn: str
>>>>       - type: WriteToText
>>>>         file_path_prefix: "counts.txt"
>>>>
>>>> Some more examples at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>
>>>> A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>
>>>>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>
>>>> or
>>>>
>>>>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>
>>>> For example, to play around with this, one could do
>>>>
>>>>   python -m apache_beam.yaml.main \
>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>     --runner=apache_beam.runners.render.RenderRunner \
>>>>     --render_out=out.png
>>>>
>>>> Alternatively, one can run it as a docker container with no need to install any SDK:
>>>>
>>>>   docker run --rm \
>>>>     --entrypoint /usr/local/bin/python \
>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>>>     /dataflow/template/main.py \
>>>>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>
>>>> though of course one would have to set up the appropriate mount points to do any local filesystem IO and/or provide credentials.
>>>>
>>>> This is also available as a Dataflow template and can be invoked as
>>>>
>>>>   gcloud dataflow flex-template run \
>>>>     "yaml-template-job" \
>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>     --parameters pickle_library=cloudpickle \
>>>>     --project=apache-beam-testing \
>>>>     --region us-central1
>>>>
>>>> (Note the escaping required for the pipeline_spec parameter; use cat instead of curl for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things locally first.)
>>>>
>>>> A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
>>>>
>>>> There are many ways this could be extended. For example:
>>>>
>>>> * It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-style notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself [see the sketch after the quoted thread].
>>>>
>>>> * It would be good to have a more principled way of merging environments.
>>>> Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).
>>>>
>>>> * Publishing and discovery of transformations could be improved, possibly via shared standards and some kind of a transform catalog [again, see the sketches after the thread]. An ecosystem of easily sharable transforms (similar to what huggingface provides for ML models) could provide a useful platform for making it easy to build pipelines and open up Beam to a whole new set of users.
>>>>
>>>> Let me know what you think.
>>>>
>>>> - Robert
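To make Byron's simple_pipeline example concrete, here is a rough sketch of a pipeline of that shape (a seed file read via TextIO, a couple of SQL transforms, then a logging step) rendered in Robert's proposed YAML format. The SqlTransform name, its arguments, and the PyMap stand-in for the logging DoFn are guesses for illustration, not the actual SPD or prototype syntax:

  pipeline:
    - type: chain
      transforms:
        - type: ReadFromText                # the seed file
          args:
            file_pattern: "seeds/users.csv"
        - type: SqlTransform                # hypothetical name and args
          query: "SELECT UPPER(name) AS name FROM PCOLLECTION"
        - type: SqlTransform                # hypothetical
          query: "SELECT DISTINCT name FROM PCOLLECTION"
        - type: PyMap                       # stand-in for the logging DoFn
          fn: "lambda row: print(row) or row"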
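On the templatization bullet, a reusable, parameterized composite PTransform defined in yaml itself might look something like the sketch below; the top-level transforms/parameters blocks and the $-substitution are purely illustrative, not something the prototype implements:

  transforms:                               # hypothetical composite definition
    - name: ReadWords
      parameters: [file_pattern]
      type: chain
      transforms:
        - type: ReadFromText
          args:
            file_pattern: $file_pattern     # substituted at expansion time
        - type: PyFlatMap
          fn: "import re\nlambda line: re.findall('[a-z]+', line)"

  pipeline:
    - type: chain
      transforms:
        - type: ReadWords                   # used like any built-in transform
          args:
            file_pattern: "wordcount.yaml"
        - type: PyTransform
          name: Count
          constructor: "apache_beam.transforms.combiners.Count.PerElement"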
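And on publishing and discovery, a catalog entry might need little more than the transform's URN, a short description, the parameters it requires, and packaging coordinates, e.g. (all field names invented for illustration):

  - urn: "beam:transform:org.example:dedup:v1"
    description: "Removes duplicate rows by key."
    parameters:
      key_fields:
        type: "list of strings"
        required: true
    packaging:
      java: "org.example:dedup-transforms:1.2.0"   # artifact serving the expansion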