I like the idea of a common spec for something like this so we can actually cross-validate all the SDK behaviours. It would make testing significantly easier.
On Wed, Dec 14, 2022, 2:57 PM Kenneth Knowles <k...@apache.org> wrote:

1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, give us the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.

2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I don't think we'll gain anything in terms of implementation, but at least it could fast-forward our design process, because most of the decisions would simply be made for us.

On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:

And I guess also a PR, for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670

On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:

Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python DataFrames, both of which seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt as it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.

I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
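In Robert's format above, the shape of that pipeline might look something like the following (purely illustrative: the SqlTransform and LogElements type names are invented here, and the actual prototype is Java, driven by dbt-style project files rather than a single spec):

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText            # the seed file
        args:
          file_pattern: "seed.csv"
      - type: SqlTransform            # hypothetical name for a SQL stage
        query: "SELECT word FROM PCOLLECTION"
      - type: SqlTransform
        query: "SELECT word, count(*) AS c FROM PCOLLECTION GROUP BY word"
      - type: LogElements             # hypothetical; the prototype uses a simple DoFn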
I've also heard a rumor that there might be a textproto-based representation floating around too :-)

Best,
B

On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:

Hello Robert,

I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.

Best,

Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way toward making this easier). At the other extreme, the Dataflow project has the notion of templates: pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but fairly restrictive in that they are limited to pre-assembled pipelines taking a small number of parameters.

The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "wordcount.yaml"
      - type: PyMap
        fn: "str.lower"
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('[a-z]+', line)"
      - type: PyTransform
        name: Count
        constructor: "apache_beam.transforms.combiners.Count.PerElement"
      - type: PyMap
        fn: str
      - type: WriteToText
        file_path_prefix: "counts.txt"

Some more examples are at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a

A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as

python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this one could do

python -m apache_beam.yaml.main \
  --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
  --runner=apache_beam.runners.render.RenderRunner \
  --render_out=out.png

Alternatively, one can run it as a docker container with no need to install any SDK:

docker run --rm \
  --entrypoint /usr/local/bin/python \
  gcr.io/apache-beam-testing/yaml_template:dev \
  /dataflow/template/main.py \
  --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

though of course one would have to set up the appropriate mount points to do any local filesystem IO and/or to pass credentials.

This is also available as a Dataflow template and can be invoked as

gcloud dataflow flex-template run \
  "yaml-template-job" \
  --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
  --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
  --parameters pickle_library=cloudpickle \
  --project=apache-beam-testing \
  --region us-central1

(Note the escaping required for the parameter; use cat for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things out locally first.)

A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
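For instance, sticking to the syntax above, the constructor of a PyTransform could presumably name anything importable in the expansion environment, including a transform from a third-party package (the package and class below are invented for illustration):

- type: PyTransform
  name: Enrich
  constructor: "my_company.beam_transforms.Enrich"  # hypothetical pip-installable package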
There are many ways this could be extended. For example:

* It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-style notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself [1].

* It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).

* Publishing and discovery of transforms could be improved, possibly via shared standards and some kind of transform catalog. An ecosystem of easily shareable transforms (similar to what huggingface provides for ML models) could provide a useful platform for building pipelines and open Beam up to a whole new set of users.

Let me know what you think.

- Robert
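[1] Purely illustrative (no such syntax exists in the prototype), but a parameterized composite defined in yaml might look something like:

transform_types:                  # hypothetical section for user-defined types
  CountMatches:
    parameters: [pattern]
    transforms:
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('$pattern', line)"
      - type: PyTransform
        constructor: "apache_beam.transforms.combiners.Count.PerElement"

after which "type: CountMatches" could be used like any built-in transform, supplying "pattern" as an argument.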