+1 for these proposals, and I agree that they will simplify and demystify Beam for many new users. I think that, when combined with the x-lang/Schema-Aware transform binding, these might end up being adequate solutions for many production use-cases as well (unless users need to define custom composites, I/O connectors, etc.).
Also, thanks for providing prototype implementations with examples. - Cham

On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <dev@beam.apache.org> wrote:

> To build on Kenn's point, if we leverage existing stuff like dbt we get
> access to a ready-made community which can help drive both adoption and
> incremental innovation by bringing more folks to Beam.
>
> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> 1. I love the idea. Back in the early days people talked about an "XML
>> SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the
>> time. Portability, and specifically cross-language schema transforms,
>> gives the right infrastructure, so this is the perfect time: unique
>> names (URNs) for transforms and explicit lists of the parameters they
>> require.
>>
>> 2. I like the idea of re-using some existing thing like dbt if it is
>> pretty much what we were going to do anyhow. I don't think we should
>> hold ourselves back. I also don't think we'll gain anything in terms of
>> implementation. But at least it could fast-forward our design process,
>> since we simply wouldn't have to make most of the decisions; they'd
>> already be made for us.
>>
>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> And I guess also a PR, for completeness, to make it easier to find
>>> going forward instead of my random repo:
>>> https://github.com/apache/beam/pull/24670
>>>
>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Since Robert opened that can of worms (and we happened to talk about
>>>> it yesterday)... :-)
>>>>
>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>>>> This would be complementary, as it doesn't really provide a way of
>>>> specifying a pipeline so much as a way of orchestrating and packaging
>>>> a complex one. dbt itself supports SQL and Python Dataframes, which
>>>> both seem like reasonable things for Beam, and it wouldn't be a
>>>> stretch to include something like the format above. Though in my head
>>>> I had imagined people would tend to write composite transforms in the
>>>> SDK of their choosing that are then exposed at this layer. I decided
>>>> to go with dbt as it also provides a number of nice "quality of life"
>>>> features for its users, like documentation, validation, environments,
>>>> and so on.
>>>>
>>>> I did a really quick proof-of-viability implementation here:
>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>>
>>>> And you can see a really simple pipeline that reads a seed file
>>>> (TextIO), runs it through a couple of SQLTransforms, and then drops it
>>>> out to a logger via a simple DoFn here:
>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
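>>>>
>>>> To give a flavor of the dbt-style layout, here is a purely
>>>> illustrative sketch of a project description. Every name below is
>>>> made up, and the actual file format in the repo may well differ:
>>>>
>>>>   # Illustrative only: a dbt-style project that wires a seed file
>>>>   # through a chain of SQL models to a logging sink.
>>>>   name: simple_pipeline
>>>>   seeds:
>>>>     - name: words          # the seed CSV, read in via TextIO
>>>>   models:
>>>>     - name: lowercased     # a SQLTransform over the "words" seed
>>>>     - name: counted        # a second SQLTransform over "lowercased"
>>>>     - name: logged         # terminal step: a simple logging DoFn
>>>>
>>>> with each model's actual SQL living in its own file, as in dbt.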
>>>>
>>>> I've also heard a rumor that there might be a textproto-based
>>>> representation floating around too :-)
>>>>
>>>> Best,
>>>> B
>>>>
>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hello Robert,
>>>>>
>>>>> I'm replying to say that I've been waiting for something like this
>>>>> ever since I started learning Beam, and I'm grateful you are pushing
>>>>> this forward.
>>>>>
>>>>> Best,
>>>>>
>>>>> Damon
>>>>>
>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> While Beam provides powerful APIs for authoring sophisticated data
>>>>>> processing pipelines, it often still has too high a barrier to
>>>>>> getting started and authoring simple pipelines. Even setting up the
>>>>>> environment, installing the dependencies, and setting up the
>>>>>> project can be an overwhelming amount of boilerplate for some
>>>>>> (though https://beam.apache.org/blog/beam-starter-projects/ has
>>>>>> gone a long way in making this easier). At the other extreme, the
>>>>>> Dataflow project has the notion of templates, which are pre-built
>>>>>> Beam pipelines that can be easily launched from the command line or
>>>>>> even from your browser, but they are fairly restrictive, limited to
>>>>>> pre-assembled pipelines taking a small number of parameters.
>>>>>>
>>>>>> The idea of creating a yaml-based description of pipelines has come
>>>>>> up several times in several contexts, and this last week I decided
>>>>>> to code up what it could look like. Here's a proposal.
>>>>>>
>>>>>>   pipeline:
>>>>>>     - type: chain
>>>>>>       transforms:
>>>>>>         - type: ReadFromText
>>>>>>           args:
>>>>>>             file_pattern: "wordcount.yaml"
>>>>>>         - type: PyMap
>>>>>>           fn: "str.lower"
>>>>>>         - type: PyFlatMap
>>>>>>           fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>>>         - type: PyTransform
>>>>>>           name: Count
>>>>>>           constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>         - type: PyMap
>>>>>>           fn: str
>>>>>>         - type: WriteToText
>>>>>>           file_path_prefix: "counts.txt"
>>>>>>
>>>>>> Some more examples at
>>>>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>>>
>>>>>> A prototype (feedback welcome) can be found at
>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>>>
>>>>>> or
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>>>
>>>>>> For example, to play around with this one could do
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>     --runner=apache_beam.runners.render.RenderRunner \
>>>>>>     --render_out=out.png
>>>>>>
>>>>>> Alternatively, one can run it as a docker container with no need to
>>>>>> install any SDK:
>>>>>>
>>>>>>   docker run --rm \
>>>>>>     --entrypoint /usr/local/bin/python \
>>>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>>>>>     /dataflow/template/main.py \
>>>>>>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>>>
>>>>>> Though of course one would have to set up the appropriate mount
>>>>>> points to do any local filesystem I/O and/or pass in credentials.
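>>>>>>
>>>>>> As a purely illustrative sketch of one way to wire those mounts up
>>>>>> (assuming the image also ships the apache_beam.yaml.main module
>>>>>> used above; the local paths are placeholders), a docker-compose
>>>>>> file might look like:
>>>>>>
>>>>>>   services:
>>>>>>     yaml-pipeline:
>>>>>>       image: gcr.io/apache-beam-testing/yaml_template:dev
>>>>>>       entrypoint: /usr/local/bin/python
>>>>>>       command:
>>>>>>         - -m
>>>>>>         - apache_beam.yaml.main
>>>>>>         - --pipeline_spec_file=/data/pipeline.yaml
>>>>>>       volumes:
>>>>>>         # local files for ReadFromText / WriteToText
>>>>>>         - ./data:/data
>>>>>>         # application default credentials, if the pipeline needs them
>>>>>>         - ~/.config/gcloud:/root/.config/gcloud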
>>>>>>
>>>>>> This is also available as a Dataflow template and can be invoked as
>>>>>>
>>>>>>   gcloud dataflow flex-template run \
>>>>>>     "yaml-template-job" \
>>>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>     --parameters pickle_library=cloudpickle \
>>>>>>     --project=apache-beam-testing \
>>>>>>     --region us-central1
>>>>>>
>>>>>> (Note the escaping required for the parameter; use cat instead of
>>>>>> curl for a local file. The debug cycle here could be greatly
>>>>>> improved, so I'd recommend trying things locally first.)
>>>>>>
>>>>>> A key point of this implementation is that it heavily uses the
>>>>>> expansion service and cross-language transforms, tying into the
>>>>>> proposal at https://s.apache.org/easy-multi-language . Though all
>>>>>> the examples use transforms defined in the Beam SDK, any
>>>>>> appropriately packaged libraries may be used.
>>>>>>
>>>>>> There are many ways this could be extended. For example:
>>>>>>
>>>>>> * It would be useful to be able to templatize yaml descriptions.
>>>>>> This could be done with $SIGIL-type notation or some other way.
>>>>>> This would even allow one to define reusable, parameterized
>>>>>> composite PTransform types in yaml itself (see the sketch in the
>>>>>> P.S. below).
>>>>>>
>>>>>> * It would be good to have a more principled way of merging
>>>>>> environments. Currently each set of dependencies is a unique Beam
>>>>>> environment, and while Beam has sophisticated cross-language
>>>>>> capabilities, it would be nice if environments sharing the same
>>>>>> language (and likely also the same Beam version) could be fused
>>>>>> in-process (e.g. with separate class loaders or compatibility
>>>>>> checks for packages).
>>>>>>
>>>>>> * Publishing and discovery of transformations could be improved,
>>>>>> possibly via shared standards and some kind of a transform catalog.
>>>>>> An ecosystem of easily sharable transforms (similar to what
>>>>>> huggingface provides for ML models) could provide a useful platform
>>>>>> for making it easy to build pipelines and open up Beam to a whole
>>>>>> new set of users.
>>>>>>
>>>>>> Let me know what you think.
>>>>>>
>>>>>> - Robert
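>>>>>>
>>>>>> P.S. For concreteness, here is a purely hypothetical sketch of what
>>>>>> a templatized, reusable composite could look like. None of this
>>>>>> syntax is implemented, and every name below is made up for
>>>>>> illustration:
>>>>>>
>>>>>>   # Hypothetical: define a reusable composite with one parameter...
>>>>>>   - type: composite
>>>>>>     name: ReadAndCount
>>>>>>     parameters: [file_pattern]
>>>>>>     transforms:
>>>>>>       - type: ReadFromText
>>>>>>         args:
>>>>>>           file_pattern: $file_pattern
>>>>>>       - type: PyTransform
>>>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>
>>>>>>   # ...and later instantiate it like any other transform.
>>>>>>   - type: ReadAndCount
>>>>>>     args:
>>>>>>       file_pattern: "wordcount.yaml"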