Hello Robert,

I've been waiting for something like this ever since I started learning Beam, and I'm grateful you're pushing this forward.
Best,
Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
> While Beam provides powerful APIs for authoring sophisticated data
> processing pipelines, it often still has too high a barrier for
> getting started and authoring simple pipelines. Even setting up the
> environment, installing the dependencies, and setting up the project
> can be an overwhelming amount of boilerplate for some (though
> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
> way in making this easier). At the other extreme, the Dataflow project
> has the notion of templates which are pre-built Beam pipelines that
> can be easily launched from the command line, or even from your
> browser, but they are fairly restrictive, limited to pre-assembled
> pipelines taking a small number of parameters.
>
> The idea of creating a yaml-based description of pipelines has come up
> several times in several contexts and this last week I decided to code
> up what it could look like. Here's a proposal.
>
> pipeline:
>   - type: chain
>     transforms:
>       - type: ReadFromText
>         args:
>           file_pattern: "wordcount.yaml"
>       - type: PyMap
>         fn: "str.lower"
>       - type: PyFlatMap
>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>       - type: PyTransform
>         name: Count
>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>       - type: PyMap
>         fn: str
>       - type: WriteToText
>         file_path_prefix: "counts.txt"
>
> Some more examples at
> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>
> A prototype (feedback welcome) can be found at
> https://github.com/apache/beam/pull/24667.
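[As a reader's aside: the PyMap/PyFlatMap semantics in the spec above can be mimicked with a tiny stdlib-only interpreter. This is an illustrative sketch only, not the prototype's actual code — `run_chain` and its (type, fn) step tuples are invented here to show how the `fn` strings compose; the real implementation builds genuine Beam transforms.]

```python
import re
from collections import Counter

# Miniature, illustrative interpreter for a "chain" of transforms:
# PyMap applies fn to every element; PyFlatMap expects fn to return
# an iterable, which is flattened into the output.
def run_chain(elements, steps):
    data = list(elements)
    for kind, fn_src in steps:
        fn = eval(fn_src)  # fn strings as in the YAML spec, e.g. "str.lower"
        if kind == "PyMap":
            data = [fn(x) for x in data]
        elif kind == "PyFlatMap":
            data = [y for x in data for y in fn(x)]
        else:
            raise ValueError("unknown transform type: %s" % kind)
    return data

# The word-extraction steps from the wordcount spec, applied to two lines.
words = run_chain(
    ["Hello world", "hello Beam"],
    [("PyMap", "str.lower"),
     ("PyFlatMap", "lambda line: re.findall('[a-z]+', line)")],
)
counts = Counter(words)  # stands in for Count.PerElement
```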
> It can be invoked as
>
>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>
> or
>
>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>
> For example, to play around with this one could do
>
>   python -m apache_beam.yaml.main \
>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>     --runner=apache_beam.runners.render.RenderRunner \
>     --render_out=out.png
>
> Alternatively one can run it as a docker container with no need to
> install any SDK:
>
>   docker run --rm \
>     --entrypoint /usr/local/bin/python \
>     gcr.io/apache-beam-testing/yaml_template:dev \
>     /dataflow/template/main.py \
>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>
> Though of course one would have to set up the appropriate mount points
> to do any local filesystem IO and/or credentials.
>
> This is also available as a Dataflow template and can be invoked as
>
>   gcloud dataflow flex-template run \
>     "yaml-template-job" \
>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>     --parameters pickle_library=cloudpickle \
>     --project=apache-beam-testing \
>     --region us-central1
>
> (Note the escaping required for the parameter (use cat for a local
> file), and the debug cycle here could be greatly improved, so I'd
> recommend trying things locally first.)
>
> A key point of this implementation is that it heavily uses the
> expansion service and cross-language transforms, tying into the
> proposal at https://s.apache.org/easy-multi-language.
> Though all the examples use transforms defined in the Beam SDK, any
> appropriately packaged libraries may be used.
>
> There are many ways this could be extended. For example:
>
> * It would be useful to be able to templatize yaml descriptions. This
>   could be done with $SIGIL-type notation or some other way. This would
>   even allow one to define reusable, parameterized composite PTransform
>   types in yaml itself.
>
> * It would be good to have a more principled way of merging
>   environments. Currently each set of dependencies is a unique Beam
>   environment, and while Beam has sophisticated cross-language
>   capabilities, it would be nice if environments sharing the same
>   language (and likely also the same Beam version) could be fused
>   in-process (e.g. with separate class loaders or compatibility checks
>   for packages).
>
> * Publishing and discovery of transformations could be improved,
>   possibly via shared standards and some kind of a transform catalog. An
>   ecosystem of easily sharable transforms (similar to what Hugging Face
>   provides for ML models) could provide a useful platform for making it
>   easy to build pipelines and open up Beam to a whole new set of users.
>
> Let me know what you think.
>
> - Robert
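[Another reader's aside on the templatization bullet: the $-substitution could be as thin as a pass over the yaml text before it is parsed. A minimal sketch — the `$input`/`$output` parameter names and the use of Python's stdlib `string.Template` are my assumptions here, not part of the proposal:]

```python
import string

# Hypothetical template: $input and $output are invented placeholder
# names, substituted before the yaml is handed to the pipeline builder.
TEMPLATE = """\
pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "$input"
      - type: WriteToText
        file_path_prefix: "$output"
"""

# string.Template gives exactly the $SIGIL-style substitution mentioned
# in the proposal; substitute() raises KeyError on missing parameters.
spec = string.Template(TEMPLATE).substitute(
    input="wordcount.yaml", output="counts.txt")
```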