And I guess also a PR, for completeness, so it's easier to find going forward instead of living in my random repo: https://github.com/apache/beam/pull/24670
On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:

Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, since it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python Dataframes, both of which seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt because it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.

I did a really quick proof-of-viability implementation here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline

I've also heard a rumor that there might be a textproto-based representation floating around too :-)

Best,
B

On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:

Hello Robert,

I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.

Best,

Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way toward making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can easily be launched from the command line, or even from your browser, but they are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.

The idea of creating a YAML-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:

    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "wordcount.yaml"
          - type: PyMap
            fn: "str.lower"
          - type: PyFlatMap
            fn: "import re\nlambda line: re.findall('[a-z]+', line)"
          - type: PyTransform
            name: Count
            constructor: "apache_beam.transforms.combiners.Count.PerElement"
          - type: PyMap
            fn: str
          - type: WriteToText
            file_path_prefix: "counts.txt"

Some more examples at
https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
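For comparison, the wordcount description above corresponds roughly to the following hand-written Python SDK pipeline (a sketch built from the standard apache_beam transforms that the YAML types name; it is not taken from the prototype itself):

    # Rough Python SDK equivalent of the YAML wordcount above (sketch only).
    import re

    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.io.ReadFromText("wordcount.yaml")
            | beam.Map(str.lower)
            | beam.FlatMap(lambda line: re.findall('[a-z]+', line))
            | "Count" >> beam.combiners.Count.PerElement()
            | beam.Map(str)
            | beam.io.WriteToText("counts.txt")
        )

The YAML form trades this boilerplate (imports, pipeline construction, packaging) for a declarative description that can be launched without writing any Python.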
A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as

    python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

    python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this one could do

    python -m apache_beam.yaml.main \
      --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
      --runner=apache_beam.runners.render.RenderRunner \
      --render_out=out.png

Alternatively, one can run it as a docker container with no need to install any SDK:

    docker run --rm \
      --entrypoint /usr/local/bin/python \
      gcr.io/apache-beam-testing/yaml_template:dev \
      /dataflow/template/main.py \
      --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

though of course one would have to set up the appropriate mount points and/or credentials to do any local filesystem I/O.

This is also available as a Dataflow template and can be invoked as

    gcloud dataflow flex-template run \
      "yaml-template-job" \
      --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
      --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
      --parameters pickle_library=cloudpickle \
      --project=apache-beam-testing \
      --region us-central1

(Note the escaping required for the pipeline_spec parameter; use cat for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things locally first.)

A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language. Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.

There are many ways this could be extended. For example:

* It would be useful to be able to templatize YAML descriptions. This could be done with $SIGIL-type notation or some other way. It would even allow one to define reusable, parameterized composite PTransform types in YAML itself (a rough illustration follows at the end of this thread).

* It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).

* Publishing and discovery of transforms could be improved, possibly via shared standards and some kind of transform catalog. An ecosystem of easily sharable transforms (similar to what Hugging Face provides for ML models) could provide a useful platform for making it easy to build pipelines and open Beam up to a whole new set of users.

Let me know what you think.

- Robert
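To make the first extension point above concrete: one possible reading of "$SIGIL-type notation" is plain $name placeholders, which the Python standard library's string.Template already supports. The sketch below is only an assumption about how such templating might look (it is not part of the prototype); it substitutes parameters into a description written in the proposal's format and then hands the result to the prototype's apache_beam.yaml.main entry point exactly as in the examples above.

    # Hypothetical $name-placeholder templating of a YAML pipeline description,
    # using only the Python standard library. Not part of the prototype.
    import string
    import subprocess

    PIPELINE_TEMPLATE = """\
    pipeline:
      - type: chain
        transforms:
          - type: ReadFromText
            args:
              file_pattern: "$input"
          - type: PyMap
            fn: "str.lower"
          - type: WriteToText
            file_path_prefix: "$output"
    """


    def render_spec(**params):
        # string.Template substitutes $name placeholders, one possible sigil notation.
        return string.Template(PIPELINE_TEMPLATE).substitute(**params)


    if __name__ == "__main__":
        spec = render_spec(input="wordcount.yaml", output="counts.txt")
        # Hand the rendered description to the prototype's CLI, as shown above.
        subprocess.run(
            ["python", "-m", "apache_beam.yaml.main", "--pipeline_spec", spec],
            check=True,
        )

A built-in mechanism would presumably also cover validation and defaults, but even this much is enough to turn a single YAML description into a reusable, parameterized pipeline.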