1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, gives us the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.
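For instance, that would let a pipeline description in any language address a transform by a stable URN and pass exactly the parameters it declares. A minimal sketch (the URN, the ExternalTransform spelling, and the parameters are all made up for illustration, not an existing Beam registration):

  - type: ExternalTransform
    urn: "beam:transform:org.example:dedup:v1"   # hypothetical URN
    args:                                        # the explicitly declared parameter list
      key_fields: ["user_id"]
      expansion_service: "localhost:8097"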
2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I don't think we'll gain anything in terms of implementation, but at least it could fast-forward our design process because most of the decisions would already be made for us.

On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:

> And I guess also a PR for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670
>
> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:
>
>> Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)
>>
>> I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python Dataframes, which both seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format Robert proposed. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt because it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.
>>
>> I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>
>> And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here [a rough rendering of a pipeline of this shape in the proposed YAML format appears after the quoted thread]: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>
>> I've also heard a rumor there might be a textproto-based representation floating around too :-)
>>
>> Best,
>> B
>>
>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:
>>
>>> Hello Robert,
>>>
>>> I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.
>>>
>>> Best,
>>>
>>> Damon
>>>
>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way in making this easier). At the other extreme, the Dataflow project has the notion of templates, which are pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but they are fairly restrictive, limited to pre-assembled pipelines taking a small number of parameters.
>>>>
>>>> The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:
>>>>
>>>> pipeline:
>>>>   - type: chain
>>>>     transforms:
>>>>       - type: ReadFromText
>>>>         args:
>>>>           file_pattern: "wordcount.yaml"
>>>>       - type: PyMap
>>>>         fn: "str.lower"
>>>>       - type: PyFlatMap
>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>       - type: PyTransform
>>>>         name: Count
>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>       - type: PyMap
>>>>         fn: str
>>>>       - type: WriteToText
>>>>         file_path_prefix: "counts.txt"
>>>>
>>>> Some more examples at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>
>>>> A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>
>>>>   python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>
>>>> or
>>>>
>>>>   python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>
>>>> For example, to play around with this, one could do
>>>>
>>>>   python -m apache_beam.yaml.main \
>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>     --runner=apache_beam.runners.render.RenderRunner \
>>>>     --render_out=out.png
>>>>
>>>> Alternatively, one can run it as a docker container with no need to install any SDK:
>>>>
>>>>   docker run --rm \
>>>>     --entrypoint /usr/local/bin/python \
>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>>>     /dataflow/template/main.py \
>>>>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>
>>>> though of course one would have to set up the appropriate mount points to do any local filesystem IO and/or provide credentials.
>>>>
>>>> This is also available as a Dataflow template and can be invoked as
>>>>
>>>>   gcloud dataflow flex-template run \
>>>>     "yaml-template-job" \
>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>     --parameters pickle_library=cloudpickle \
>>>>     --project=apache-beam-testing \
>>>>     --region us-central1
>>>>
>>>> (Note the escaping required for the pipeline_spec parameter; use cat instead of curl for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things locally first.)
>>>>
>>>> A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
>>>>
>>>> There are many ways this could be extended. For example:
>>>>
>>>> * It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-style notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself [see the sketch after the quoted thread].
>>>>
>>>> * It would be good to have a more principled way of merging environments.
>>>> Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).
>>>>
>>>> * Publishing and discovery of transformations could be improved, possibly via shared standards and some kind of a transform catalog [again, see the sketches after the thread]. An ecosystem of easily sharable transforms (similar to what huggingface provides for ML models) could provide a useful platform for making it easy to build pipelines and open up Beam to a whole new set of users.
>>>>
>>>> Let me know what you think.
>>>>
>>>> - Robert
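To make Byron's simple_pipeline example concrete, here is a rough sketch of a pipeline of that shape (a seed file read via TextIO, a couple of SQL transforms, then a logging step) rendered in Robert's proposed YAML format. The SqlTransform name, its arguments, and the PyMap stand-in for the logging DoFn are guesses for illustration, not the actual SPD or prototype syntax:

  pipeline:
    - type: chain
      transforms:
        - type: ReadFromText                # the seed file
          args:
            file_pattern: "seeds/users.csv"
        - type: SqlTransform                # hypothetical name and args
          query: "SELECT UPPER(name) AS name FROM PCOLLECTION"
        - type: SqlTransform                # hypothetical
          query: "SELECT DISTINCT name FROM PCOLLECTION"
        - type: PyMap                       # stand-in for the logging DoFn
          fn: "lambda row: print(row) or row"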
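On the templatization bullet, a reusable, parameterized composite PTransform defined in yaml itself might look something like the sketch below; the top-level transforms/parameters blocks and the $-substitution are purely illustrative, not something the prototype implements:

  transforms:                               # hypothetical composite definition
    - name: ReadWords
      parameters: [file_pattern]
      type: chain
      transforms:
        - type: ReadFromText
          args:
            file_pattern: $file_pattern     # substituted at expansion time
        - type: PyFlatMap
          fn: "import re\nlambda line: re.findall('[a-z]+', line)"

  pipeline:
    - type: chain
      transforms:
        - type: ReadWords                   # used like any built-in transform
          args:
            file_pattern: "wordcount.yaml"
        - type: PyTransform
          name: Count
          constructor: "apache_beam.transforms.combiners.Count.PerElement"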
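And on publishing and discovery, a catalog entry might need little more than the transform's URN, a short description, the parameters it requires, and packaging coordinates, e.g. (all field names invented for illustration):

  - urn: "beam:transform:org.example:dedup:v1"
    description: "Removes duplicate rows by key."
    parameters:
      key_fields:
        type: "list of strings"
        required: true
    packaging:
      java: "org.example:dedup-transforms:1.2.0"   # artifact serving the expansion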