I like the idea of a common spec for something like this so we can actually cross-validate all the SDK behaviours. It would make testing significantly easier.
On Wed, Dec 14, 2022, 2:57 PM Kenneth Knowles <k...@apache.org> wrote:

1. I love the idea. Back in the early days people talked about an "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time. Portability, and specifically cross-language schema transforms, give us the right infrastructure, so this is the perfect time: unique names (URNs) for transforms and explicit lists of the parameters they require.

2. I like the idea of re-using some existing thing like dbt if it is pretty much what we were going to do anyhow. I don't think we should hold ourselves back, and I don't think we'll gain anything in terms of implementation, but at least it could fast-forward our design process, because most of the decisions would simply be made for us.

On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org> wrote:

And I guess also a PR, for completeness, to make it easier to find going forward instead of my random repo: https://github.com/apache/beam/pull/24670

On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com> wrote:

Since Robert opened that can of worms (and we happened to talk about it yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This would be complementary, as it doesn't really provide a way of specifying a pipeline so much as a way of orchestrating and packaging a complex one. dbt itself supports SQL and Python DataFrames, both of which seem like reasonable things for Beam, and it wouldn't be a stretch to include something like the format above. Though in my head I had imagined people would tend to write composite transforms in the SDK of their choosing that are then exposed at this layer. I decided to go with dbt as it also provides a number of nice "quality of life" features for its users, like documentation, validation, environments, and so on.

I did a really quick proof-of-viability implementation here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO), runs it through a couple of SQLTransforms, and then drops it out to a logger via a simple DoFn here: https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
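In Robert's format above, the shape of that pipeline might look something like the following (purely illustrative: the SqlTransform and LogElements type names are invented here, and the actual prototype is Java, driven by dbt-style project files rather than a single spec):

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText            # the seed file
        args:
          file_pattern: "seed.csv"
      - type: SqlTransform            # hypothetical name for a SQL stage
        query: "SELECT word FROM PCOLLECTION"
      - type: SqlTransform
        query: "SELECT word, count(*) AS c FROM PCOLLECTION GROUP BY word"
      - type: LogElements             # hypothetical; the prototype uses a simple DoFn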
I've also heard a rumor that there might be a textproto-based representation floating around too :-)

Best,
B

On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <dev@beam.apache.org> wrote:

Hello Robert,

I'm replying to say that I've been waiting for something like this ever since I started learning Beam, and I'm grateful you are pushing this forward.

Best,

Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com> wrote:

While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it often still has too high a barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate for some (though https://beam.apache.org/blog/beam-starter-projects/ has gone a long way toward making this easier). At the other extreme, the Dataflow project has the notion of templates: pre-built Beam pipelines that can be easily launched from the command line, or even from your browser, but fairly restrictive in that they are limited to pre-assembled pipelines taking a small number of parameters.

The idea of creating a yaml-based description of pipelines has come up several times in several contexts, and this last week I decided to code up what it could look like. Here's a proposal:

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "wordcount.yaml"
      - type: PyMap
        fn: "str.lower"
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('[a-z]+', line)"
      - type: PyTransform
        name: Count
        constructor: "apache_beam.transforms.combiners.Count.PerElement"
      - type: PyMap
        fn: str
      - type: WriteToText
        file_path_prefix: "counts.txt"

Some more examples are at https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a

A prototype (feedback welcome) can be found at https://github.com/apache/beam/pull/24667. It can be invoked as

python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]

or

python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]

For example, to play around with this one could do

python -m apache_beam.yaml.main \
  --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
  --runner=apache_beam.runners.render.RenderRunner \
  --render_out=out.png

Alternatively, one can run it as a docker container with no need to install any SDK:

docker run --rm \
  --entrypoint /usr/local/bin/python \
  gcr.io/apache-beam-testing/yaml_template:dev \
  /dataflow/template/main.py \
  --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

though of course one would have to set up the appropriate mount points to do any local filesystem IO and/or to pass credentials.

This is also available as a Dataflow template and can be invoked as

gcloud dataflow flex-template run \
  "yaml-template-job" \
  --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
  --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
  --parameters pickle_library=cloudpickle \
  --project=apache-beam-testing \
  --region us-central1

(Note the escaping required for the parameter; use cat for a local file. The debug cycle here could be greatly improved, so I'd recommend trying things out locally first.)

A key point of this implementation is that it heavily uses the expansion service and cross-language transforms, tying into the proposal at https://s.apache.org/easy-multi-language . Though all the examples use transforms defined in the Beam SDK, any appropriately packaged libraries may be used.
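For instance, sticking to the syntax above, the constructor of a PyTransform could presumably name anything importable in the expansion environment, including a transform from a third-party package (the package and class below are invented for illustration):

- type: PyTransform
  name: Enrich
  constructor: "my_company.beam_transforms.Enrich"  # hypothetical pip-installable package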
There are many ways this could be extended. For example:

* It would be useful to be able to templatize yaml descriptions. This could be done with $SIGIL-style notation or some other way. This would even allow one to define reusable, parameterized composite PTransform types in yaml itself [1].

* It would be good to have a more principled way of merging environments. Currently each set of dependencies is a unique Beam environment, and while Beam has sophisticated cross-language capabilities, it would be nice if environments sharing the same language (and likely also the same Beam version) could be fused in-process (e.g. with separate class loaders or compatibility checks for packages).

* Publishing and discovery of transforms could be improved, possibly via shared standards and some kind of transform catalog. An ecosystem of easily shareable transforms (similar to what huggingface provides for ML models) could provide a useful platform for building pipelines and open Beam up to a whole new set of users.

Let me know what you think.

- Robert
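[1] Purely illustrative (no such syntax exists in the prototype), but a parameterized composite defined in yaml might look something like:

transform_types:                  # hypothetical section for user-defined types
  CountMatches:
    parameters: [pattern]
    transforms:
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('$pattern', line)"
      - type: PyTransform
        constructor: "apache_beam.transforms.combiners.Count.PerElement"

after which "type: CountMatches" could be used like any built-in transform, supplying "pattern" as an argument.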