+1 for these proposals, and I agree that they will simplify and demystify Beam for many new users. I think that, when combined with the x-lang/Schema-Aware transform binding, these might end up being adequate solutions for many production use-cases as well (unless users need to define custom composites, I/O connectors, etc.).
Also, thanks for providing prototype implementations with examples. - Cham

On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <dev@beam.apache.org> wrote:

> To build on Kenn's point, if we leverage existing stuff like dbt we get
> access to a ready-made community which can help drive both adoption and
> incremental innovation by bringing more folks to Beam.
>
> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> 1. I love the idea. Back in the early days people talked about an "XML
>> SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at the
>> time. Portability, and specifically cross-language schema transforms,
>> gives the right infrastructure, so this is the perfect time: unique
>> names (URNs) for transforms and explicit lists of the parameters they
>> require.
>>
>> 2. I like the idea of re-using some existing thing like dbt if it is
>> pretty much what we were going to do anyhow. I don't think we should
>> hold ourselves back. I also don't think we'll gain anything in terms of
>> implementation. But at least it could fast-forward our design process,
>> since we simply wouldn't have to make most of the decisions; they'd
>> already be made for us.
>>
>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org>
>> wrote:
>>
>>> And I guess also a PR, for completeness, to make it easier to find
>>> going forward instead of my random repo:
>>> https://github.com/apache/beam/pull/24670
>>>
>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Since Robert opened that can of worms (and we happened to talk about
>>>> it yesterday)... :-)
>>>>
>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>>>> This would be complementary, as it doesn't really provide a way of
>>>> specifying a pipeline so much as a way of orchestrating and packaging
>>>> a complex one. dbt itself supports SQL and Python Dataframes, which
>>>> both seem like reasonable things for Beam, and it wouldn't be a
>>>> stretch to include something like the format above. Though in my head
>>>> I had imagined people would tend to write composite transforms in the
>>>> SDK of their choosing that are then exposed at this layer. I decided
>>>> to go with dbt as it also provides a number of nice "quality of life"
>>>> features for its users, like documentation, validation, environments,
>>>> and so on.
>>>>
>>>> I did a really quick proof-of-viability implementation here:
>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>>
>>>> And you can see a really simple pipeline that reads a seed file
>>>> (TextIO), runs it through a couple of SQLTransforms, and then drops it
>>>> out to a logger via a simple DoFn here:
>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
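>>>>
>>>> To give a flavor of the dbt-style layout, here is a purely
>>>> illustrative sketch of a project description. Every name below is
>>>> made up, and the actual file format in the repo may well differ:
>>>>
>>>>   # Illustrative only: a dbt-style project that wires a seed file
>>>>   # through a chain of SQL models to a logging sink.
>>>>   name: simple_pipeline
>>>>   seeds:
>>>>     - name: words          # the seed CSV, read in via TextIO
>>>>   models:
>>>>     - name: lowercased     # a SQLTransform over the "words" seed
>>>>     - name: counted        # a second SQLTransform over "lowercased"
>>>>     - name: logged         # terminal step: a simple logging DoFn
>>>>
>>>> with each model's actual SQL living in its own file, as in dbt.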
>>>>
>>>> I've also heard a rumor that there might be a textproto-based
>>>> representation floating around too :-)
>>>>
>>>> Best,
>>>> B
>>>>
>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hello Robert,
>>>>>
>>>>> I'm replying to say that I've been waiting for something like this
>>>>> ever since I started learning Beam, and I'm grateful you are pushing
>>>>> this forward.
>>>>>
>>>>> Best,
>>>>>
>>>>> Damon
>>>>>
>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> While Beam provides powerful APIs for authoring sophisticated data
>>>>>> processing pipelines, it often still has too high a barrier to
>>>>>> getting started and authoring simple pipelines. Even setting up the
>>>>>> environment, installing the dependencies, and setting up the
>>>>>> project can be an overwhelming amount of boilerplate for some
>>>>>> (though https://beam.apache.org/blog/beam-starter-projects/ has
>>>>>> gone a long way in making this easier). At the other extreme, the
>>>>>> Dataflow project has the notion of templates, which are pre-built
>>>>>> Beam pipelines that can be easily launched from the command line or
>>>>>> even from your browser, but they are fairly restrictive, limited to
>>>>>> pre-assembled pipelines taking a small number of parameters.
>>>>>>
>>>>>> The idea of creating a yaml-based description of pipelines has come
>>>>>> up several times in several contexts, and this last week I decided
>>>>>> to code up what it could look like. Here's a proposal.
>>>>>>
>>>>>>   pipeline:
>>>>>>     - type: chain
>>>>>>       transforms:
>>>>>>         - type: ReadFromText
>>>>>>           args:
>>>>>>             file_pattern: "wordcount.yaml"
>>>>>>         - type: PyMap
>>>>>>           fn: "str.lower"
>>>>>>         - type: PyFlatMap
>>>>>>           fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>>>         - type: PyTransform
>>>>>>           name: Count
>>>>>>           constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>         - type: PyMap
>>>>>>           fn: str
>>>>>>         - type: WriteToText
>>>>>>           file_path_prefix: "counts.txt"
>>>>>>
>>>>>> Some more examples at
>>>>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>>>
>>>>>> A prototype (feedback welcome) can be found at
>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>>>
>>>>>> or
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>>>
>>>>>> For example, to play around with this one could do
>>>>>>
>>>>>>   python -m apache_beam.yaml.main \
>>>>>>     --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>     --runner=apache_beam.runners.render.RenderRunner \
>>>>>>     --render_out=out.png
>>>>>>
>>>>>> Alternatively, one can run it as a docker container with no need to
>>>>>> install any SDK:
>>>>>>
>>>>>>   docker run --rm \
>>>>>>     --entrypoint /usr/local/bin/python \
>>>>>>     gcr.io/apache-beam-testing/yaml_template:dev \
>>>>>>     /dataflow/template/main.py \
>>>>>>     --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>>>
>>>>>> Though of course one would have to set up the appropriate mount
>>>>>> points to do any local filesystem I/O and/or pass in credentials.
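>>>>>>
>>>>>> As a purely illustrative sketch of one way to wire those mounts up
>>>>>> (assuming the image also ships the apache_beam.yaml.main module
>>>>>> used above; the local paths are placeholders), a docker-compose
>>>>>> file might look like:
>>>>>>
>>>>>>   services:
>>>>>>     yaml-pipeline:
>>>>>>       image: gcr.io/apache-beam-testing/yaml_template:dev
>>>>>>       entrypoint: /usr/local/bin/python
>>>>>>       command:
>>>>>>         - -m
>>>>>>         - apache_beam.yaml.main
>>>>>>         - --pipeline_spec_file=/data/pipeline.yaml
>>>>>>       volumes:
>>>>>>         # local files for ReadFromText / WriteToText
>>>>>>         - ./data:/data
>>>>>>         # application default credentials, if the pipeline needs them
>>>>>>         - ~/.config/gcloud:/root/.config/gcloud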
>>>>>>
>>>>>> This is also available as a Dataflow template and can be invoked as
>>>>>>
>>>>>>   gcloud dataflow flex-template run \
>>>>>>     "yaml-template-job" \
>>>>>>     --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>>>     --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>     --parameters pickle_library=cloudpickle \
>>>>>>     --project=apache-beam-testing \
>>>>>>     --region us-central1
>>>>>>
>>>>>> (Note the escaping required for the parameter; use cat instead of
>>>>>> curl for a local file. The debug cycle here could be greatly
>>>>>> improved, so I'd recommend trying things locally first.)
>>>>>>
>>>>>> A key point of this implementation is that it heavily uses the
>>>>>> expansion service and cross-language transforms, tying into the
>>>>>> proposal at https://s.apache.org/easy-multi-language . Though all
>>>>>> the examples use transforms defined in the Beam SDK, any
>>>>>> appropriately packaged libraries may be used.
>>>>>>
>>>>>> There are many ways this could be extended. For example:
>>>>>>
>>>>>> * It would be useful to be able to templatize yaml descriptions.
>>>>>> This could be done with $SIGIL-type notation or some other way.
>>>>>> This would even allow one to define reusable, parameterized
>>>>>> composite PTransform types in yaml itself (see the sketch in the
>>>>>> P.S. below).
>>>>>>
>>>>>> * It would be good to have a more principled way of merging
>>>>>> environments. Currently each set of dependencies is a unique Beam
>>>>>> environment, and while Beam has sophisticated cross-language
>>>>>> capabilities, it would be nice if environments sharing the same
>>>>>> language (and likely also the same Beam version) could be fused
>>>>>> in-process (e.g. with separate class loaders or compatibility
>>>>>> checks for packages).
>>>>>>
>>>>>> * Publishing and discovery of transformations could be improved,
>>>>>> possibly via shared standards and some kind of a transform catalog.
>>>>>> An ecosystem of easily sharable transforms (similar to what
>>>>>> huggingface provides for ML models) could provide a useful platform
>>>>>> for making it easy to build pipelines and open up Beam to a whole
>>>>>> new set of users.
>>>>>>
>>>>>> Let me know what you think.
>>>>>>
>>>>>> - Robert
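>>>>>>
>>>>>> P.S. For concreteness, here is a purely hypothetical sketch of what
>>>>>> a templatized, reusable composite could look like. None of this
>>>>>> syntax is implemented, and every name below is made up for
>>>>>> illustration:
>>>>>>
>>>>>>   # Hypothetical: define a reusable composite with one parameter...
>>>>>>   - type: composite
>>>>>>     name: ReadAndCount
>>>>>>     parameters: [file_pattern]
>>>>>>     transforms:
>>>>>>       - type: ReadFromText
>>>>>>         args:
>>>>>>           file_pattern: $file_pattern
>>>>>>       - type: PyTransform
>>>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>
>>>>>>   # ...and later instantiate it like any other transform.
>>>>>>   - type: ReadAndCount
>>>>>>     args:
>>>>>>       file_pattern: "wordcount.yaml"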