This is great! I developed a similar template a year or two ago as a
reference for a customer to speed up their development process, and
unsurprisingly it did just that.
Here's an example of the config layout I came up with at the time:

options:
  runner: DirectRunner

pipeline:
# - &messages
#   label: PubSub XML source
#   transform:
#     !PTransform:apache_beam.io.ReadFromPubSub
#     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
- &message_source_1
  label: XML source 1
  transform:
    !PTransform:apache_beam.Create
    values:
    - /path/to/file.xml
- &message_source_2
  label: XML source 2
  transform:
    !PTransform:apache_beam.Create
    values:
    - /path/to/another/file.xml
- &message_xml
  label: XMLs
  inputs:
  - step: *message_source_1
  - step: *message_source_2
  transform:
    !PTransform:utils.transforms.ParseXmlDocument {}
- &validated_messages
  label: Validate XMLs
  inputs:
  - step: *message_xml
    tag: success
  transform:
    !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
    schema: /path/to/file.xsd
- &converted_messages
  label: Convert XMLs
  inputs:
  - step: *validated_messages
  transform:
    !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
    schema: /path/to/file.xsd
- label: Print XMLs
  inputs:
  - step: *converted_messages
  transform:
    !PTransform:utils.transforms.Print {}

Highlights:
* Pipeline options are supplied under an options property.
* The pipeline itself is a flat list of all the transforms it contains.
* Transforms are defined using a YAML tag with named properties, and can be
  reused elsewhere by constructing a YAML reference.
* DAG construction is done using a simple topological sort of transforms and
  their dependencies (see the sketch below).
* Named side outputs can be referenced using a tag field.
* Multiple inputs are merged with a Flatten transform.
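
(For the curious, the core of the loader and DAG assembly was roughly the
sketch below. It's a simplified reconstruction rather than the real code:
it assumes PyYAML, ignores the side-output tag field, and does no error or
cycle checking.)

import importlib

import yaml
import apache_beam as beam


def construct_ptransform(loader, suffix, node):
    # Resolve a '!PTransform:pkg.module.Class' tag to a transform instance,
    # passing the mapping under the tag as keyword arguments.
    module_name, _, class_name = suffix.rpartition('.')
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = (loader.construct_mapping(node, deep=True)
              if isinstance(node, yaml.MappingNode) else {})
    return cls(**kwargs)


yaml.SafeLoader.add_multi_constructor('!PTransform:', construct_ptransform)


def build(pipeline, steps):
    # Apply steps in dependency order: a naive topological sort that keeps
    # sweeping the remaining steps until all of a step's inputs are built.
    outputs = {}  # id of a step's dict -> its output PCollection
    remaining = list(steps)
    while remaining:
        for step in list(remaining):
            # A YAML reference resolves to the same dict object as the step
            # it aliases, so identity lookups work for dependencies.
            deps = [inp['step'] for inp in step.get('inputs', [])]
            if any(id(dep) not in outputs for dep in deps):
                continue  # an input isn't built yet; catch it next sweep
            if not deps:
                source = pipeline  # a root transform such as Create
            elif len(deps) == 1:
                source = outputs[id(deps[0])]
            else:
                # Multiple inputs are merged with a Flatten transform.
                pcolls = tuple(outputs[id(dep)] for dep in deps)
                source = pcolls | (step['label'] + '/Flatten') >> beam.Flatten()
            outputs[id(step)] = source | step['label'] >> step['transform']
            remaining.remove(step)
    return outputs

Running a spec then boils down to yaml.safe_load on the file followed by
build(p, spec['pipeline']) inside a pipeline context.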

Not sure if there's any inspiration left to take from this, but I figured
I'd throw it up here to share.

Cheers,

Steve

On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> +1 for these proposals and agree that these will simplify and demystify
> Beam for many new users. I think when combined with the x-lang/Schema-Aware
> transform binding, these might end up being adequate solutions for many
> production use-cases as well (unless users need to define custom
> composites, I/O connectors, etc.).
>
> Also, thanks for providing prototype implementations with examples.
>
> - Cham
>
>
> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> To build on Kenn's point, if we leverage existing stuff like dbt, we get
>> access to a ready-made community which can help drive both adoption and
>> incremental innovation by bringing more folks to Beam.
>>
>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> 1. I love the idea. Back in the early days people talked about an "XML
>>> SDK" or "JSON SDK" or "YAML SDK", and it didn't really make sense at the
>>> time. Portability, and specifically cross-language schema transforms,
>>> gives us the right infrastructure, so this is the perfect time: unique
>>> names (URNs) for transforms and explicit lists of the parameters they
>>> require.
>>>
>>> 2. I like the idea of re-using some existing thing like dbt if it is
>>> pretty much what we were going to do anyhow. I don't think we should hold
>>> ourselves back. I also don't think we'll gain anything in terms of
>>> implementation. But at least it could fast-forward our design process
>>> because most of the decisions have already been made for us.
>>>
>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <dev@beam.apache.org>
>>> wrote:
>>>
>>>> And I guess also a PR for completeness to make it easier to find going
>>>> forward instead of my random repo:
>>>> https://github.com/apache/beam/pull/24670
>>>>
>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com>
>>>> wrote:
>>>>
>>>>> Since Robert opened that can of worms (and we happened to talk about
>>>>> it yesterday)... :-)
>>>>>
>>>>> I figured I'd also share my start on a "port" of dbt to the Beam SDK.
>>>>> This would be complementary, as it doesn't really provide a way of
>>>>> specifying a pipeline so much as a way of orchestrating and packaging a
>>>>> complex pipeline. dbt itself supports SQL and Python Dataframes, which
>>>>> both seem like reasonable things for Beam, and it wouldn't be a stretch
>>>>> to include something like the format above.
>>>>> would tend to write composite transforms in the SDK of their choosing that
>>>>> are then exposed at this layer. I decided to go with dbt as it also
>>>>> provides a number of nice "quality of life" features for its users like
>>>>> documentation, validation, environments, and so on.
>>>>>
>>>>> I did a really quick proof-of-viability implementation here:
>>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>>>>
>>>>> And you can see a really simple pipeline that reads a seed file
>>>>> (TextIO), runs it through a couple of SQLTransforms and then drops it out
>>>>> to a logger via a simple DoFn here:
>>>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>>>>
>>>>> I've also heard a rumor there might also be a textproto-based
>>>>> representation floating around too :-)
>>>>>
>>>>> Best,
>>>>> B
>>>>>
>>>>>
>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hello Robert,
>>>>>>
>>>>>> I'm replying to say that I've been waiting for something like this
>>>>>> ever since I started learning Beam and I'm grateful you are pushing this
>>>>>> forward.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Damon
>>>>>>
>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <rober...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> While Beam provides powerful APIs for authoring sophisticated data
>>>>>>> processing pipelines, it often still has too high a barrier for
>>>>>>> getting started and authoring simple pipelines. Even setting up the
>>>>>>> environment, installing the dependencies, and setting up the project
>>>>>>> can be an overwhelming amount of boilerplate for some (though
>>>>>>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>>>>>>> way in making this easier). At the other extreme, the Dataflow project
>>>>>>> has the notion of templates, which are pre-built Beam pipelines that
>>>>>>> can be easily launched from the command line, or even from your
>>>>>>> browser, but they are fairly restrictive, limited to pre-assembled
>>>>>>> pipelines taking a small number of parameters.
>>>>>>>
>>>>>>> The idea of creating a yaml-based description of pipelines has come up
>>>>>>> several times in several contexts, and this last week I decided to
>>>>>>> code up what it could look like. Here's a proposal.
>>>>>>>
>>>>>>> pipeline:
>>>>>>>   - type: chain
>>>>>>>     transforms:
>>>>>>>       - type: ReadFromText
>>>>>>>         args:
>>>>>>>           file_pattern: "wordcount.yaml"
>>>>>>>       - type: PyMap
>>>>>>>         fn: "str.lower"
>>>>>>>       - type: PyFlatMap
>>>>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>>>>>       - type: PyTransform
>>>>>>>         name: Count
>>>>>>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>>>>>       - type: PyMap
>>>>>>>         fn: str
>>>>>>>       - type: WriteToText
>>>>>>>         file_path_prefix: "counts.txt"
>>>>>>>
>>>>>>> Some more examples at
>>>>>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>>>>>
>>>>>>> A prototype (feedback welcome) can be found at
>>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>>>>>>
>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec_file [path/to/file.yaml] [other_pipeline_args]
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec [yaml_contents] [other_pipeline_args]
>>>>>>>
>>>>>>> For example, to play around with this, one could do
>>>>>>>
>>>>>>>     python -m apache_beam.yaml.main \
>>>>>>>         --pipeline_spec "$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>         --runner=apache_beam.runners.render.RenderRunner \
>>>>>>>         --render_out=out.png
>>>>>>>
>>>>>>> Alternatively, one can run it as a docker container with no need to
>>>>>>> install any SDK:
>>>>>>>
>>>>>>>     docker run --rm \
>>>>>>>         --entrypoint /usr/local/bin/python \
>>>>>>>         gcr.io/apache-beam-testing/yaml_template:dev /dataflow/template/main.py \
>>>>>>>         --pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
>>>>>>>
>>>>>>> Though of course one would have to set up the appropriate mount points
>>>>>>> to do any local filesystem I/O and/or supply credentials.
>>>>>>>
>>>>>>> This is also available as a Dataflow template and can be invoked as
>>>>>>>
>>>>>>>     gcloud dataflow flex-template run \
>>>>>>>         "yaml-template-job" \
>>>>>>>         --template-file-gcs-location gs://apache-beam-testing-robertwb/yaml_template.json \
>>>>>>>         --parameters ^~^pipeline_spec="$(curl https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)" \
>>>>>>>         --parameters pickle_library=cloudpickle \
>>>>>>>         --project=apache-beam-testing \
>>>>>>>         --region us-central1
>>>>>>>
>>>>>>> (Note the escaping required for the parameter (use cat for a local
>>>>>>> file), and the debug cycle here could be greatly improved, so I'd
>>>>>>> recommend trying things locally first.)
>>>>>>>
>>>>>>> A key point of this implementation is that it heavily uses the
>>>>>>> expansion service and cross language transforms, tying into the
>>>>>>> proposal at https://s.apache.org/easy-multi-language . Though all the
>>>>>>> examples use transforms defined in the Beam SDK, any appropriately
>>>>>>> packaged libraries may be used.
>>>>>>>
>>>>>>> There are many ways this could be extended. For example
>>>>>>>
>>>>>>> * It would be useful to be able to templatize yaml descriptions. This
>>>>>>> could be done with $SIGIL type notation or some other way (see the
>>>>>>> sketch after this list). This would even allow one to define reusable,
>>>>>>> parameterized composite PTransform types in yaml itself.
>>>>>>>
>>>>>>> * It would be good to have a more principled way of merging
>>>>>>> environments. Currently each set of dependencies is a unique Beam
>>>>>>> environment, and while Beam has sophisticated cross-language
>>>>>>> capabilities, it would be nice if environments sharing the same
>>>>>>> language (and likely also the same Beam version) could be fused
>>>>>>> in-process (e.g. with separate class loaders or compatibility checks
>>>>>>> for packages).
>>>>>>>
>>>>>>> * Publishing and discovery of transformations could be improved,
>>>>>>> possibly via shared standards and some kind of a transform catalog. An
>>>>>>> ecosystem of easily sharable transforms (similar to what huggingface
>>>>>>> provides for ML models) could provide a useful platform for making it
>>>>>>> easy to build pipelines and open up Beam to a whole new set of users.
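>>>>>>>
>>>>>>> (A minimal sketch of the templating idea from the first bullet,
>>>>>>> assuming nothing beyond the standard library; none of this is in the
>>>>>>> prototype. Python's string.Template already provides this kind of
>>>>>>> $SIGIL substitution and could simply be applied to the yaml text
>>>>>>> before parsing:
>>>>>>>
>>>>>>>     import string
>>>>>>>     import yaml
>>>>>>>
>>>>>>>     # Substitute $SIGIL placeholders, then parse the result as yaml.
>>>>>>>     template = string.Template("""
>>>>>>>     pipeline:
>>>>>>>       - type: ReadFromText
>>>>>>>         args:
>>>>>>>           file_pattern: "$INPUT"
>>>>>>>     """)
>>>>>>>     spec = yaml.safe_load(template.substitute(INPUT="wordcount.yaml"))
>>>>>>>
>>>>>>> Reusable composite PTransform types would of course need proper
>>>>>>> scoping and defaults beyond what string.Template offers.)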
>>>>>>>
>>>>>>> Let me know what you think.
>>>>>>>
>>>>>>> - Robert
>>>>>>>
>>>>>>
