While dbt and Dataform clearly can solve some of the same problems, there
is a large dbt community that Beam could serve. If the Beam community
thinks joining forces with the dbt community to bring dbt to ETL use cases
(beyond just the data warehouses where dbt is used for ELT) is worthwhile,
that is great for everyone.

Beam is much more than just Google, and that is incredibly important as an
ASF TLP.

On Fri, Dec 16, 2022 at 4:22 PM Austin Bennett <aus...@apache.org> wrote:

> Seems a worthwhile addition which can expand the community by making Beam
> increasingly accessible to additional users and more use cases.
>
> A bit of a tangent, since I'm commenting on @Byron Ellis
> <byronel...@google.com>'s part, but ... I want to make sure some have also
> seen Dataform [ ex: https://cloud.google.com/dataform/docs/overview ... and
> - formerly - https://dataform.co/ ]. Since it is now part of the same
> company as you, there are potentially additional, maybe-straightforward
> conversations/lessons-learned/etc. to be had [ in addition to collabs with
> the dbt community ].  At times, I think of these two [ dbt, Dataform ] as
> addressing similar things.
>
>
>
> On Thu, Dec 15, 2022 at 4:17 PM Ahmet Altay via dev <dev@beam.apache.org>
> wrote:
>
>> +1 to both of these proposals. In the past 12 months I have heard of at
>> least 3 YAML implementations built on top of Beam in large production
>> systems. Unfortunately, none of those were open sourced. Having these out
>> of the box would be great, and there is clearly user demand for it. Thank
>> you all!
>>
>> On Thu, Dec 15, 2022 at 10:59 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum
>>> <sjvanros...@google.com> wrote:
>>> >
>>> > This is great! I developed a similar template a year or two ago as a
>>> reference for a customer to speed up their development process and
>>> unsurprisingly it did speed up their development.
>>> > Here's an example of the config layout I came up with at the time:
>>> >
>>> > options:
>>> >   runner: DirectRunner
>>> >
>>> > pipeline:
>>> > # - &messages
>>> > #   label: PubSub XML source
>>> > #   transform:
>>> > #     !PTransform:apache_beam.io.ReadFromPubSub
>>> > #     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
>>> > - &message_source_1
>>> >   label: XML source 1
>>> >   transform:
>>> >     !PTransform:apache_beam.Create
>>> >     values:
>>> >     - /path/to/file.xml
>>> > - &message_source_2
>>> >   label: XML source 2
>>> >   transform:
>>> >     !PTransform:apache_beam.Create
>>> >     values:
>>> >     - /path/to/another/file.xml
>>> > - &message_xml
>>> >   label: XMLs
>>> >   inputs:
>>> >   - step: *message_source_1
>>> >   - step: *message_source_2
>>> >   transform:
>>> >     !PTransform:utils.transforms.ParseXmlDocument {}
>>> > - &validated_messages
>>> >   label: Validate XMLs
>>> >   inputs:
>>> >   - step: *message_xml
>>> >     tag: success
>>> >   transform:
>>> >     !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
>>> >     schema: /path/to/file.xsd
>>> > - &converted_messages
>>> >   label: Convert XMLs
>>> >   inputs:
>>> >   - step: *validated_messages
>>> >   transform:
>>> >     !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
>>> >     schema: /path/to/file.xsd
>>> > - label: Print XMLs
>>> >   inputs:
>>> >   - step: *converted_messages
>>> >   transform:
>>> >     !PTransform:utils.transforms.Print {}
>>> >
>>> > Highlights:
>>> > Pipeline options are supplied under an options property.
>>>
>>> Yep, I was thinking exactly the same:
>>>
>>> https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51
>>>
>>> > A pipeline is a flat set of all transforms in the pipeline.
>>>
>>> One can certainly enumerate the transforms as a flat set, but I do
>>> think being able to define a composite structure is nice. In addition,
>>> the "chain" composite allows one to automatically infer the
>>> input-output relation rather than having to spell it out (much as one
>>> can chain multiple transforms in the various SDKs rather than having to
>>> assign each result to an intermediate variable).
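>>>
>>> For instance, a minimal sketch of a chain composite (following the
>>> prototype's syntax; the transforms and arguments here are just
>>> illustrative):
>>>
>>> pipeline:
>>>   - type: chain
>>>     transforms:
>>>       - type: ReadFromText
>>>         args:
>>>           file_pattern: "input.txt"
>>>       - type: PyMap
>>>         fn: "str.lower"
>>>       - type: WriteToText
>>>         file_path_prefix: "out"
>>>
>>> with each transform implicitly consuming the output of the previous
>>> one, so no explicit input declarations are needed.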
>>>
>>> > Transforms are defined using a YAML tag and named properties and can
>>> be used by constructing a YAML reference.
>>>
>>> That's an interesting idea. Can it be done inline as well?
>>>
>>> > DAG construction is done using a simple topological sort of transforms
>>> and their dependencies.
>>>
>>> Same.
>>>
>>> > Named side outputs can be referenced using a tag field.
>>>
>>> I didn't put this in any of the examples, but I do the same. If a
>>> transform Foo produces multiple outputs, one can (in fact must)
>>> reference the various outputs by Foo.output1, Foo.output2, etc.
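>>>
>>> E.g., a sketch (the transform names, output names, and "input" key
>>> here are purely illustrative):
>>>
>>> - type: SomeSink
>>>   input: Foo.output1
>>> - type: AnotherSink
>>>   input: Foo.output2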
>>>
>>> > Multiple inputs are merged with a Flatten transform.
>>>
>>> PTransforms can have named inputs as well (they're not always
>>> symmetric), so I let inputs be a map for those who care to distinguish them.
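>>>
>>> E.g., something like (names purely illustrative):
>>>
>>> - type: SomeJoin
>>>   inputs:
>>>     left: ReadOrders
>>>     right: ReadCustomers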
>>>
>>> > Not sure if there's any inspiration left to take from this, but I
>>> figured I'd throw it up here to share.
>>>
>>> Thanks. It's neat to see others coming up with the same idea, with
>>> very similar conventions; it validates that this would be both natural
>>> and useful.
>>>
>>>
>>> > On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>> >>
>>> >> +1 for these proposals and agree that these will simplify and
>>> demystify Beam for many new users. I think when combined with the
>>> x-lang/Schema-Aware transform binding, these might end up being adequate
>>> solutions for many production use-cases as well (unless users need to
>>> define custom composites, I/O connectors, etc.).
>>> >>
>>> >> Also, thanks for providing prototype implementations with examples.
>>> >>
>>> >> - Cham
>>> >>
>>> >>
>>> >> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <
>>> dev@beam.apache.org> wrote:
>>> >>>
>>> >>> To build on Kenn's point, if we leverage existing stuff like dbt we
>>> get access to a ready-made community, which can help drive both adoption
>>> and incremental innovation by bringing more folks to Beam.
>>> >>>
>>> >>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles <k...@apache.org>
>>> wrote:
>>> >>>>
>>> >>>> 1. I love the idea. Back in the early days people talked about an
>>> "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at
>>> the time. Portability and specifically cross-language schema transforms
>>> gives the right infrastructure so this is the perfect time: unique names
>>> (URNs) for transforms and explicit lists of parameters they require.
>>> >>>>
>>> >>>> 2. I like the idea of re-using some existing thing like dbt if it
>>> is pretty much what we were going to do anyhow. I don't think we should
>>> hold ourselves back. I also don't think we'll gain anything in terms of
>>> implementation. But at least it could fast-forward our design process,
>>> because most of the decisions have already been made for us.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Dec 14, 2022 at 2:44 PM Byron Ellis via dev <
>>> dev@beam.apache.org> wrote:
>>> >>>>>
>>> >>>>> And I guess also a PR for completeness to make it easier to find
>>> going forward instead of my random repo:
>>> https://github.com/apache/beam/pull/24670
>>> >>>>>
>>> >>>>> On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> Since Robert opened that can of worms (and we happened to talk
>>> about it yesterday)... :-)
>>> >>>>>>
>>> >>>>>> I figured I'd also share my start on a "port" of dbt to the Beam
>>> SDK. This would be complementary, as it doesn't really provide a way of
>>> specifying a pipeline so much as a way of orchestrating and packaging a
>>> complex pipeline---dbt itself supports SQL and Python Dataframes, which
>>> both seem like reasonable things for Beam, and it wouldn't be a stretch
>>> to include something like the format above. Though in my head I had
>>> imagined people would tend to write composite transforms in the SDK of
>>> their choosing that are then exposed at this layer. I decided to go
>>> with dbt as it also provides a number of nice "quality of life"
>>> features for its users like documentation, validation, environments,
>>> and so on.
>>> >>>>>>
>>> >>>>>> I did a really quick proof-of-viability implementation here:
>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>>> >>>>>>
>>> >>>>>> And you can see a really simple pipeline that reads a seed file
>>> (TextIO), runs it through a couple of SQLTransforms and then drops it out
>>> to a logger via a simple DoFn here:
>>> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>>> >>>>>>
>>> >>>>>> I've also heard a rumor there might be a textproto-based
>>> representation floating around too :-)
>>> >>>>>>
>>> >>>>>> Best,
>>> >>>>>> B
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev <
>>> dev@beam.apache.org> wrote:
>>> >>>>>>>
>>> >>>>>>> Hello Robert,
>>> >>>>>>>
>>> >>>>>>> I'm replying to say that I've been waiting for something like
>>> this ever since I started learning Beam and I'm grateful you are pushing
>>> this forward.
>>> >>>>>>>
>>> >>>>>>> Best,
>>> >>>>>>>
>>> >>>>>>> Damon
>>> >>>>>>>
>>> >>>>>>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> While Beam provides powerful APIs for authoring sophisticated
>>> data
>>> >>>>>>>> processing pipelines, it often still has too high a barrier for
>>> >>>>>>>> getting started and authoring simple pipelines. Even setting up
>>> the
>>> >>>>>>>> environment, installing the dependencies, and setting up the
>>> project
>>> >>>>>>>> can be an overwhelming amount of boilerplate for some (though
>>> >>>>>>>> https://beam.apache.org/blog/beam-starter-projects/ has gone a
>>> long
>>> >>>>>>>> way in making this easier). At the other extreme, the Dataflow
>>> project
>>> >>>>>>>> has the notion of templates which are pre-built Beam pipelines
>>> that
>>> >>>>>>>> can be easily launched from the command line, or even from your
>>> >>>>>>>> browser, but they are fairly restrictive, limited to
>>> pre-assembled
>>> >>>>>>>> pipelines taking a small number of parameters.
>>> >>>>>>>>
>>> >>>>>>>> The idea of creating a yaml-based description of pipelines has
>>> come up
>>> >>>>>>>> several times in several contexts and this last week I decided
>>> to code
>>> >>>>>>>> up what it could look like. Here's a proposal.
>>> >>>>>>>>
>>> >>>>>>>> pipeline:
>>> >>>>>>>>   - type: chain
>>> >>>>>>>>     transforms:
>>> >>>>>>>>       - type: ReadFromText
>>> >>>>>>>>         args:
>>> >>>>>>>>           file_pattern: "wordcount.yaml"
>>> >>>>>>>>       - type: PyMap
>>> >>>>>>>>         fn: "str.lower"
>>> >>>>>>>>       - type: PyFlatMap
>>> >>>>>>>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>> >>>>>>>>       - type: PyTransform
>>> >>>>>>>>         name: Count
>>> >>>>>>>>         constructor:
>>> "apache_beam.transforms.combiners.Count.PerElement"
>>> >>>>>>>>       - type: PyMap
>>> >>>>>>>>         fn: str
>>> >>>>>>>>       - type: WriteToText
>>> >>>>>>>>         file_path_prefix: "counts.txt"
>>> >>>>>>>>
>>> >>>>>>>> Some more examples at
>>> >>>>>>>>
>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>> >>>>>>>>
>>> >>>>>>>> A prototype (feedback welcome) can be found at
>>> >>>>>>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>> >>>>>>>>
>>> >>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec_file
>>> >>>>>>>> [path/to/file.yaml] [other_pipeline_args]
>>> >>>>>>>>
>>> >>>>>>>> or
>>> >>>>>>>>
>>> >>>>>>>>     python -m apache_beam.yaml.main --pipeline_spec
>>> [yaml_contents]
>>> >>>>>>>> [other_pipeline_args]
>>> >>>>>>>>
>>> >>>>>>>> For example, to play around with this one could do
>>> >>>>>>>>
>>> >>>>>>>>     python -m apache_beam.yaml.main  \
>>> >>>>>>>>         --pipeline_spec "$(curl
>>> >>>>>>>>
>>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>>> )"
>>> >>>>>>>> \
>>> >>>>>>>>         --runner=apache_beam.runners.render.RenderRunner \
>>> >>>>>>>>         --render_out=out.png
>>> >>>>>>>>
>>> >>>>>>>> Alternatively one can run it as a docker container with no need
>>> to
>>> >>>>>>>> install any SDK
>>> >>>>>>>>
>>> >>>>>>>>     docker run --rm \
>>> >>>>>>>>         --entrypoint /usr/local/bin/python \
>>> >>>>>>>>         gcr.io/apache-beam-testing/yaml_template:dev
>>> >>>>>>>> /dataflow/template/main.py \
>>> >>>>>>>>         --pipeline_spec="$(curl
>>> >>>>>>>>
>>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>>> )"
>>> >>>>>>>>
>>> >>>>>>>> Though of course one would have to set up the appropriate mount
>>> points
>>> >>>>>>>> to do any local filesystem IO and/or provide credentials.
>>> >>>>>>>>
>>> >>>>>>>> This is also available as a Dataflow template and can be
>>> invoked as
>>> >>>>>>>>
>>> >>>>>>>>     gcloud dataflow flex-template run \
>>> >>>>>>>>         "yaml-template-job" \
>>> >>>>>>>>          --template-file-gcs-location
>>> >>>>>>>> gs://apache-beam-testing-robertwb/yaml_template.json \
>>> >>>>>>>>         --parameters ^~^pipeline_spec="$(curl
>>> >>>>>>>>
>>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>>> )"
>>> >>>>>>>> \
>>> >>>>>>>>         --parameters pickle_library=cloudpickle \
>>> >>>>>>>>         --project=apache-beam-testing \
>>> >>>>>>>>         --region us-central1
>>> >>>>>>>>
>>> >>>>>>>> (Note the escaping required for the parameter (use cat for a
>>> local
>>> >>>>>>>> file), and the debug cycle here could be greatly improved, so
>>> I'd
>>> >>>>>>>> recommend trying things locally first.)
>>> >>>>>>>>
>>> >>>>>>>> A key point of this implementation is that it heavily uses the
>>> >>>>>>>> expansion service and cross language transforms, tying into the
>>> >>>>>>>> proposal at  https://s.apache.org/easy-multi-language . Though
>>> all the
>>> >>>>>>>> examples use transforms defined in the Beam SDK, any
>>> appropriately
>>> >>>>>>>> packaged libraries may be used.
>>> >>>>>>>>
>>> >>>>>>>> There are many ways this could be extended. For example
>>> >>>>>>>>
>>> >>>>>>>> * It would be useful to be able to templatize yaml
>>> descriptions. This
>>> >>>>>>>> could be done with $SIGIL type notation or some other way. This
>>> would
>>> >>>>>>>> even allow one to define reusable, parameterized composite
>>> PTransform
>>> >>>>>>>> types in yaml itself.
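>>> >>>>>>>>
>>> >>>>>>>> A purely speculative sketch of what such a parameterized
>>> >>>>>>>> composite might look like:
>>> >>>>>>>>
>>> >>>>>>>> type: composite
>>> >>>>>>>> parameters: [file_pattern]
>>> >>>>>>>> transforms:
>>> >>>>>>>>   - type: ReadFromText
>>> >>>>>>>>     args:
>>> >>>>>>>>       file_pattern: $file_pattern
>>> >>>>>>>>
>>> >>>>>>>> with the exact sigil and escaping rules still to be decided.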
>>> >>>>>>>>
>>> >>>>>>>> * It would be good to have a more principled way of merging
>>> >>>>>>>> environments. Currently each set of dependencies is a unique
>>> Beam
>>> >>>>>>>> environment, and while Beam has sophisticated cross-language
>>> >>>>>>>> capabilities, it would be nice if environments sharing the same
>>> >>>>>>>> language (and likely also the same Beam version) could be fused
>>> >>>>>>>> in-process (e.g. with separate class loaders or compatibility
>>> checks
>>> >>>>>>>> for packages).
>>> >>>>>>>>
>>> >>>>>>>> * Publishing and discovery of transformations could be improved,
>>> >>>>>>>> possibly via shared standards and some kind of a transform
>>> catalog. An
>>> >>>>>>>> ecosystem of easily sharable transforms (similar to what
>>> huggingface
>>> >>>>>>>> provides for ML models) could provide a useful platform for
>>> making it
>>> >>>>>>>> easy to build pipelines and open up Beam to a whole new set of
>>> users.
>>> >>>>>>>>
>>> >>>>>>>> Let me know what you think.
>>> >>>>>>>>
>>> >>>>>>>> - Robert
>>>
>>
