Re: A Declarative API for Apache Beam

2022-12-16 Thread Byron Ellis via dev
So, I picked dbt for the simple reason that it's what all the data
practitioners I know seem to be using and/or complaining about. Hell, I
even found articles about people moving their batch Beam jobs to dbt (but
keeping streaming), so maybe there's some overlap with the Beam community
already.

In any case, the spec was pretty clear and used stuff that we already had
in Beam, like SQL support and YAML parsing, so it was pretty easy to just
start coding and see where I ended up. It also just plain resonated with
me--I can practically hear the conversations that led to various design
choices (wish I'd thought to use the table name macro to also do the
topological sort) and I really like the focus on user quality-of-life
features.

But, yeah, Dataform does largely the same thing (and there are probably a
few others out there as well), though it seems to be pretty BigQuery-focused
these days judging from the website copy, so it might be different. Also,
YAML > JSON for this sort of thing.

Best,
B

P.S. I think I now understand how PythonTransform works, so I'm going to try
my hand at getting Python working in addition to SQL :-)

Re: A Declarative API for Apache Beam

2022-12-16 Thread Sachin Agarwal via dev
While dbt and Dataform clearly can solve some of the same problems, there
is a large dbt community that Beam could serve. If the Beam community joins
forces with the dbt community to bring dbt to ETL use cases beyond just the
data warehouses where dbt is used for ELT, that is great for everyone.

Beam is much more than just Google, and that is incredibly important as an
ASF TLP.

Re: A Declarative API for Apache Beam

2022-12-16 Thread Austin Bennett
Seems a worthwhile addition which can expand the community by making Beam
increasingly accessible to additional users and for more use cases.

A bit of a tangent, since I'm commenting on @Byron Ellis's part, but ...
ensuring some have also seen Dataform [ex:
https://cloud.google.com/dataform/docs/overview and, formerly,
https://dataform.co/]: since it's now part of the same company as you,
there are potentially additional, maybe-straightforward
conversations/lessons-learned/etc. to discuss [in addition to collabs with
the dbt community]. At times, I think of these two [dbt, Dataform] as
addressing similar things.



Re: A Declarative API for Apache Beam

2022-12-15 Thread Ahmet Altay via dev
+1 to both of these proposals. In the past 12 months I have heard of at
least 3 YAML implementations built on top of Beam in large production
systems. Unfortunately, none of those were open sourced. Having these out
of the box would be great, and they will clearly have user demand. Thank
you all!

Re: A Declarative API for Apache Beam

2022-12-15 Thread Robert Bradshaw via dev
On Thu, Dec 15, 2022 at 3:37 AM Steven van Rossum wrote:
> Highlights:
> Pipeline options are supplied under an options property.

Yep, I was thinking exactly the same:
https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51

> A pipeline is a flat set of all transforms in the pipeline.

One can certainly enumerate the transforms as a flat set, but I do
think being able to define a composite structure is nice. In addition,
the "chain" composite allows one to automatically infer the
input-output relation rather than having to spell it out (much as one
can chain multiple transforms in the various SDKs rather than having
to assign each result to an intermediate variable).
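
For example, in the syntax from my proposal, the steps of a chain are
wired up implicitly, first to last, with no explicit inputs anywhere (a
minimal sketch; the file names are placeholders):

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "in.txt"
      - type: PyMap          # implicitly consumes ReadFromText's output
        fn: "str.lower"
      - type: WriteToText    # implicitly consumes PyMap's output
        file_path_prefix: "out"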

> Transforms are defined using a YAML tag and named properties and can be used 
> by constructing a YAML reference.

That's an interesting idea. Can it be done inline as well?

> DAG construction is done using a simple topological sort of transforms and 
> their dependencies.

Same.

> Named side outputs can be referenced using a tag field.

I didn't put this in any of the examples, but I do the same. If a
transform Foo produces multiple outputs, one can (in fact must)
reference the various outputs by Foo.output1, Foo.output2, etc.

> Multiple inputs are merged with a Flatten transform.

PTransforms can have named inputs as well (they're not always
symmetric), so I let inputs be a map for those who care to distinguish them.
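
A rough sketch of how both conventions could look together (the exact
spelling of the input/output fields below is illustrative, not settled,
and the transform names are made up):

  - type: ValidateRecords    # hypothetical transform with outputs good, bad
    name: Validate
  - type: WriteToText        # consumes one named output of Validate
    input: Validate.good
    file_path_prefix: "out"
  - type: JoinTransform      # hypothetical transform with named inputs
    input:
      accepted: Validate.good
      rejected: Validate.bad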

> Not sure if there's any inspiration left to take from this, but I figured I'd 
> throw it up here to share.

Thanks. It's neat to see others coming up with the same idea, with
very similar conventions--it validates that this would be both natural
and useful.


Re: A Declarative API for Apache Beam

2022-12-15 Thread Steven van Rossum via dev
This is great! I developed a similar template a year or two ago as a
reference for a customer to speed up their development process, and
unsurprisingly it did.
Here's an example of the config layout I came up with at the time:

options:
  runner: DirectRunner

pipeline:
# - &message_source
#   label: PubSub XML source
#   transform:
#     !PTransform:apache_beam.io.ReadFromPubSub
#     subscription: projects/PROJECT/subscriptions/SUBSCRIPTION
- &message_source_1
  label: XML source 1
  transform:
    !PTransform:apache_beam.Create
    values:
    - /path/to/file.xml
- &message_source_2
  label: XML source 2
  transform:
    !PTransform:apache_beam.Create
    values:
    - /path/to/another/file.xml
- &message_xml
  label: XMLs
  inputs:
  - step: *message_source_1
  - step: *message_source_2
  transform:
    !PTransform:utils.transforms.ParseXmlDocument {}
- &validated_messages
  label: Validate XMLs
  inputs:
  - step: *message_xml
    tag: success
  transform:
    !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
    schema: /path/to/file.xsd
- &converted_messages
  label: Convert XMLs
  inputs:
  - step: *validated_messages
  transform:
    !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
    schema: /path/to/file.xsd
- label: Print XMLs
  inputs:
  - step: *converted_messages
  transform:
    !PTransform:utils.transforms.Print {}

Highlights:
- Pipeline options are supplied under an options property.
- A pipeline is a flat set of all transforms in the pipeline.
- Transforms are defined using a YAML tag and named properties and can be
  used by constructing a YAML reference.
- DAG construction is done using a simple topological sort of transforms
  and their dependencies.
- Named side outputs can be referenced using a tag field.
- Multiple inputs are merged with a Flatten transform.

Not sure if there's any inspiration left to take from this, but I figured
I'd throw it up here to share.

Cheers,

Steve

Re: A Declarative API for Apache Beam

2022-12-14 Thread Chamikara Jayalath via dev
+1 for these proposals and agree that these will simplify and demystify
Beam for many new users. I think when combined with the x-lang/Schema-Aware
transform binding, these might end up being adequate solutions for many
production use-cases as well (unless users need to define custom
composites, I/O connectors, etc.).

Also, thanks for providing prototype implementations with examples.

- Cham


Re: A Declarative API for Apache Beam

2022-12-14 Thread Sachin Agarwal via dev
To build on Kenn's point, if we leverage existing stuff like dbt, we get
access to a ready-made community which can help drive both adoption and
incremental innovation by bringing more folks to Beam.

Re: A Declarative API for Apache Beam

2022-12-14 Thread Robert Burke
I like the idea of a common spec for something like this so we can actually
cross-validate all the SDK behaviours. It would make testing significantly
easier.

Re: A Declarative API for Apache Beam

2022-12-14 Thread Kenneth Knowles
1. I love the idea. Back in the early days people talked about an "XML SDK"
or "JSON SDK" or "YAML SDK" and it didn't really make sense at the time.
Portability, and specifically cross-language schema transforms, gives the
right infrastructure, so this is the perfect time: unique names (URNs) for
transforms and explicit lists of the parameters they require.
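
For instance, every standard transform already has a URN along the lines of
"beam:transform:group_by_key:v1", so a declarative spec entry could in
principle be as minimal as (illustrative syntax only, not a settled format):

- urn: "beam:transform:group_by_key:v1"
  input: SomeEarlierStep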

2. I like the idea of re-using some existing thing like dbt if it is pretty
much what we were going to do anyhow. I don't think we should hold
ourselves back, and I don't think we'll gain anything in terms of
implementation. But at least it could fast-forward our design process,
because most of the decisions would simply be made for us.



Re: A Declarative API for Apache Beam

2022-12-14 Thread Byron Ellis via dev
And I guess also a PR for completeness to make it easier to find going
forward instead of my random repo: https://github.com/apache/beam/pull/24670

Re: A Declarative API for Apache Beam

2022-12-14 Thread Byron Ellis via dev
Since Robert opened that can of worms (and we happened to talk about it
yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This
would be complementary, as it doesn't really provide a way of specifying a
pipeline so much as orchestrating and packaging a complex pipeline---dbt
itself supports SQL and Python Dataframes, which both seem like reasonable
things for Beam, and it wouldn't be a stretch to include something like the
format above. Though in my head I had imagined people would tend to write
composite transforms in the SDK of their choosing that are then exposed at
this layer. I decided to go with dbt as it also provides a number of nice
"quality of life" features for its users, like documentation, validation,
environments, and so on.

I did a really quick proof-of-viability implementation here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO),
runs it through a couple of SQLTransforms and then drops it out to a logger
via a simple DoFn here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
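
For those who haven't used dbt: such a project is driven by a
dbt_project.yml, roughly along these lines (a generic dbt-style sketch,
not the actual file in that branch):

name: simple_pipeline
version: "1.0"
config-version: 2
model-paths: ["models"]   # SQL models, referenced/ordered via the ref() macro
seed-paths: ["seeds"]     # seed files, e.g. the TextIO input mentioned above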

I've also heard a rumor there might also be a textproto-based
representation floating around too :-)

Best,
B





On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev wrote:

> Hello Robert,
>
> I'm replying to say that I've been waiting for something like this ever
> since I started learning Beam and I'm grateful you are pushing this forward.
>
> Best,
>
> Damon
>
> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw wrote:
>
>> While Beam provides powerful APIs for authoring sophisticated data
>> processing pipelines, it often still has too high a barrier for
>> getting started and authoring simple pipelines. Even setting up the
>> environment, installing the dependencies, and setting up the project
>> can be an overwhelming amount of boilerplate for some (though
>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>> way in making this easier). At the other extreme, the Dataflow project
>> has the notion of templates which are pre-built Beam pipelines that
>> can be easily launched from the command line, or even from your
>> browser, but they are fairly restrictive, limited to pre-assembled
>> pipelines taking a small number of parameters.
>>
>> The idea of creating a yaml-based description of pipelines has come up
>> several times in several contexts and this last week I decided to code
>> up what it could look like. Here's a proposal.
>>
>> pipeline:
>>   - type: chain
>>     transforms:
>>       - type: ReadFromText
>>         args:
>>           file_pattern: "wordcount.yaml"
>>       - type: PyMap
>>         fn: "str.lower"
>>       - type: PyFlatMap
>>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>       - type: PyTransform
>>         name: Count
>>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>       - type: PyMap
>>         fn: str
>>       - type: WriteToText
>>         file_path_prefix: "counts.txt"
>>
>> Some more examples at
>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>
>> A prototype (feedback welcome) can be found at
>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>
>> python -m apache_beam.yaml.main --pipeline_spec_file
>> [path/to/file.yaml] [other_pipeline_args]
>>
>> or
>>
>> python -m apache_beam.yaml.main --pipeline_spec [yaml_contents]
>> [other_pipeline_args]
>>
>> For example, to play around with this one could do
>>
>> python -m apache_beam.yaml.main  \
>> --pipeline_spec "$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> \
>> --runner=apache_beam.runners.render.RenderRunner \
>> --render_out=out.png
>>
>> Alternatively one can run it as a docker container with no need to
>> install any SDK
>>
>> docker run --rm \
>> --entrypoint /usr/local/bin/python \
>> gcr.io/apache-beam-testing/yaml_template:dev
>> /dataflow/template/main.py \
>> --pipeline_spec="$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>>
>> Though of course one would have to set up the appropriate mount points
>> to do any local filesystem io and/or credentials.
>>
>> This is also available as a Dataflow template and can be invoked as
>>
>> gcloud dataflow flex-template run \
>> "yaml-template-job" \
>>  --template-file-gcs-location
>> gs://apache-beam-testing-robertwb/yaml_template.json \
>> --parameters ^~^pipeline_spec="$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> \
>> --parameters 

Re: A Declarative API for Apache Beam

2022-12-14 Thread Damon Douglas via dev
Hello Robert,

I'm replying to say that I've been waiting for something like this ever
since I started learning Beam and I'm grateful you are pushing this forward.

Best,

Damon

On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw  wrote:

> While Beam provides powerful APIs for authoring sophisticated data
> processing pipelines, it often still has too high a barrier for
> getting started and authoring simple pipelines. Even setting up the
> environment, installing the dependencies, and setting up the project
> can be an overwhelming amount of boilerplate for some (though
> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
> way in making this easier). At the other extreme, the Dataflow project
> has the notion of templates which are pre-built Beam pipelines that
> can be easily launched from the command line, or even from your
> browser, but they are fairly restrictive, limited to pre-assembled
> pipelines taking a small number of parameters.
>
> The idea of creating a yaml-based description of pipelines has come up
> several times in several contexts and this last week I decided to code
> up what it could look like. Here's a proposal.
>
> pipeline:
>   - type: chain
>     transforms:
>       - type: ReadFromText
>         args:
>           file_pattern: "wordcount.yaml"
>       - type: PyMap
>         fn: "str.lower"
>       - type: PyFlatMap
>         fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>       - type: PyTransform
>         name: Count
>         constructor: "apache_beam.transforms.combiners.Count.PerElement"
>       - type: PyMap
>         fn: str
>       - type: WriteToText
>         file_path_prefix: "counts.txt"
>
> Some more examples at
> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>
> A prototype (feedback welcome) can be found at
> https://github.com/apache/beam/pull/24667. It can be invoked as
>
> python -m apache_beam.yaml.main --pipeline_spec_file
> [path/to/file.yaml] [other_pipeline_args]
>
> or
>
> python -m apache_beam.yaml.main --pipeline_spec [yaml_contents]
> [other_pipeline_args]
>
> For example, to play around with this one could do
>
> python -m apache_beam.yaml.main  \
> --pipeline_spec "$(curl
>
> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
> )"
> \
> --runner=apache_beam.runners.render.RenderRunner \
> --render_out=out.png
>
> Alternatively one can run it as a docker container with no need to
> install any SDK
>
> docker run --rm \
> --entrypoint /usr/local/bin/python \
> gcr.io/apache-beam-testing/yaml_template:dev
> /dataflow/template/main.py \
> --pipeline_spec="$(curl
>
> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
> )"
>
> Though of course one would have to set up the appropriate mount points
> to do any local filesystem io and/or credentials.
>
> This is also available as a Dataflow template and can be invoked as
>
> gcloud dataflow flex-template run \
> "yaml-template-job" \
>  --template-file-gcs-location
> gs://apache-beam-testing-robertwb/yaml_template.json \
> --parameters ^~^pipeline_spec="$(curl
>
> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
> )"
> \
> --parameters pickle_library=cloudpickle \
> --project=apache-beam-testing \
> --region us-central1
>
> (Note the escaping required for the parameter; use cat for a local
> file. The debug cycle here could be greatly improved, so I'd
> recommend trying things locally first.)
>
> A key point of this implementation is that it heavily uses the
> expansion service and cross language transforms, tying into the
> proposal at https://s.apache.org/easy-multi-language. Though all the
> examples use transforms defined in the Beam SDK, any appropriately
> packaged libraries may be used.
>
> There are many ways this could be extended. For example:
>
> * It would be useful to be able to templatize yaml descriptions. This
> could be done with $SIGIL type notation or some other way. This would
> even allow one to define reusable, parameterized composite PTransform
> types in yaml itself.
>
> * It would be good to have a more principled way of merging
> environments. Currently each set of dependencies is a unique Beam
> environment, and while Beam has sophisticated cross-language
> capabilities, it would be nice if environments sharing the same
> language (and likely also the same Beam version) could be fused
> in-process (e.g. with separate class loaders or compatibility checks
> for packages).
>
> * Publishing and discovery of transformations could be improved,
> possibly via shared standards and some kind of a transform catalog. An
> ecosystem of easily sharable transforms (similar to what huggingface

A Declarative API for Apache Beam

2022-12-14 Thread Robert Bradshaw via dev
While Beam provides powerful APIs for authoring sophisticated data
processing pipelines, it often still has too high a barrier for
getting started and authoring simple pipelines. Even setting up the
environment, installing the dependencies, and setting up the project
can be an overwhelming amount of boilerplate for some (though
https://beam.apache.org/blog/beam-starter-projects/ has gone a long
way in making this easier). At the other extreme, the Dataflow project
has the notion of templates which are pre-built Beam pipelines that
can be easily launched from the command line, or even from your
browser, but they are fairly restrictive, limited to pre-assembled
pipelines taking a small number of parameters.

The idea of creating a yaml-based description of pipelines has come up
several times in several contexts and this last week I decided to code
up what it could look like. Here's a proposal.

pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "wordcount.yaml"
      - type: PyMap
        fn: "str.lower"
      - type: PyFlatMap
        fn: "import re\nlambda line: re.findall('[a-z]+', line)"
      - type: PyTransform
        name: Count
        constructor: "apache_beam.transforms.combiners.Count.PerElement"
      - type: PyMap
        fn: str
      - type: WriteToText
        file_path_prefix: "counts.txt"
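
For comparison, the same pipeline written directly in the Python SDK
would look roughly like the following (a sketch, assuming the YAML
semantics described above, where "chain" means the transforms are
applied linearly):

import re

import apache_beam as beam
from apache_beam.transforms.combiners import Count

with beam.Pipeline() as p:
  _ = (
      p
      | beam.io.ReadFromText("wordcount.yaml")
      | beam.Map(str.lower)
      | beam.FlatMap(lambda line: re.findall("[a-z]+", line))
      | Count.PerElement()
      | beam.Map(str)
      | beam.io.WriteToText("counts.txt")
  )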

Some more examples at
https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a

A prototype (feedback welcome) can be found at
https://github.com/apache/beam/pull/24667. It can be invoked as

python -m apache_beam.yaml.main --pipeline_spec_file
[path/to/file.yaml] [other_pipeline_args]

or

python -m apache_beam.yaml.main --pipeline_spec [yaml_contents]
[other_pipeline_args]

For example, to play around with this one could do

python -m apache_beam.yaml.main  \
--pipeline_spec "$(curl
https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
\
--runner=apache_beam.runners.render.RenderRunner \
--render_out=out.png

Alternatively one can run it as a docker container with no need to
install any SDK

docker run --rm \
--entrypoint /usr/local/bin/python \
gcr.io/apache-beam-testing/yaml_template:dev
/dataflow/template/main.py \
--pipeline_spec="$(curl
https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"

Though of course one would have to set up the appropriate mount points
to do any local filesystem io and/or credentials.
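
For instance, one possible shape (illustrative only; the mount paths
are placeholders, and mounting the gcloud config directory is just one
way of passing Application Default Credentials into the container):

# Mount a working directory for the pipeline's local file I/O and the
# gcloud config for credentials; the spec is read from the host via cat.
docker run --rm \
    -v "$PWD:/workspace" \
    -v "$HOME/.config/gcloud:/root/.config/gcloud" \
    --entrypoint /usr/local/bin/python \
    gcr.io/apache-beam-testing/yaml_template:dev \
    /dataflow/template/main.py \
    --pipeline_spec="$(cat pipeline.yaml)"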

This is also available as a Dataflow template and can be invoked as

gcloud dataflow flex-template run \
"yaml-template-job" \
 --template-file-gcs-location
gs://apache-beam-testing-robertwb/yaml_template.json \
--parameters ^~^pipeline_spec="$(curl
https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml)"
\
--parameters pickle_library=cloudpickle \
--project=apache-beam-testing \
--region us-central1

(Note the escaping required for the parameter; use cat for a local
file. The debug cycle here could be greatly improved, so I'd
recommend trying things locally first.)

A key point of this implementation is that it heavily uses the
expansion service and cross language transforms, tying into the
proposal at https://s.apache.org/easy-multi-language. Though all the
examples use transforms defined in the Beam SDK, any appropriately
packaged libraries may be used.

There are many ways this could be extended. For example:

* It would be useful to be able to templatize yaml descriptions. This
could be done with $SIGIL type notation or some other way. This would
even allow one to define reusable, parameterized composite PTransform
types in yaml itself.
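
As a strawman, parameter substitution can even be approximated on the
client side today, e.g. with Python's string.Template (purely a
sketch; nothing like this exists in the prototype, and the parameter
names are made up):

from string import Template

SPEC = Template("""
pipeline:
  - type: chain
    transforms:
      - type: ReadFromText
        args:
          file_pattern: "$input"
      - type: WriteToText
        file_path_prefix: "$output"
""")

# The rendered string can be passed to apache_beam.yaml.main via
# --pipeline_spec.
print(SPEC.substitute(input="in-*.txt", output="out"))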

* It would be good to have a more principled way of merging
environments. Currently each set of dependencies is a unique Beam
environment, and while Beam has sophisticated cross-language
capabilities, it would be nice if environments sharing the same
language (and likely also the same Beam version) could be fused
in-process (e.g. with separate class loaders or compatibility checks
for packages).

* Publishing and discovery of transformations could be improved,
possibly via shared standards and some kind of a transform catalog. An
ecosystem of easily sharable transforms (similar to what huggingface
provides for ML models) could provide a useful platform for making it
easy to build pipelines and open up Beam to a whole new set of users.

Let me know what you think.

- Robert