Re: [YAML/Python] SchemadParDo

Robert Bradshaw Mon, 12 May 2025 08:14:15 -0700

On Sun, May 11, 2025 at 7:35 PM Reuven Lax via dev <dev@beam.apache.org>
wrote:


> My first thought is that this should go in contrib for now.
>
> BTW in the Java SDK, field access is integrated directly into ParDo. e.g.
> you can write
>
> new DoFn<> {
>    @ProcessElement
>    public void process(@FieldAccess("field1") Type1
> field1, @FieldAccess("field2") Type2 field2) {
>       ...
>     }
> }
>
> It also supports selecting wildcards (e.g. @FieldAccess("top.*")).
>

BTW, how is this different than @FieldAccess("top")?


> I'm not sure how this pattern would translate into the Python SDK though.
>

It would probably look like

  def process(field1=FieldAccess("field1"), ...)

though in Python there's much less need as Row objects are not as
cumbersome to use, e.g.

  def process(row):
    # access row.field1 directly here

though it could be useful for optimizations like projection lifting.


As for the original question, the value here is being able to adapt a
DoFn<T, O> to apply to a single field of a Row? This certainly seems to
have value. I might suggest a syntax like

    schema_pcoll | DoToField(SomeDoFn(), input_field="element",
output_field="word")

and preserve rather than elide the input field (at least as an option).

Does this handle the full DoFn spec (e.g. bundle start/finish, WIndowFn
params, state and timers, etc.)?

On Sat, May 10, 2025 at 3:35 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
>> Not currently
>>
>> On Sat, May 10, 2025, 12:48 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Does this work with nested fields? Can you specify Input_field="a.b.c"?
>>>
>>> On Fri, May 9, 2025 at 7:18 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Sure!
>>>>
>>>> Given a DoFn that has...
>>>>
>>>> def process(self, sentence):
>>>>     yield from sentence.split()
>>>>
>>>>
>>>> You could use it with SchemadParDo as:
>>>>
>>>> (p | beam.Create([pvalue.Row(element="hello world", id="id")])
>>>> | SchemadParDo(SchemadParDo(SplitSentenceDoFn(), input_field="element",
>>>> output_field="word"))
>>>>
>>>> And it'd produce Row(word="hello", id="id") and Row(word=""world",
>>>> id="id")
>>>>
>>>> On Fri, May 9, 2025, 9:57 PM Reuven Lax via dev <dev@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Can you explain a bit how SchemadParDo works?
>>>>>
>>>>> On Fri, May 9, 2025 at 4:49 PM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> I've written a `SchemadParDo(input_field: str, output_field,
>>>>>> dofn:DoFn)` transform for more easily writing a Schemad transform given a
>>>>>> DoFn.
>>>>>>
>>>>>> Is this something worth upstreaming into the Beam Python SDK? I wrote
>>>>>> it to make it easier to convert our current set of dofn's into
>>>>>> schemad dofns for use with the YAML SDK. Just wanted to gauge interest
>>>>>> before setting up the dev env again
>>>>>>
>>>>>

Re: [YAML/Python] SchemadParDo

Reply via email to