On Sun, May 11, 2025 at 7:35 PM Reuven Lax via dev <[email protected]>
wrote:
> My first thought is that this should go in contrib for now.
>
> BTW in the Java SDK, field access is integrated directly into ParDo. e.g.
> you can write
>
> new DoFn<> {
>   @ProcessElement
>   public void process(@FieldAccess("field1") Type1 field1,
>                       @FieldAccess("field2") Type2 field2) {
>     ...
>   }
> }
>
> It also supports selecting wildcards (e.g. @FieldAccess("top.*")).
>
BTW, how is this different than @FieldAccess("top")?
> I'm not sure how this pattern would translate into the Python SDK though.
>
It would probably look like

    def process(field1=FieldAccess("field1"), ...)

though in Python there's much less need as Row objects are not as
cumbersome to use, e.g.

    def process(row):
        # access row.field1 directly here

though it could be useful for optimizations like projection lifting.

As for the original question, the value here is being able to adapt a
DoFn<T, O> to apply to a single field of a Row? This certainly seems to
have value. I might suggest a syntax like

    schema_pcoll | DoToField(SomeDoFn(), input_field="element",
                             output_field="word")

and preserve rather than elide the input field (at least as an option).

Does this handle the full DoFn spec (e.g. bundle start/finish, WindowFn
params, state and timers, etc.)?
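
To make the preserve-versus-elide option concrete, here is a rough sketch
of the semantics in plain Python, with no Beam dependency. Everything in
it is an assumption for illustration: `do_to_field`, the `keep_input`
flag, and the use of dicts in place of Rows are hypothetical names, not an
existing Beam API, and it ignores bundle start/finish, windowing, state
and timers entirely.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

# Stand-in for a schema'd Row; a real implementation would use Beam Rows.
Record = Dict[str, Any]


def do_to_field(
    fn: Callable[[Any], Iterable[Any]],
    input_field: str,
    output_field: str,
    keep_input: bool = False,
) -> Callable[[Record], Iterator[Record]]:
    """Adapt a per-element function to read one field and write another.

    Hypothetical helper: it only models the field plumbing, not a DoFn.
    """

    def process(record: Record) -> Iterator[Record]:
        # Carry every other field through; drop the input field unless
        # the caller asked to preserve it.
        passthrough = {
            k: v for k, v in record.items() if keep_input or k != input_field
        }
        # Emit one output record per value produced from the input field.
        for result in fn(record[input_field]):
            yield {**passthrough, output_field: result}

    return process


def split_sentence(sentence: str) -> Iterable[str]:
    return sentence.split()


row = {"element": "hello world", "id": "id"}
elided = list(do_to_field(split_sentence, "element", "word")(row))
kept = list(
    do_to_field(split_sentence, "element", "word", keep_input=True)(row)
)
```

With `keep_input=False` this matches the output described in the original
example (a `word` and an `id` per record); with `keep_input=True` each
output record also carries the original `element` value.
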
> On Sat, May 10, 2025 at 3:35 AM Joey Tran <[email protected]> wrote:
>
>> Not currently
>>
>> On Sat, May 10, 2025, 12:48 AM Reuven Lax <[email protected]> wrote:
>>
>>> Does this work with nested fields? Can you specify input_field="a.b.c"?
>>>
>>> On Fri, May 9, 2025 at 7:18 PM Joey Tran <[email protected]>
>>> wrote:
>>>
>>>> Sure!
>>>>
>>>> Given a DoFn that has...
>>>>
>>>>     def process(self, sentence):
>>>>         yield from sentence.split()
>>>>
>>>>
>>>> You could use it with SchemadParDo as:
>>>>
>>>>     (p | beam.Create([pvalue.Row(element="hello world", id="id")])
>>>>        | SchemadParDo(SplitSentenceDoFn(), input_field="element",
>>>>                       output_field="word"))
>>>>
>>>> And it'd produce Row(word="hello", id="id") and Row(word="world",
>>>> id="id")
>>>>
>>>> On Fri, May 9, 2025, 9:57 PM Reuven Lax via dev <[email protected]>
>>>> wrote:
>>>>
>>>>> Can you explain a bit how SchemadParDo works?
>>>>>
>>>>> On Fri, May 9, 2025 at 4:49 PM Joey Tran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I've written a `SchemadParDo(input_field: str, output_field,
>>>>>> dofn: DoFn)` transform for more easily writing a Schemad transform given a
>>>>>> DoFn.
>>>>>>
>>>>>> Is this something worth upstreaming into the Beam Python SDK? I wrote
>>>>>> it to make it easier to convert our current set of DoFns into
>>>>>> schemad DoFns for use with the YAML SDK. Just wanted to gauge interest
>>>>>> before setting up the dev env again.
>>>>>>
>>>>>