On Sun, May 11, 2025 at 7:35 PM Reuven Lax via dev <dev@beam.apache.org> wrote:
> My first thought is that this should go in contrib for now.
>
> BTW in the Java SDK, field access is integrated directly into ParDo. e.g.
> you can write
>
> new DoFn<> {
>   @ProcessElement
>   public void process(@FieldAccess("field1") Type1 field1,
>                       @FieldAccess("field2") Type2 field2) {
>     ...
>   }
> }
>
> It also supports selecting wildcards (e.g. @FieldAccess("top.*")).
>

BTW, how is this different than @FieldAccess("top")?

> I'm not sure how this pattern would translate into the Python SDK though.
>

It would probably look like

    def process(field1=FieldAccess("field1"), ...)

though in Python there's much less need as Row objects are not as
cumbersome to use, e.g.

    def process(row):
        # access row.field1 directly here

though it could be useful for optimizations like projection lifting.

As for the original question, the value here is being able to adapt a
DoFn<T, O> to apply to a single field of a Row? This certainly seems to
have value. I might suggest a syntax like

    schema_pcoll | DoToField(SomeDoFn(), input_field="element", output_field="word")

and preserve rather than elide the input field (at least as an option).

Does this handle the full DoFn spec (e.g. bundle start/finish, WindowFn
params, state and timers, etc.)?

On Sat, May 10, 2025 at 3:35 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
>> Not currently
>>
>> On Sat, May 10, 2025, 12:48 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Does this work with nested fields? Can you specify input_field="a.b.c"?
>>>
>>> On Fri, May 9, 2025 at 7:18 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Sure!
>>>>
>>>> Given a DoFn that has...
>>>>
>>>> def process(self, sentence):
>>>>     yield from sentence.split()
>>>>
>>>>
>>>> You could use it with SchemadParDo as:
>>>>
>>>> (p | beam.Create([pvalue.Row(element="hello world", id="id")])
>>>>    | SchemadParDo(SplitSentenceDoFn(), input_field="element",
>>>>                   output_field="word"))
>>>>
>>>> And it'd produce Row(word="hello", id="id") and Row(word="world",
>>>> id="id")
>>>>
>>>> On Fri, May 9, 2025, 9:57 PM Reuven Lax via dev <dev@beam.apache.org>
>>>> wrote:
>>>>
>>>>> Can you explain a bit how SchemadParDo works?
>>>>>
>>>>> On Fri, May 9, 2025 at 4:49 PM Joey Tran <joey.t...@schrodinger.com>
>>>>> wrote:
>>>>>
>>>>>> I've written a `SchemadParDo(input_field: str, output_field,
>>>>>> dofn:DoFn)` transform for more easily writing a Schemad transform given a
>>>>>> DoFn.
>>>>>>
>>>>>> Is this something worth upstreaming into the Beam Python SDK? I wrote
>>>>>> it to make it easier to convert our current set of DoFns into
>>>>>> schemad DoFns for use with the YAML SDK. Just wanted to gauge interest
>>>>>> before setting up the dev env again
>>>>>>
>>>>>
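
P.S. For concreteness, here is a rough sketch of the row-adaptation semantics discussed above, in plain Python with no Beam dependency. Dicts stand in for Rows, and `do_to_field` and `keep_input` are hypothetical names made up for illustration; this is not the SchemadParDo implementation, and it ignores bundle start/finish, windowing, state and timers.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


def do_to_field(
    process: Callable[[Any], Optional[Iterable[Any]]],
    rows: Iterable[Dict[str, Any]],
    input_field: str,
    output_field: str,
    keep_input: bool = False,  # hypothetical option: preserve the input field
) -> Iterator[Dict[str, Any]]:
    """Apply `process` to one field of each row; emit one row per output."""
    for row in rows:
        # All other fields are carried through unchanged. The input field is
        # dropped unless the caller asks to keep it.
        if keep_input:
            base = dict(row)
        else:
            base = {k: v for k, v in row.items() if k != input_field}
        # A process function may return None to emit nothing.
        for out in process(row[input_field]) or ():
            yield {**base, output_field: out}


def split_sentence(sentence: str) -> Iterator[str]:
    # Stand-in for the process method of the sentence-splitting DoFn.
    yield from sentence.split()


rows = [{"element": "hello world", "id": "id"}]
elided = list(do_to_field(split_sentence, rows, "element", "word"))
kept = list(
    do_to_field(split_sentence, rows, "element", "word", keep_input=True)
)
```

With the input field elided, the single input row fans out to one row per word, each carrying the original `id`. With `keep_input=True`, each output row additionally retains `element`.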