> Regardless, the Encode Row step would end up decoding the String and then
> re-encoding it, perhaps you're envisioning we could short-circuit this and
> access the encoded String?

Yes, exactly. The avro encoder/decoder does something very similar to this for the same reason.
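
To make the idea concrete, here is a rough sketch of a lazily decoded string value. It is only an illustration under stated assumptions: `LazyUtf8` and its methods are hypothetical names, loosely modeled on the idea behind Avro's Utf8 class, and are not existing Beam or Avro APIs. The wrapper keeps the UTF-8 bytes read from the source, decodes them only on first access (memoizing the result), and hands the original bytes back to an encoder without ever building a String.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not a Beam or Avro class): a string kept in its UTF-8 encoded form. */
final class LazyUtf8 implements CharSequence {
  private final byte[] bytes; // encoded form, exactly as read from the source
  private String decoded;     // filled in on first access, then reused

  LazyUtf8(byte[] bytes) {
    this.bytes = bytes;
  }

  /** Returns the encoded bytes for an encoder to write; never triggers a decode. */
  byte[] encoded() {
    return bytes; // not copied, so callers must treat the array as read-only
  }

  /** True once the bytes have been decoded at least once. */
  boolean isDecoded() {
    return decoded != null;
  }

  /** Decodes on first call and memoizes, so repeated accesses do not re-decode. */
  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(bytes, StandardCharsets.UTF_8);
    }
    return decoded;
  }

  @Override
  public int length() {
    return toString().length();
  }

  @Override
  public char charAt(int index) {
    return toString().charAt(index);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    return toString().subSequence(start, end);
  }
}
```

With something like this, a read -> convert to Row -> encode pipeline that never inspects the field would only copy bytes through, and a step that does read the field pays for the decode once.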

On Tue, Nov 30, 2021 at 11:26 AM Brian Hulette <bhule...@google.com> wrote:

> Sources should generally produce instances of RowWithGetters [1] which
> lazily accesses fields from some underlying object. This should at least
> avoid decoding a String for your first two steps as long as it's not
> accessed. I'm not sure if accesses are memoized though - we may be
> re-decoding if the String is accessed multiple times.
>
> Regardless, the Encode Row step would end up decoding the String and then
> re-encoding it, perhaps you're envisioning we could short-circuit this and
> access the encoded String?
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java
>
> On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax <re...@google.com> wrote:
>
>> I'm intrigued - how do you imagine doing this in RowCoder?
>>
>> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz <sniem...@twitter.com>
>> wrote:
>>
>>> A common use case we're running into with beam rows is something like:
>>> - Read data from source X
>>> - Convert to Row
>>> - Encode row (generally for xlang)
>>>
>>> In cases like this, I've noticed that we spend a significant (30%+)
>>> amount of time just decoding and re-encoding strings.
>>>
>>> Avro has a nice solution to this with its Utf8 class [1] which defers
>>> decoding the string until actually needed. I'm curious if there's been any
>>> thought around optimizing this in beam as well? It doesn't seem like it'd
>>> be hard to support it in the RowCoder implementation right now.
>>>
>>> [1]
>>> https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
>>>
>>