> Regardless, the Encode Row step would end up decoding the String and then
> re-encoding it, perhaps you're envisioning we could short-circuit this and
> access the encoded String?
Yes, exactly.  The Avro encoder/decoder does something very similar to this
for the same reason.
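
A minimal sketch of the idea, under stated assumptions: `LazyUtf8` is a
hypothetical class name used only for illustration, not an existing Beam or
Avro API, and the `decodeCount` field exists only to make the behaviour
observable. It keeps the UTF-8 bytes as read from the source, decodes them at
most once and only when the String is actually requested, and writes the
original bytes back out on encode. Any length prefix a real coder would write
is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not part of Beam or Avro. */
final class LazyUtf8 {
  // Encoded form, kept exactly as it was read from the source.
  private final byte[] utf8;
  // Decoded form, filled in on the first toString() call and then reused.
  private String decoded;
  // Instrumentation for this example only: how many times we decoded.
  int decodeCount = 0;

  LazyUtf8(byte[] utf8) {
    this.utf8 = utf8;
  }

  /** Decodes lazily and memoizes, so repeated access decodes only once. */
  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(utf8, StandardCharsets.UTF_8);
      decodeCount++;
    }
    return decoded;
  }

  /** Short-circuit encode: copies the stored bytes without decoding. */
  void encodeTo(ByteArrayOutputStream out) {
    out.write(utf8, 0, utf8.length);
  }
}
```

With something like this, a read -> convert -> encode pipeline that never
looks at the field would never pay for a decode, and a field that is read
several times would be decoded once.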

On Tue, Nov 30, 2021 at 11:26 AM Brian Hulette <bhule...@google.com> wrote:

> Sources should generally produce instances of RowWithGetters [1] which
> lazily accesses fields from some underlying object. This should at least
> avoid decoding a String for your first two steps as long as it's not
> accessed. I'm not sure if accesses are memoized though - we may be
> re-decoding if the String is accessed multiple times.
>
> Regardless, the Encode Row step would end up decoding the String and then
> re-encoding it, perhaps you're envisioning we could short-circuit this and
> access the encoded String?
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java
>
> On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax <re...@google.com> wrote:
>
>> I'm intrigued - how do you imagine doing this in RowCoder?
>>
>> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz <sniem...@twitter.com>
>> wrote:
>>
>>> A common use case we're running into with beam rows is something like:
>>> - Read data from source X
>>> - Convert to Row
>>> - Encode row (generally for xlang)
>>>
>>> In cases like this, I've noticed that we spend a significant (30%+)
>>> amount of time just decoding and re-encoding strings.
>>>
>>> Avro has a nice solution to this with its Utf8 class [1] which defers
>>> decoding the string until actually needed.  I'm curious if there's been any
>>> thought around optimizing this in beam as well?  It doesn't seem like it'd
>>> be hard to support it in the RowCoder implementation right now.
>>>
>>> [1]
>>> https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
>>>
>>
