Re: UTF-8 passthrough with beam Rows

Steve Niemitz Tue, 30 Nov 2021 08:26:17 -0800

My initial thought would be (assuming we have some Utf8-equivalent class in
Beam) to build something like a "CharSequenceUtf8Coder" and plumb that into
RowCoder's component coders.  For Utf8 instances it could simply unwrap the
contained bytes; for anything else it would fall back to the normal
StringUtf8Coder behavior.

On decode we could also have an option to return CharSequences (backed by
the Utf8 class) rather than Strings.  This would have the added benefit of
making a decode/encode round trip much cheaper for Rows that contain
strings, and that kind of round trip is pretty common.  It would be a
breaking change for existing pipelines, though, so it'd have to be opt-in.
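
To make the idea concrete, here is a rough, standalone sketch of the
passthrough.  It does not use any Beam or Avro classes: LazyUtf8, encode and
decode are hypothetical names made up for illustration, and the 4-byte
length prefix is a simplification that is not meant to match Beam's wire
format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8PassthroughSketch {

  /** Hypothetical stand-in for a Utf8-like class: keeps the raw bytes, decodes on demand. */
  public static final class LazyUtf8 implements CharSequence {
    private final byte[] bytes;
    private String decoded; // filled in on first use, null until then

    public LazyUtf8(byte[] bytes) {
      this.bytes = bytes;
    }

    public byte[] rawBytes() {
      return bytes;
    }

    public boolean isDecoded() {
      return decoded != null;
    }

    @Override
    public String toString() {
      if (decoded == null) {
        decoded = new String(bytes, StandardCharsets.UTF_8);
      }
      return decoded;
    }

    @Override
    public int length() {
      return toString().length();
    }

    @Override
    public char charAt(int index) {
      return toString().charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
      return toString().subSequence(start, end);
    }
  }

  /** Encode: unwrap the bytes of a LazyUtf8, otherwise fall back to normal UTF-8 encoding. */
  public static byte[] encode(CharSequence value) {
    byte[] bytes =
        (value instanceof LazyUtf8)
            ? ((LazyUtf8) value).rawBytes()
            : value.toString().getBytes(StandardCharsets.UTF_8);
    // Simplified framing (assumption): 4-byte length followed by the payload.
    return ByteBuffer.allocate(4 + bytes.length).putInt(bytes.length).put(bytes).array();
  }

  /** Decode (the opt-in variant): copy the payload out without building a String. */
  public static LazyUtf8 decode(byte[] encoded) {
    ByteBuffer buf = ByteBuffer.wrap(encoded);
    byte[] bytes = new byte[buf.getInt()];
    buf.get(bytes);
    return new LazyUtf8(bytes);
  }

  public static void main(String[] args) {
    byte[] wire = encode("h\u00e9llo");  // a plain String takes the fallback path
    LazyUtf8 value = decode(wire);       // no String is created here
    byte[] rewire = encode(value);       // passthrough: the bytes are copied as-is
    System.out.println(Arrays.equals(wire, rewire));
    System.out.println(value.isDecoded());
  }
}
```

In this sketch a decode followed by an encode never materializes a String;
the decoding cost is only paid if something actually reads the characters.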

On Tue, Nov 30, 2021 at 11:17 AM Reuven Lax <re...@google.com> wrote:

> I'm intrigued - how do you imagine doing this in RowCoder?
>
> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz <sniem...@twitter.com>
> wrote:
>
>> A common use case we're running into with beam rows is something like:
>> - Read data from source X
>> - Convert to Row
>> - Encode row (generally for xlang)
>>
>> In cases like this, I've noticed that we spend a significant (30%+)
>> amount of time just decoding and re-encoding strings.
>>
>> Avro has a nice solution to this with its Utf8 class [1] which defers
>> decoding the string until actually needed.  I'm curious if there's been any
>> thought around optimizing this in beam as well?  It doesn't seem like it'd
>> be hard to support it in the RowCoder implementation right now.
>>
>> [1]
>> https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
>>
>
