[GitHub] [beam] KevinGG commented on pull request #15217: [BEAM-12506] Changed WindowedValueHolder into a Row type

GitBox Tue, 27 Jul 2021 17:03:43 -0700


KevinGG commented on pull request #15217:
URL: https://github.com/apache/beam/pull/15217#issuecomment-887911629

> Looks good at a glance, just wondering about the performance. By turning
the WindowedValueHolder into a Row coder, how does this impact performance? I'm
assuming that the performance increases. Subjectively, do you see a difference?

For performance, the main difference is that I've added the `urn` field to
differentiate a decoded `WindowedValueHolder` from a `Row` created by the user
happening to contain a `windowed_value` field (which is a scenario too specific
that shouldn't happen that often). So it should take more space in encoded
bytes and cost more when encoding and decoding. The extra cost is
`constant_for_urn_field * #_of_elements`. Since it's used in test_stream, it
shouldn't affect real production jobs. If the performance regression is not
acceptable, we can always remove the `urn` field.

As for `RowCoder` and pickled python coder, I don't think there is much
difference. The `RowCoder` seems to fall back to `FastPrimitiveCoder` when the
Row is very simple. The advantage is that the `WindowedValueHolder` encoded in
a language can be decoded in another without issues. Then we can use the cache
as a medium for external transforms.

One thing I think we should re-visit is to figure out how to encode/decode
in the streaming cache.

1. The `AddElement` should only accept encoded `bytes` as the elements, see
[here](https://github.com/apache/beam/blob/7d98ad2b6554258eeccf9f7c1b017f9a001bfd87/model/pipeline/src/main/proto/beam_runner_api.proto#L662).
We should encode the elements when creating the `TestStreamFileRecord` protos.
The missing encoding is
[here](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L611).

2. Our cache uses
[SerializeToString](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L477)
to convert `TestStreamFileRecord` to bytes and then encode them with
`SafeFastPrimitvesCoder` before writing to text files. It's basically a `byte`
coder. Thus the coder we register for `labels` in the cache manager has nothing
to do with the underlying values we cache. We need to register 2 coders for
`labels`: one for the `TestStreamFileRecord`, one for the underlying value. Or
simpler, always use a default coder (ProtoCoder? 2 variants for
`TestStreamFileHeader` and `TestStreamFileRecord`) to encode the proto bytes.
The registered coder is only used when encoding/decoding the underlying values.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] KevinGG commented on pull request #15217: [BEAM-12506] Changed WindowedValueHolder into a Row type

Reply via email to