KevinGG commented on pull request #15217:
URL: https://github.com/apache/beam/pull/15217#issuecomment-887911629


   > Looks good at a glance, just wondering about the performance. By turning 
the WindowedValueHolder into a Row coder, how does this impact performance? I'm 
assuming that the performance increases. Subjectively, do you see a difference?
   
   For performance, the main difference is that I've added the `urn` field to 
differentiate a decoded `WindowedValueHolder` from a `Row` created by the user 
happening to contain a `windowed_value` field (which is a scenario too specific 
that shouldn't happen that often). So it should take more space in encoded 
bytes and cost more when encoding and decoding. The extra cost is 
`constant_for_urn_field * #_of_elements`. Since it's used in test_stream, it 
shouldn't affect real production jobs. If the performance regression is not 
acceptable, we can always remove the `urn` field.
   
   As for `RowCoder` and pickled python coder, I don't think there is much 
difference. The `RowCoder` seems to fall back to `FastPrimitiveCoder` when the 
Row is very simple. The advantage is that the `WindowedValueHolder` encoded in 
a language can be decoded in another without issues. Then we can use the cache 
as a medium for external transforms.
   
   One thing I think we should re-visit is to figure out how to encode/decode 
in the streaming cache. 
   
   1. The `AddElement` should only accept encoded `bytes` as the elements, see 
[here](https://github.com/apache/beam/blob/7d98ad2b6554258eeccf9f7c1b017f9a001bfd87/model/pipeline/src/main/proto/beam_runner_api.proto#L662).
 We should encode the elements when creating the `TestStreamFileRecord` protos. 
The missing encoding is 
[here](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L611).
  
   2. Our cache uses 
[SerializeToString](https://github.com/apache/beam/blob/205fbb10998c9e2a1c7842ab7efd88aef80828a6/sdks/python/apache_beam/testing/test_stream.py#L477)
 to convert `TestStreamFileRecord` to bytes and then encode them with 
`SafeFastPrimitvesCoder` before writing to text files. It's basically a `byte` 
coder. Thus the coder we register for `labels` in the cache manager has nothing 
to do with the underlying values we cache. We need to register 2 coders for 
`labels`: one for the `TestStreamFileRecord`, one for the underlying value. Or 
simpler, always use a default coder (ProtoCoder? 2 variants for 
`TestStreamFileHeader` and `TestStreamFileRecord`) to encode the proto bytes. 
The registered coder is only used when encoding/decoding the underlying values.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to