I'm looking for details on Pub/Sub publication from unbounded collections
when the Dataflow runner is used and Streaming Engine is on.

If I understand correctly, the PubsubUnboundedSink transform is overridden
by an internal implementation.

https://lists.apache.org/thread.html/26e2bfdb6eaa7319ea3cc65f9d8a0bfeb7be6a6d88f0167ebad0591d%40%3Cuser.beam.apache.org%3E

Questions:

1. Should I expect the maxBatchByteSize and batchSize parameters to be
respected, or does the Dataflow internal implementation just ignore them?
2. What about pubsubClientFactory? The default is PubsubJsonClientFactory,
and this matters if I want to configure maxBatchByteSize under the Pub/Sub
10MB limit. The JSON factory encodes messages using base64, so the limit
should be lowered to 10MB * 0.75 (minus some safety margin).
3. Should I expect any differences between bounded and unbounded collections?
There are different defaults in the Beam code, e.g. maxBatchByteSize is
~7.5MB for bounded collections and ~400kB for unbounded ones, while
batchSize is 100 for bounded and 1000 for unbounded. I also don't
understand the reasoning behind these defaults.
4. How can I estimate Streaming Engine costs for internal shuffling in
PubsubUnboundedSink, if any? The default PubsubUnboundedSink implementation
shuffles data before publication, but I don't know how the internal
implementation does it. And I don't need to know, as long as it does not
generate extra costs :)
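To illustrate the base64 concern from question 2, here is a small sketch of the arithmetic. Treating the 10MB request limit as 10,000,000 bytes is my assumption; the exact value may differ:

```java
public class PubsubBatchLimit {
    // Assumption: Pub/Sub's publish request limit taken as 10MB = 10^7 bytes.
    static final long PUBSUB_REQUEST_LIMIT = 10_000_000L;

    /** Bytes of base64 text produced for a raw payload of rawBytes bytes:
     *  every 3 input bytes (rounded up) become 4 output characters. */
    static long base64Size(long rawBytes) {
        return ((rawBytes + 2) / 3) * 4;
    }

    /** Largest raw payload whose base64 encoding still fits under the limit. */
    static long maxRawBytes(long limit) {
        return (limit / 4) * 3;
    }

    public static void main(String[] args) {
        long maxRaw = maxRawBytes(PUBSUB_REQUEST_LIMIT);
        System.out.println(maxRaw);                                    // 7500000
        System.out.println(base64Size(maxRaw) <= PUBSUB_REQUEST_LIMIT); // true
    }
}
```

Interestingly, the resulting ~7.5MB is close to the bounded-collection default from question 3, which might suggest that default already accounts for base64 expansion.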
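For reference, this is roughly how I set those parameters on the Beam side. A minimal fragment with a placeholder topic, assuming the PubsubIO.Write setters that map to the batchSize and maxBatchByteSize parameters:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;

// Fragment: "messages" is a PCollection<PubsubMessage> produced upstream.
messages.apply(
    PubsubIO.writeMessages()
        .to("projects/my-project/topics/my-topic") // placeholder topic
        // batchSize: max messages per publish request
        .withMaxBatchSize(1000)
        // maxBatchByteSize: kept under 10MB * 0.75 with a safety margin
        .withMaxBatchBytesSize(7_000_000));
```

These are the settings I am asking about in question 1: whether the internal Dataflow sink honors them or ignores them.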

These are many questions about Dataflow internals, but it would be nice to
know some details, especially the ones that matter from a performance and
cost perspective.

Thanks,
Marcin
