I'm looking for details on Pubsub publication for unbounded collections when the Dataflow runner is used and Streaming Engine is on.
If I understand correctly, the PubsubUnboundedSink transform is overridden by an internal implementation: https://lists.apache.org/thread.html/26e2bfdb6eaa7319ea3cc65f9d8a0bfeb7be6a6d88f0167ebad0591d%40%3Cuser.beam.apache.org%3E

Questions:

1. Should I expect the maxBatchByteSize and batchSize parameters to be respected, or does the Dataflow internal implementation simply ignore them?

2. What about pubsubClientFactory? The default is PubsubJsonClientFactory, which matters if I want to keep maxBatchByteSize under the Pubsub 10MB limit: the JSON factory base64-encodes messages, so the effective limit should be lowered to 10MB * 0.75 (minus some safety margin).

3. Should I expect any differences between bounded and unbounded collections? The Beam code has different defaults, e.g. maxBatchByteSize is ~7.5MB for bounded and ~400kB for unbounded collections, and batchSize is 100 for bounded and 1000 for unbounded. I also don't understand the reasoning behind these defaults.

4. How can I estimate Streaming Engine costs for internal shuffling in PubsubUnboundedSink, if any? The default PubsubUnboundedSink implementation shuffles data before publication, but I don't know how the internal implementation does it. And I don't need to know, as long as it does not generate extra costs :)

These are many questions about Dataflow internals, but it would be nice to know the details that matter from a performance and cost perspective.

Thanks,
Marcin
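PS: To make the base64 arithmetic from question 2 concrete, here is a small sketch of how I compute the effective batch-size budget. The safety-margin value is my own guess, not anything taken from the Beam code:

```java
public class BatchLimit {
    // Pubsub hard limit: 10 MB per publish request (decimal MB).
    static final long PUBSUB_LIMIT_BYTES = 10L * 1000 * 1000;

    // The JSON client base64-encodes payloads: every 3 payload bytes become
    // 4 bytes on the wire, so only 3/4 of the request budget carries data.
    // The safety margin (for the JSON envelope, attributes, etc.) is an
    // assumed value, not a number from Beam.
    static long maxPayloadBytes(long safetyMarginBytes) {
        return (PUBSUB_LIMIT_BYTES * 3) / 4 - safetyMarginBytes;
    }

    public static void main(String[] args) {
        // With no margin this lands exactly on ~7.5MB, which matches the
        // bounded-mode maxBatchByteSize default mentioned in question 3.
        System.out.println(maxPayloadBytes(0));          // prints 7500000
        // With an assumed 256 KiB margin:
        System.out.println(maxPayloadBytes(256 * 1024)); // prints 7237856
    }
}
```

So the bounded default looks like exactly the base64-adjusted limit with no margin, which is part of why I'm unsure how much margin the internal implementation leaves, if any.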