On Tue, 19 May 2020 at 17:40, Reuven Lax <[email protected]> wrote:
>
> On Tue, May 19, 2020 at 4:14 AM Marcin Kuthan <[email protected]> wrote:
>
>> I'm looking for the Pubsub publication details on unbounded collections
>> when the Dataflow runner is used and Streaming Engine is on.
>>
>> If I understood correctly, the PubsubUnboundedSink transform is overridden
>> by an internal implementation.
>>
>> https://lists.apache.org/thread.html/26e2bfdb6eaa7319ea3cc65f9d8a0bfeb7be6a6d88f0167ebad0591d%40%3Cuser.beam.apache.org%3E
>>
>
> Only the streaming runner.
>
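For readers unfamiliar with the knobs this thread asks about: batchSize and maxBatchByteSize bound how PubsubUnboundedSink groups messages into publish requests (flush when either the message count or the byte budget would be exceeded). The following is a self-contained, illustrative sketch of that flush rule only; it is NOT Beam's actual implementation, and the class and field names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the flush rule that batchSize / maxBatchByteSize
// express in PubsubUnboundedSink. This is NOT Beam's code.
public class PublishBatcher {
  private final int maxMessages;   // "batchSize": max messages per publish request
  private final long maxBytes;     // "maxBatchByteSize": max payload bytes per request
  private final List<String> current = new ArrayList<>();
  private long currentBytes = 0;
  // Stands in for the actual Pubsub publish calls.
  final List<List<String>> published = new ArrayList<>();

  PublishBatcher(int maxMessages, long maxBytes) {
    this.maxMessages = maxMessages;
    this.maxBytes = maxBytes;
  }

  void add(String msg) {
    long size = msg.getBytes(StandardCharsets.UTF_8).length;
    // Flush first if adding this message would exceed either limit.
    if (!current.isEmpty()
        && (current.size() + 1 > maxMessages || currentBytes + size > maxBytes)) {
      flush();
    }
    current.add(msg);
    currentBytes += size;
  }

  void flush() {
    if (!current.isEmpty()) {
      published.add(new ArrayList<>(current));
      current.clear();
      currentBytes = 0;
    }
  }
}
```

As discussed below, whether these limits take effect at all depends on which sink implementation the runner actually installs.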
Indeed! After your response I checked "org.apache.beam.runners.dataflow.DataflowRunner" again, and the internal implementation is used only if streaming is on. Thanks for pointing that out.

>> Questions:
>>
>> 1. Should I expect that the parameters maxBatchByteSize and batchSize are
>> respected, or does the Dataflow internal implementation just ignore them?
>>
>
> I don't think that Dataflow pays attention to this.
>

It's very tricky to figure out where the boundary between Apache Beam and the Dataflow runner lies. Parameters like maxBatchByteSize and batchSize look like regular Apache Beam tuning knobs. Are you aware of any Dataflow documentation where I could find more information on how Dataflow handles Apache Beam calls, and which settings matter and which do not?

>> 4. How to estimate Streaming Engine costs for internal shuffling in
>> PubsubUnboundedSink, if any? The default PubsubUnboundedSink implementation
>> shuffles data before publication, but I don't know how it is done by the
>> internal implementation. And I don't need to know, as long as it does not
>> generate extra costs :)
>>
>
> The internal implementation does not add any extra cost. Dataflow charges
> for every MB read from Streaming Engine as a "shuffle" charge, and this
> includes the records read from PubSub. The external Beam implementation
> includes a full shuffle, which would be more expensive as it includes both
> a write and a read.
>

Very interesting! Please correct me if I'm wrong in my understanding:

1. The worker harness does not pull data directly from Pubsub. Data is effectively pulled by the Streaming Engine, and the worker harness reads the data from the Streaming Engine. Who is responsible for ack-ing the messages?

2. The worker harness does not publish data directly to Pubsub because the data from the previous pipeline step is already in the Streaming Engine. The data is published to Pubsub directly by the Streaming Engine.
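The pricing model Reuven describes (every MB read from Streaming Engine billed as a "shuffle" charge, with no extra write+read hop for the internal Pubsub sink) lends itself to a back-of-envelope estimate. The rate below is a placeholder, not a real quote; look up the current regional Streaming Engine price before relying on any number here.

```java
// Back-of-envelope estimate of the Streaming Engine "shuffle" charge.
// RATE_PER_GB is a HYPOTHETICAL placeholder, not an actual Dataflow price.
public class ShuffleCostEstimate {
  static final double RATE_PER_GB = 0.02; // $/GB processed -- placeholder value

  // Data read from Streaming Engine is billed as shuffle; with the internal
  // Pubsub sink there is no extra write+read, so only bytes read are charged.
  static double dailyCost(double mbPerSecond) {
    double gbPerDay = mbPerSecond * 86_400 / 1024.0; // MB/s -> GB/day
    return gbPerDay * RATE_PER_GB;
  }

  public static void main(String[] args) {
    System.out.printf("10 MB/s sustained for one day: $%.2f%n", dailyCost(10));
  }
}
```

With the external Beam sink, the full shuffle would roughly double the billed volume for that stage, since both the write and the subsequent read count.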
Thank you, Reuven, for your time and for sharing the Dataflow and Streaming Engine details.
