+dev <dev@beam.apache.org> On Mon, Aug 5, 2019 at 12:49 PM Dmitry Minaev <mina...@gmail.com> wrote:
> Hi there,
>
> I'm building streaming pipelines in Beam (using the Google Dataflow runner) and using Google Pubsub as a message broker. I've made a couple of experiments with a very simple pipeline: consume events from a Pubsub subscription, add a timestamp to the message body, and emit the new event to another Pubsub topic. I'm using all the default parameters when producing and consuming messages.
>
> I've noticed pretty high latency when consuming messages from Pubsub in Dataflow. My observations show that the average duration between the event create timestamp (a simple producer that publishes events to Pubsub) and the event consume timestamp (Google Dataflow using PubsubIO) is more than 2 seconds. I've been publishing messages at different rates, e.g. 10 msg/sec, 1,000 msg/sec, and 10,000 msg/sec, and the latency never dropped below 2 seconds. Such latency looks really high. I've also tried the direct runner and it has high latency too.
>
> I've made a few other experiments with Kafka (a very small Kafka cluster) and the same kind of pipeline: consume from Kafka, add a timestamp, publish to another Kafka topic. There the latency is much lower, about 150 milliseconds on average.
>
> I suspect there is some batching in PubsubIO that makes the latency so high.
>
> My questions are: what latency should be expected in this kind of scenario? Are there any recommendations for achieving lower latency?
>
> I appreciate any help on this!
>
> Thank you,
> Dmitry.
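The measurement Dmitry describes can be sketched independently of Beam: the producer embeds a create timestamp in the message body, and the consumer subtracts it from the time it first sees the message. Here is a minimal stdlib-only Python sketch of that bookkeeping; the field name `create_ts_ms` and the helper functions are illustrative, not part of the Pubsub or Beam API:

```python
import json
import time
from statistics import mean


def make_message(payload: dict) -> bytes:
    """Producer side: embed the create timestamp in the message body."""
    payload["create_ts_ms"] = int(time.time() * 1000)  # illustrative field name
    return json.dumps(payload).encode("utf-8")


def consume_latency_ms(message: bytes, now_ms=None) -> int:
    """Consumer side: end-to-end latency = consume time minus embedded create time."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    body = json.loads(message.decode("utf-8"))
    return now_ms - body["create_ts_ms"]


# Simulate a batch of messages and report the average end-to-end latency.
msgs = [make_message({"id": i}) for i in range(100)]
latencies = [consume_latency_ms(m) for m in msgs]
print(f"avg latency: {mean(latencies):.1f} ms")
```

In the real pipeline the consumer side would run inside a `DoFn` (or use the broker's own publish timestamp), but the arithmetic is the same: the reported 2-second figure is the mean of exactly this difference across many messages.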