What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?
On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan <[email protected]> wrote:

> Sorry that was an autocorrect error. I meant to ask - what Dataflow runner
> are you using? If you are using Google Cloud Dataflow then the PubsubIO
> class is not the one doing the reading from the Pub/Sub topic. They provide
> a custom implementation at run time.
>
> Ankur Chauhan
> Sent from my iPhone
>
> On May 24, 2017, at 07:52, Josh <[email protected]> wrote:
>
> Hi Ankur,
>
> What do you mean by runner address?
> Would you be able to link me to the comment you're referring to?
>
> I am using the PubsubIO.Read class from Beam 2.0.0 as found here:
> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java
>
> Thanks,
> Josh
>
> On Wed, May 24, 2017 at 3:36 PM, Ankur Chauhan <[email protected]> wrote:
>
>> What runner address you using. Google Cloud Dataflow uses a closed-source
>> version of the Pub/Sub reader, as noted in a comment on the Read class.
>>
>> Ankur Chauhan
>> Sent from my iPhone
>>
>> On May 24, 2017, at 04:05, Josh <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'm using PubsubIO.Read to consume a Pub/Sub stream, and my job then
>> writes the data out to Bigtable. I'm currently seeing a latency of 3-5
>> seconds between the messages being published and being written to Bigtable.
>>
>> I want to try and decrease the latency to <1s if possible - does anyone
>> have any tips for doing this?
>>
>> I noticed that there is a PubsubGrpcClient
>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubGrpcClient.java
>> however the PubsubUnboundedSource is initialised with a PubsubJsonClient,
>> so the gRPC client doesn't appear to be used. Is there a way to switch to
>> the gRPC client, as perhaps that would give better performance?
>>
>> Also, I am running my job on Dataflow using autoscaling, which has only
>> allocated one n1-standard-4 instance to the job, which is running at
>> ~50% CPU. Could forcing a higher number of nodes help improve latency?
>>
>> Thanks for any advice,
>> Josh
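
For anyone following along, here is a minimal sketch (not Josh's actual job) of the kind of pipeline described above: a streaming read from Pub/Sub with PubsubIO, with the Dataflow worker count pinned rather than left to autoscaling. The project, topic, and processing step are placeholder names, and the Bigtable write is elided; it is an illustration of the setup, not a tuned solution.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class PubsubLatencySketch {
      public static void main(String[] args) {
        // Pin the worker pool instead of autoscaling, e.g. pass
        //   --numWorkers=4 --autoscalingAlgorithm=NONE
        // on the command line (standard Dataflow pipeline options).
        DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        options.setStreaming(true);

        Pipeline p = Pipeline.create(options);

        p.apply("ReadFromPubsub",
                // Placeholder topic. Note that on the Dataflow runner this read is
                // replaced at run time by Dataflow's own Pub/Sub implementation,
                // as mentioned earlier in the thread.
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
         .apply("Process", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // A real job would turn each message into Bigtable mutations here
             // (e.g. for BigtableIO.write()); that part is omitted in this sketch.
             c.output(c.element());
           }
         }));

        p.run();
      }
    }

With autoscaling left on, --maxNumWorkers only caps the pool; with --autoscalingAlgorithm=NONE, --numWorkers fixes it. Whether more workers actually reduce end-to-end latency here is a separate question, since the single n1-standard-4 is reportedly only at ~50% CPU.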
