Ah ok - I am using the Dataflow runner. I didn't realise about the custom
implementation being provided at runtime...

Any ideas of how to tweak my job to either lower the latency consuming from
PubSub or to lower the latency in writing to Bigtable?


On Wed, May 24, 2017 at 4:14 PM, Lukasz Cwik <lc...@google.com> wrote:

> What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?
>
> On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan <an...@malloc64.com> wrote:
>
>> Sorry that was an autocorrect error. I meant to ask - what dataflow
>> runner are you using? If you are using google cloud dataflow then the
>> PubsubIO class is not the one doing the reading from the pubsub topic. They
>> provide a custom implementation at run time.
>>
>> Ankur Chauhan
>> Sent from my iPhone
>>
>> On May 24, 2017, at 07:52, Josh <jof...@gmail.com> wrote:
>>
>> Hi Ankur,
>>
>> What do you mean by runner address?
>> Would you be able to link me to the comment you're referring to?
>>
>> I am using the PubsubIO.Read class from Beam 2.0.0 as found here:
>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/
>> io/google-cloud-platform/src/main/java/org/apache/beam/sdk/
>> io/gcp/pubsub/PubsubIO.java
>>
>> Thanks,
>> Josh
>>
>> On Wed, May 24, 2017 at 3:36 PM, Ankur Chauhan <an...@malloc64.com>
>> wrote:
>>
>>> What runner address you using. Google cloud dataflow uses a closed
>>> source version of the pubsub reader as noted in a comment on Read class.
>>>
>>> Ankur Chauhan
>>> Sent from my iPhone
>>>
>>> On May 24, 2017, at 04:05, Josh <jof...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I'm using PubsubIO.Read to consume a Pubsub stream, and my job then
>>> writes the data out to Bigtable. I'm currently seeing a latency of 3-5
>>> seconds between the messages being published and being written to Bigtable.
>>>
>>> I want to try and decrease the latency to <1s if possible - does anyone
>>> have any tips for doing this?
>>>
>>> I noticed that there is a PubsubGrpcClient
>>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/
>>> io/google-cloud-platform/src/main/java/org/apache/beam/sdk/i
>>> o/gcp/pubsub/PubsubGrpcClient.java however the PubsubUnboundedSource is
>>> initialised with a PubsubJsonClient, so the Grpc client doesn't appear to
>>> be being used. Is there a way to switch to the Grpc client - as perhaps
>>> that would give better performance?
>>>
>>> Also, I am running my job on Dataflow using autoscaling, which has only
>>> allocated one n1-standard-4 instance to the job, which is running at
>>> ~50% CPU. Could forcing a higher number of nodes help improve latency?
>>>
>>> Thanks for any advice,
>>> Josh
>>>
>>>
>>
>

Reply via email to