I think I now understand this more clearly; there are a couple of things
going on.  I will try to re-explain what I am trying to achieve.

- I am reading from 2 sources:
  - Bounded (AVRO from GCS)
  - Unbounded (AVRO from PubSub)
- I want to prime the Beam pipeline state with data from GCS (the bounded
source), using UserId as the key, even while no data is flowing through
PubSub.
- Later, when data from PubSub (the unbounded source) starts flowing, I
would update/add to the Beam state, using the same UserId as the key (a
rough sketch of the pipeline shape follows below).
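
Roughly, this is the shape of the pipeline I have in mind. It is a
simplified sketch only, not my actual code: "User" stands in for my
Avro-generated record (assumed to have a getUserId() method returning a
String), and the GCS path and PubSub subscription are placeholders.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class PrimeStateSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Bounded source: AVRO files on GCS, used to prime the state.
    PCollection<KV<String, User>> fromGcs = p
        .apply("ReadGcsAvro", AvroIO.read(User.class).from("gs://<bucket>/users/*.avro"))
        .apply("KeyGcsByUserId",
            WithKeys.of((User u) -> u.getUserId()).withKeyType(TypeDescriptors.strings()));

    // Unbounded source: AVRO records over PubSub, used to update the state later.
    PCollection<KV<String, User>> fromPubsub = p
        .apply("ReadPubsubAvro", PubsubIO.readAvros(User.class)
            .fromSubscription("projects/<project>/subscriptions/<subscription>"))
        .apply("KeyPubsubByUserId",
            WithKeys.of((User u) -> u.getUserId()).withKeyType(TypeDescriptors.strings()));

    // Flatten both sources, stay in the GlobalWindow, and trigger periodically so
    // elements reach the stateful DoFn as they arrive instead of waiting forever.
    PCollection<KV<String, User>> merged = PCollectionList.of(fromGcs).and(fromPubsub)
        .apply(Flatten.pCollections())
        .apply(Window.<KV<String, User>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .setCoder(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(User.class)));

    // Stateful DoFn keyed by UserId: GCS records prime the per-key state,
    // PubSub records overwrite it here (real merge logic would replace the write).
    merged.apply("UpsertUserState", ParDo.of(new DoFn<KV<String, User>, KV<String, User>>() {
      @StateId("user")
      private final StateSpec<ValueState<User>> userSpec =
          StateSpecs.value(AvroCoder.of(User.class));

      @ProcessElement
      public void process(@Element KV<String, User> element,
                          @StateId("user") ValueState<User> state,
                          OutputReceiver<KV<String, User>> out) {
        // state.read() returns null the first time a UserId is seen; otherwise it
        // holds whatever was last written for this key (primed from GCS or PubSub).
        state.write(element.getValue());
        out.output(element);
      }
    }));

    p.run();
  }
}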

How big is your bounded source?
> 16.01 GiB of total data from AVRO files, but it can be anywhere between
> 10 and 100s of GBs.

How much pressure (messages per second) is your unbounded source
receiving?
> Initially no pressure, while priming the Beam state, but later there will
> be data flowing through PubSub.

I also added the parameters below to the mvn command, and could get 7
workers (105/15), as per the guide:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling

mvn compile exec:java -pl <project-name> \
-Dexec.mainClass=<class-name> \
-Dexec.args="--runner=DataflowRunner --project=<project-name> \
             --stagingLocation=gs://<gcs-location> \
             --maxNumWorkers=105 \
             --autoscalingAlgorithm=THROUGHPUT_BASED \
             --templateLocation=gs://<gcs-location>"
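
For completeness, the same settings can also be set in code on the options
object instead of on the command line. This is a sketch only, assuming the
standard Dataflow options interfaces; the values just mirror the flags
above.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OptionsSketch {
  public static void main(String[] args) {
    // Mirrors --runner, --project, --stagingLocation, --maxNumWorkers and
    // --autoscalingAlgorithm from the mvn command above.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("<project-name>");
    options.setStagingLocation("gs://<gcs-location>");
    options.setMaxNumWorkers(105);
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    // Pipeline.create(options) would then pick these up when the job is submitted.
  }
}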

Thanks.
- Maulik


On Wed, Mar 20, 2019 at 3:17 AM Juan Carlos Garcia <[email protected]>
wrote:

> Your autoscaling algorithm is THROUGHPUT_BASED; it will kick in only
> when it senses the pipeline is not able to keep up with the incoming
> source. How big is your bounded source, and how much pressure (messages
> per second) is your unbounded source receiving?
>
> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 21:06:
>
>> Hi Juan,
>>
>> Thanks for replying.  I believe I am using the correct configuration.
>>
>> I have posted more details, with a code snippet and the Dataflow job
>> template configuration, in a Stack Overflow post:
>> https://stackoverflow.com/q/55242684/11226631
>>
>> Thanks.
>> - Maulik
>>
>> On Tue, Mar 19, 2019 at 2:53 PM Juan Carlos Garcia <[email protected]>
>> wrote:
>>
>>> Hi Maulik,
>>>
>>> Have you submitted your job with the correct configuration to enable
>>> autoscaling?
>>>
>>> --autoscalingAlgorithm=
>>> --maxWorkers=
>>>
>>> I am on my phone right now and can't tell if the flag names are 100%
>>> correct.
>>>
>>>
>>> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 18:13:
>>>
>>>>
>>>> Hi Beam Community,
>>>>
>>>> I am working on a Beam processing pipeline which reads data from both a
>>>> bounded and an unbounded source, and I want to leverage Beam state
>>>> management in my pipeline.  To put data into Beam state, I have to
>>>> transform the data into key-value pairs (e.g. KV<String, Object>).  As I
>>>> am reading from both a bounded and an unbounded source, I am forced to
>>>> apply Window + Triggering before grouping data by key.  I have chosen to
>>>> use GlobalWindows().
>>>>
>>>> I am able to kick off the Dataflow job, which runs my Beam pipeline.  I
>>>> have noticed Dataflow uses only 1 worker node to perform the work and
>>>> does not scale the job to use more worker nodes, thus not leveraging the
>>>> benefit of distributed processing.
>>>>
>>>> I have posted the question on Stack Overflow:
>>>> https://stackoverflow.com/questions/55242684/join-bounded-and-non-bounded-source-data-flow-job-not-scaling
>>>> but I am also reaching out on the mailing list to get some help, or to
>>>> learn what I am missing.
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks.
>>>> - Maulik
>>>>
>>>
