How big is your bounded source?
- 16.01 GiB of data in total from Avro files, but it can range between 10 GB and a few hundred GB.

How much pressure (messages per second) is your unbounded source
receiving?
- Initially no pressure, while the Beam state is being primed, but later there
will be data flowing through Pub/Sub.

I also added the parameters below to the mvn command and got 7 workers
(105/15), as described in the guide:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling

mvn compile exec:java -pl <project-name> \
-Dexec.mainClass=<class-name> \
-Dexec.args="--runner=DataflowRunner --project=<project-name> \
             --stagingLocation=gs://<gcs-location> \
             --maxNumWorkers=105 \
             --autoscalingAlgorithm=THROUGHPUT_BASED \
             --templateLocation=gs://<gcs-location>"

Even though I got 7 worker nodes, while the GCS data (bounded source) was
being read and added to Beam state, the work appeared to be performed on just
1 node, since it took more than 16 hours.

Can someone point me to documentation on how to figure out how much data
each worker node is processing (e.g. which GCS Avro part files it reads,
per-worker input counts, etc.), rather than just the high-level input and
output element counts of a ParDo?
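For context, the closest thing I have tried is attaching custom Beam metrics
inside a DoFn, roughly like the sketch below (MyRecord, its accessors, and the
metric names are just placeholders for illustration), but as far as I can tell
these counters show up per step in the Dataflow monitoring UI, aggregated
across workers, not broken down per worker:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts elements and record sizes while keying the bounded GCS input.
// MyRecord, approximateSizeBytes() and getKey() are placeholders.
class KeyAndCountFn extends DoFn<MyRecord, KV<String, MyRecord>> {
  private final Counter elementsRead =
      Metrics.counter(KeyAndCountFn.class, "gcs-elements-read");
  private final Distribution recordSizeBytes =
      Metrics.distribution(KeyAndCountFn.class, "gcs-record-size-bytes");

  @ProcessElement
  public void processElement(ProcessContext c) {
    MyRecord rec = c.element();
    elementsRead.inc();
    recordSizeBytes.update(rec.approximateSizeBytes());
    c.output(KV.of(rec.getKey(), rec));
  }
}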

Thanks.
- Maulik

On Wed, Mar 20, 2019 at 3:17 AM Juan Carlos Garcia <[email protected]>
wrote:

> Your autoscaling algorithm is THROUGHPUT_BASED; it will kick in only
> when it determines the pipeline is not able to keep up with the incoming
> source. How big is your bounded source, and how much pressure (messages per
> second) is your unbounded source receiving?
>
> Maulik Gandhi <[email protected]> wrote on Tue., Mar. 19, 2019, 21:06:
>
>> Hi Juan,
>>
>> Thanks for replying.  I believe I am using the correct configuration.
>>
>> I have posted more details, with a code snippet and the Dataflow job
>> template configuration, in a Stack Overflow post:
>> https://stackoverflow.com/q/55242684/11226631
>>
>> Thanks.
>> - Maulik
>>
>> On Tue, Mar 19, 2019 at 2:53 PM Juan Carlos Garcia <[email protected]>
>> wrote:
>>
>>> Hi Maulik,
>>>
>>> Have you submitted your job with the correct configuration to enable
>>> autoscaling?
>>>
>>> --autoscalingAlgorithm=
>>> --maxWorkers=
>>>
>>> I am on my phone right now and can't tell if the flag names are 100%
>>> correct.
>>>
>>>
>>> Maulik Gandhi <[email protected]> wrote on Tue., Mar. 19, 2019, 18:13:
>>>
>>>> Hi Beam Community,
>>>>
>>>> I am working on a Beam processing pipeline that reads data from both an
>>>> unbounded and a bounded source, and I want to leverage Beam state
>>>> management in my pipeline.  To put data into Beam state, I have to
>>>> transform the data into key-value pairs (e.g. KV<String, Object>).
>>>> Because I am reading from both an unbounded and a bounded source, I am
>>>> forced to apply windowing and triggering before grouping the data by key.
>>>> I have chosen to use GlobalWindows().
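>>>>
>>>> Just to illustrate the shape of it (the full code is in the Stack
>>>> Overflow post linked below; "keyedRecords" and MyRecord are placeholder
>>>> names here), the windowing step looks roughly like this:
>>>>
>>>> import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
>>>> import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
>>>> import org.apache.beam.sdk.transforms.windowing.Repeatedly;
>>>> import org.apache.beam.sdk.transforms.windowing.Window;
>>>> import org.apache.beam.sdk.values.KV;
>>>> import org.apache.beam.sdk.values.PCollection;
>>>> import org.joda.time.Duration;
>>>>
>>>> // GlobalWindows with a repeated processing-time trigger, applied to the
>>>> // keyed PCollection before grouping by key.
>>>> PCollection<KV<String, MyRecord>> windowed =
>>>>     keyedRecords.apply(
>>>>         Window.<KV<String, MyRecord>>into(new GlobalWindows())
>>>>             .triggering(
>>>>                 Repeatedly.forever(
>>>>                     AfterProcessingTime.pastFirstElementInPane()
>>>>                         .plusDelayOf(Duration.standardMinutes(1))))
>>>>             .discardingFiredPanes());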
>>>>
>>>> I am able to kick off the Dataflow job, which runs my Beam pipeline.
>>>> I have noticed that Dataflow uses only 1 worker node to perform the work
>>>> and does not scale the job out to more worker nodes, so it is not
>>>> leveraging the benefit of distributed processing.
>>>>
>>>> I have posted the question on Stack Overflow:
>>>> https://stackoverflow.com/questions/55242684/join-bounded-and-non-bounded-source-data-flow-job-not-scaling
>>>> but I am also reaching out on the mailing list to get some help, or to
>>>> learn what I am missing.
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks.
>>>> - Maulik
>>>>
>>>
