I would recommend going to the Compute Engine service and checking the VMs
where the pipeline is running; from there you might get more insight into
whether there is a bottleneck in your pipeline (CPU, IO, network) that is
preventing it from processing faster.



Maulik Gandhi <[email protected]> wrote on Wed., Mar 20, 2019, 20:15:

> How big is your bounded source?
> - 16.01 GiB of total data from AVRO files, but it can be anywhere between 10 and 100s of GBs.
>
> How much pressure (messages per second) is your unbounded source
> receiving?
> - Initially no pressure, while priming the Beam state, but later there will be
> data flowing through PubSub.
>
> I also added a parameter to the mvn command, as below, and could get 7 (105/15)
> workers, per the guide:
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling
>
> mvn compile exec:java -pl <project-name> \
> -Dexec.mainClass=<class-name> \
> -Dexec.args="--runner=DataflowRunner --project=<project-name> \
>              --stagingLocation=gs://<gcs-location> \
>              --maxNumWorkers=105 \
>              --autoscalingAlgorithm=THROUGHPUT_BASED \
>              --templateLocation=gs://<gcs-location>"
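>
> For reference, the two autoscaling-related settings can also be set in code;
> the snippet below is only a rough sketch (the class name is made up, and the
> option/enum names come from the Dataflow runner's pipeline options, so they
> are worth double-checking against the Beam SDK version in use):
>
> import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
> import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>
> public class OptionsSketch {
>   public static void main(String[] args) {
>     // Parse --runner, --project, --stagingLocation, etc. from the command line.
>     DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
>         .withValidation()
>         .as(DataflowPipelineOptions.class);
>
>     // Equivalent to --maxNumWorkers=105 and --autoscalingAlgorithm=THROUGHPUT_BASED.
>     options.setMaxNumWorkers(105);
>     options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
>
>     // ... build and run the pipeline with these options ...
>   }
> }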
>
> Even though I got 7 worker nodes, when processing the GCS data (bounded
> source) and adding it to Beam state, I think the work was being performed on
> just 1 node, as it took 16+ hours.
>
> Can someone point me to documentation on how to figure out how much data
> is being processed by each worker node (like which GCS part AVRO files it
> reads, input counts, etc.), rather than just the high-level counts of input
> and output elements from a ParDo?
>
> Thanks.
> - Maulik
>
> On Wed, Mar 20, 2019 at 3:17 AM Juan Carlos Garcia <[email protected]>
> wrote:
>
>> Your autoscaling algorithm is THROUGHPUT_BASED; it will kick in only
>> when it detects that the pipeline is not able to keep up with the incoming
>> source. How big is your bounded source, and how much pressure (messages per
>> second) is your unbounded source receiving?
>>
>> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 21:06:
>>
>>> Hi Juan,
>>>
>>> Thanks for replying.  I believe I am using the correct configuration.
>>>
>>> I have posted more details, with a code snippet and the Dataflow job template
>>> configuration, in a Stack Overflow post:
>>> https://stackoverflow.com/q/55242684/11226631
>>>
>>> Thanks.
>>> - Maulik
>>>
>>> On Tue, Mar 19, 2019 at 2:53 PM Juan Carlos Garcia <[email protected]>
>>> wrote:
>>>
>>>> Hi Maulik,
>>>>
>>>> Have you submitted your job with the correct configuration to enable
>>>> autoscaling?
>>>>
>>>> --autoscalingAlgorithm=
>>>> --maxWorkers=
>>>>
>>>> I am on my phone right now and can't tell if the flag names are 100%
>>>> correct.
>>>>
>>>>
>>>> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 18:13:
>>>>
>>>>> Hi Beam Community,
>>>>>
>>>>> I am working on a Beam processing pipeline which reads data from an
>>>>> unbounded and a bounded source, and I want to leverage Beam state management
>>>>> in my pipeline.  To put data into Beam state, I have to transform the data
>>>>> into key-value pairs (e.g. KV<String, Object>).  As I am reading data from an
>>>>> unbounded and a bounded source, I am forced to apply a Window + Triggering
>>>>> before grouping data by key.  I have chosen to use GlobalWindows().
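>>>>>
>>>>> Roughly, that part of the pipeline looks like the sketch below (MyRecord and
>>>>> getId() are just placeholders for my element type and key; the windowing
>>>>> transforms are the standard Beam ones, and the processing-time trigger is
>>>>> only one possible choice):
>>>>>
>>>>> import org.apache.beam.sdk.transforms.WithKeys;
>>>>> import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
>>>>> import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
>>>>> import org.apache.beam.sdk.transforms.windowing.Repeatedly;
>>>>> import org.apache.beam.sdk.transforms.windowing.Window;
>>>>> import org.apache.beam.sdk.values.KV;
>>>>> import org.apache.beam.sdk.values.PCollection;
>>>>> import org.apache.beam.sdk.values.TypeDescriptors;
>>>>> import org.joda.time.Duration;
>>>>>
>>>>> // Key the merged (bounded + unbounded) input and put it in a global window
>>>>> // with a repeated processing-time trigger, so the grouping/stateful step
>>>>> // can still fire on the unbounded source.
>>>>> static PCollection<KV<String, MyRecord>> keyForState(PCollection<MyRecord> records) {
>>>>>   return records
>>>>>       .apply(Window.<MyRecord>into(new GlobalWindows())
>>>>>           .triggering(Repeatedly.forever(
>>>>>               AfterProcessingTime.pastFirstElementInPane()))
>>>>>           .withAllowedLateness(Duration.ZERO)
>>>>>           .discardingFiredPanes())
>>>>>       .apply(WithKeys.of((MyRecord r) -> r.getId())
>>>>>           .withKeyType(TypeDescriptors.strings()));
>>>>> }
>>>>>
>>>>> Without a non-default trigger in the global window, the downstream grouping
>>>>> would never emit output for the unbounded source, which is why the Window +
>>>>> Triggering step is needed before grouping by key.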
>>>>>
>>>>> I am able to kick off the Dataflow job, which runs my Beam pipeline.  I
>>>>> have noticed that Dataflow uses only 1 worker node to perform the work and
>>>>> does not scale the job to use more worker nodes, thus not leveraging the
>>>>> benefit of distributed processing.
>>>>>
>>>>> I have posted the question on Stack Overflow:
>>>>> https://stackoverflow.com/questions/55242684/join-bounded-and-non-bounded-source-data-flow-job-not-scaling
>>>>> but I am also reaching out on the mailing list to get some help, or to learn
>>>>> what I am missing.
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks.
>>>>> - Maulik
>>>>>
>>>>
