Hi, How many keys do you have flowing through that global window? If there is a wide key space any chance you have a few very hot keys?
Cheers Rez On Thu, 21 Mar 2019, 04:04 Juan Carlos Garcia, <[email protected]> wrote: > I would recommend going to the compute engine service and check the vm > where the pipeline is working, from there you might have more insight if > you have a bottleneck on your pipeline (cpu, io, network) that is > preventing to process it faster. > > > > Maulik Gandhi <[email protected]> schrieb am Mi., 20. März 2019, 20:15: > >> How big is your bounded-source >> - 16.01 GiB total data from AVRO files. But it can be b/w 10-100s of GBs >> >> How much pressure (messages per seconds) your unbounded source is >> receiving? >> - Initially no pressure, to prime the Beam state, but later there will >> be data flowing through PubSub. >> >> I also add a parameter on mvn command, as below and could get 7 (105/15) >> worker, as per guide: >> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling >> >> mvn compile exec:java -pl <project-name> >> -Dexec.mainClass=<class-name> \ >> -Dexec.args="--runner=DataflowRunner --project=<project-name> \ >> --stagingLocation=gs://<gcs-location> \ >> --maxNumWorkers=105 \ >> --autoscalingAlgorithm=THROUGHPUT_BASED \ >> --templateLocation=gs://<gcs-location>" >> >> Even though I got 7 worker nodes, when processing GCS data (bounded >> source) and adding it to Beam state, I think the work was just being >> performed on 1 node, as it took more than 16+ hours. >> >> Can someone point me to documentation, on how to figure out how much data >> is being processed by each worker node (like reading GCS part AVRO files, >> input counts, etc), rather than just high-level count of input and output >> element from ParDo. >> >> Thanks. >> - Maulik >> >> On Wed, Mar 20, 2019 at 3:17 AM Juan Carlos Garcia <[email protected]> >> wrote: >> >>> Your auto scaling algorithm is THROUGHPUT_BASED, it will kicks in only >>> when it feels the pipeline is not able to keep it up with the incoming >>> source. How big is your bounded-source and how much pressure (messages per >>> seconds) your unbounded source is receiving? >>> >>> Maulik Gandhi <[email protected]> schrieb am Di., 19. März 2019, 21:06: >>> >>>> Hi Juan, >>>> >>>> Thanks for replying. I believe I am using correct configurations. >>>> >>>> I have posted more details with code snippet and Data Flow job template >>>> configuration on Stack Overflow post: >>>> https://stackoverflow.com/q/55242684/11226631 >>>> >>>> Thanks. >>>> - Maulik >>>> >>>> On Tue, Mar 19, 2019 at 2:53 PM Juan Carlos Garcia <[email protected]> >>>> wrote: >>>> >>>>> Hi Maulik, >>>>> >>>>> Have you submitted your job with the correct configuration to enable >>>>> autoscaling? >>>>> >>>>> --autoscalingAlgorithm= >>>>> --maxWorkers= >>>>> >>>>> I am on my phone right now and can't tell if the flags name are 100% >>>>> correct. >>>>> >>>>> >>>>> Maulik Gandhi <[email protected]> schrieb am Di., 19. März 2019, 18:13: >>>>> >>>>>> >>>>>> Maulik Gandhi <[email protected]> >>>>>> 10:19 AM (1 hour ago) >>>>>> to user >>>>>> Hi Beam Community, >>>>>> >>>>>> I am working on Beam processing pipeline, which reads data from the >>>>>> non-bounded and bounded source and want to leverage Beam state management >>>>>> in my pipeline. For putting data in Beam state, I have to transfer the >>>>>> data in key-value (eg: KV<String, Object>. As I am reading data from the >>>>>> non-bounded and bounded source, I am forced to perform Window + >>>>>> Triggering, >>>>>> before grouping data by key. I have chosen to use GlobalWindows(). >>>>>> >>>>>> I am able to kick-off the Data Flow job, which would run my Beam >>>>>> pipeline. I have noticed Data Flow would use only 1 Worker node to >>>>>> perform >>>>>> the work, and would not scale the job to use more worker nodes, thus not >>>>>> leveraging the benefit of distributed processing. >>>>>> >>>>>> I have posted the question on Stack Overflow: >>>>>> https://stackoverflow.com/questions/55242684/join-bounded-and-non-bounded-source-data-flow-job-not-scaling >>>>>> but >>>>>> reaching out on the mailing list, to get some help, or learn what I >>>>>> am missing. >>>>>> >>>>>> Any help would be appreciated. >>>>>> >>>>>> Thanks. >>>>>> - Maulik >>>>>> >>>>>
