I would recommend going to the Compute Engine service and checking the VMs where the pipeline is running; from there you might get more insight into whether there is a bottleneck in your pipeline (CPU, IO, network) that is preventing it from processing faster.
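For example (just a rough sketch; the job name, VM name and zone below are placeholders, and Dataflow normally prefixes its worker VMs with the job name):

  # list the worker VMs that Dataflow created for the job
  gcloud compute instances list --filter="name~'<job-name>'"

  # SSH into one of the workers and look at CPU / disk / network usage
  gcloud compute ssh <worker-vm-name> --zone=<zone>
  top            # CPU and memory per process
  iostat -x 5    # disk I/O (needs the sysstat package)

The same CPU, disk and network graphs are also available without SSH in the Cloud Console, on the Monitoring tab of each VM instance.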
Maulik Gandhi <[email protected]> wrote on Wed., Mar 20, 2019, 20:15:

> How big is your bounded-source
> - 16.01 GiB of total data from AVRO files. But it can be between 10 and 100s of GBs.
>
> How much pressure (messages per second) is your unbounded source receiving?
> - Initially no pressure, to prime the Beam state, but later there will be data flowing through PubSub.
>
> I also added parameters to the mvn command, as below, and could get 7 (105/15) workers, as per the guide:
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#autoscaling
>
> mvn compile exec:java -pl <project-name> \
>   -Dexec.mainClass=<class-name> \
>   -Dexec.args="--runner=DataflowRunner --project=<project-name> \
>   --stagingLocation=gs://<gcs-location> \
>   --maxNumWorkers=105 \
>   --autoscalingAlgorithm=THROUGHPUT_BASED \
>   --templateLocation=gs://<gcs-location>"
>
> Even though I got 7 worker nodes, when processing the GCS data (bounded source) and adding it to Beam state, I think the work was being performed on just 1 node, as it took 16+ hours.
>
> Can someone point me to documentation on how to figure out how much data is being processed by each worker node (like reads of GCS part AVRO files, input counts, etc.), rather than just the high-level count of input and output elements from a ParDo?
>
> Thanks.
> - Maulik
>
> On Wed, Mar 20, 2019 at 3:17 AM Juan Carlos Garcia <[email protected]> wrote:
>
>> Your autoscaling algorithm is THROUGHPUT_BASED; it will kick in only when it feels the pipeline is not able to keep up with the incoming source. How big is your bounded-source, and how much pressure (messages per second) is your unbounded source receiving?
>>
>> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 21:06:
>>
>>> Hi Juan,
>>>
>>> Thanks for replying. I believe I am using the correct configuration.
>>>
>>> I have posted more details, with a code snippet and the Dataflow job template configuration, on the Stack Overflow post:
>>> https://stackoverflow.com/q/55242684/11226631
>>>
>>> Thanks.
>>> - Maulik
>>>
>>> On Tue, Mar 19, 2019 at 2:53 PM Juan Carlos Garcia <[email protected]> wrote:
>>>
>>>> Hi Maulik,
>>>>
>>>> Have you submitted your job with the correct configuration to enable autoscaling?
>>>>
>>>> --autoscalingAlgorithm=
>>>> --maxWorkers=
>>>>
>>>> I am on my phone right now and can't tell if the flag names are 100% correct.
>>>>
>>>> Maulik Gandhi <[email protected]> wrote on Tue., Mar 19, 2019, 18:13:
>>>>
>>>>> Hi Beam Community,
>>>>>
>>>>> I am working on a Beam processing pipeline which reads data from a non-bounded and a bounded source, and I want to leverage Beam state management in my pipeline. To put data into Beam state, I have to transform the data into key-value form (e.g. KV<String, Object>). As I am reading data from the non-bounded and bounded source, I am forced to perform Window + Triggering before grouping data by key. I have chosen to use GlobalWindows().
>>>>>
>>>>> I am able to kick off the Dataflow job, which runs my Beam pipeline. I have noticed that Dataflow uses only 1 worker node to perform the work and does not scale the job to more worker nodes, thus not leveraging the benefit of distributed processing.
>>>>>
>>>>> I have posted the question on Stack Overflow:
>>>>> https://stackoverflow.com/questions/55242684/join-bounded-and-non-bounded-source-data-flow-job-not-scaling
>>>>> but I am reaching out on the mailing list to get some help, or to learn what I am missing.
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks.
>>>>> - Maulik
>>>>>
>>>>
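Regarding the question above about how much data each worker is processing: I don't think Dataflow exposes per-worker element counts, but you can at least pull the per-step counters of a job from the command line, something like the following (a rough sketch; <job-id> is a placeholder and the exact flags may vary with your gcloud version):

  # find the job id of the running pipeline
  gcloud dataflow jobs list --status=active

  # dump the job's metrics (element counts per step, etc.)
  gcloud dataflow metrics list <job-id>

Combined with the CPU graphs of the individual workers, the per-step counts usually make it clear whether the work is really landing on a single node.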
