If you want to control the total number of elements being processed across all workers at a time, you can do this by assigning each element a random key of the form RandomInteger() % TotalDesiredConcurrency, followed by a GroupByKey.
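As a sketch in Python: the keying step could look like the function below, where TOTAL_DESIRED_CONCURRENCY is an assumed cap you'd pick yourself. The Beam wiring is shown in comments since it depends on the rest of your pipeline.

```python
import random

TOTAL_DESIRED_CONCURRENCY = 20  # assumption: at most ~20 groups in flight

def assign_random_key(element):
    # Each element gets a random key in [0, TOTAL_DESIRED_CONCURRENCY).
    # A following GroupByKey then produces at most that many groups, and
    # each group's values are processed sequentially on one worker thread,
    # capping total parallelism at roughly TOTAL_DESIRED_CONCURRENCY.
    return (random.randrange(TOTAL_DESIRED_CONCURRENCY), element)

# Hypothetical Beam wiring (process() is your per-element work):
#   (p | 'Key' >> beam.Map(assign_random_key)
#      | 'Group' >> beam.GroupByKey()
#      | 'Process' >> beam.FlatMap(lambda kv: (process(e) for e in kv[1])))
```

Note the trade-off: random keys balance load only statistically, so some groups may end up with more elements than others.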
If you want to control the number of elements being processed in parallel per VM, you can use the fact that Dataflow assigns one work item per core, so an n1-standard-4 would process 4 elements in parallel, an n1-highmem-2 would process 2 elements in parallel, etc. You could also control this explicitly by using a global (per-worker) semaphore in your code. If you do this, you may want to precede your rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work distribution. This should be much easier than trying to coordinate multiple parallel pipelines.

On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research <e...@orielresearch.org> wrote:
>
> Thanks Robert.
> I found the following explanation for the number of threads for 4 cores:
> You have 4 CPU sockets, each CPU can have, up to, 12 cores and each core can
> have two threads. Your max thread count is, 4 CPU x 12 cores x 2 threads per
> core, so 12 x 4 x 2 is 96.
> Can I limit the threads using the pipeline options in some way? 10-20
> elements per worker will work for me.
>
> My current practice to work around that issue is to limit the number of
> elements in each Dataflow pipeline (providing ~10 elements to each pipeline).
> Once I have completed around 200 elements of processing = 20 pipelines (Google
> does not allow more than 25 Dataflow pipelines per region) with 10 elements
> each, I launch the next 20 pipelines.
>
> This is of course missing the benefit of serverless.
>
> Any idea how to work around this?
>
> Best,
> Eila
>
> On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Note that workers generally process one element per thread at a time. The
>> number of threads defaults to the number of cores of the VM that you're
>> using.
>>
>> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> wrote:
>>>
>>> What type of files are you reading?
If they can be split and read by
>>> multiple workers, this might be a good candidate for a Splittable DoFn (SDF).
>>>
>>> Brian
>>>
>>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research
>>> <e...@orielresearch.org> wrote:
>>>>
>>>> Hi,
>>>> I am running out of resources on the worker machines.
>>>> The reasons are:
>>>> 1. Every PCollection element is a reference to a LARGE file that is
>>>> copied into the worker.
>>>> 2. The worker makes calculations on the copied file using a software
>>>> library that consumes memory / storage / compute resources.
>>>>
>>>> I have increased the workers' CPU and memory sizes. At some point, I
>>>> run out of resources with this method as well.
>>>> I am looking to limit the number of PCollection elements that are being
>>>> processed in parallel on each worker at a time.
>>>>
>>>> Many thanks for any advice,
>>>> Best wishes,
>>>> --
>>>> Eila
>>>>
>>>> Meetup
>
>
>
> --
> Eila
>
> Meetup
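The global (per-worker) semaphore approach suggested at the top of this reply could be sketched as below. MAX_IN_FLIGHT_PER_WORKER and expensive_work are hypothetical names, and in a real pipeline RateLimitedDoFn would subclass beam.DoFn (omitted here to keep the sketch dependency-free); the semaphore lives at module level so it is shared by all DoFn instances within one worker process.

```python
import threading

MAX_IN_FLIGHT_PER_WORKER = 10  # assumption: desired per-worker limit

# Module-level, so every DoFn instance in this worker process shares it.
_semaphore = threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_WORKER)

def expensive_work(element):
    # Placeholder for the memory-hungry per-element computation.
    return element

class RateLimitedDoFn:  # in real Beam code: class RateLimitedDoFn(beam.DoFn)
    def process(self, element):
        # Blocks when MAX_IN_FLIGHT_PER_WORKER elements are already being
        # processed on this worker; releases the slot when work finishes.
        with _semaphore:
            yield expensive_work(element)
```

Since other threads block while waiting on the semaphore, preceding this DoFn with a Reshuffle (as mentioned above) helps Dataflow rebalance work away from saturated workers.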