If you want to control the total number of elements being processed
across all workers at a time, you can do so by assigning each element a
random key of the form RandomInteger() % TotalDesiredConcurrency and
following that with a GroupByKey.
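
For example, in the Python SDK this could look something like the
sketch below (the concurrency value and process_element are
placeholders for your own settings and per-element logic):

    import random
    import apache_beam as beam

    TOTAL_DESIRED_CONCURRENCY = 40  # placeholder; tune for your job

    results = (
        elements  # your input PCollection
        | 'AssignRandomKey' >> beam.Map(
            lambda x: (random.randrange(TOTAL_DESIRED_CONCURRENCY), x))
        # Only TOTAL_DESIRED_CONCURRENCY distinct keys exist, so at most
        # that many groups can be in flight at once after the shuffle.
        | 'LimitConcurrency' >> beam.GroupByKey()
        | 'Process' >> beam.FlatMap(
            lambda kv: map(process_element, kv[1])))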

If you want to control the number of elements being processed in
parallel per VM, you can use the fact that Dataflow assigns one work
item per core, so an n1-standard-4 would process 4 elements in
parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
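
If you go that route, the machine type is chosen via the standard
Dataflow pipeline options, e.g. (the project, region, and bucket below
are placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder
        '--region=us-central1',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        # 2 vCPUs -> roughly 2 elements processed in parallel per VM.
        '--machine_type=n1-highmem-2',
    ])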

You could also control this explicitly by using a global (per worker)
semaphore in your code. If you do this, you may want to precede your
rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
distribution. This should be much easier than trying to coordinate
multiple parallel pipelines.
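
A rough sketch of the semaphore approach in the Python SDK (the limit
of 10 and expensive_process are placeholders; note that a module-level
semaphore is global per worker process, and the Python SDK may run
more than one process per VM):

    import threading
    import apache_beam as beam

    # Global (per worker process) cap on in-flight elements.
    _SLOTS = threading.BoundedSemaphore(10)  # placeholder limit

    class RateLimitedDoFn(beam.DoFn):
      def process(self, element):
        with _SLOTS:  # blocks while 10 elements are already in flight
          result = expensive_process(element)  # your heavy work
        yield result

    results = (
        inputs
        | 'Redistribute' >> beam.Reshuffle()  # fair, dynamic distribution
        | 'Process' >> beam.ParDo(RateLimitedDoFn()))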

On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research
<e...@orielresearch.org> wrote:
>
> Thanks Robert.
> I found the following explanation for the number of threads for 4 cores:
> You have 4 CPU sockets; each CPU can have up to 12 cores, and each core can 
> have two threads. Your max thread count is 4 CPUs x 12 cores x 2 threads per 
> core, so 4 x 12 x 2 = 96.
> Can I limit the threads using the pipeline options in some way? 10-20 
> elements per worker will work for me.
>
> My current practice to work around this issue is to limit the number of 
> elements in each Dataflow pipeline (providing ~10 elements per pipeline).
> Once I have completed processing around 200 elements = 20 pipelines (Google 
> does not allow more than 25 Dataflow pipelines per region) with 10 elements 
> each, I launch the next 20 pipelines.
>
> This, of course, misses the benefit of serverless.
>
> Any idea, how to work around this?
>
> Best,
> Eila
>
>
> On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>> Note that workers generally process one element per thread at a time. The 
>> number of threads defaults to the number of cores of the VM that you're 
>> using.
>>
>> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> wrote:
>>>
>>> What type of files are you reading? If they can be split and read by 
>>> multiple workers, this might be a good candidate for a Splittable DoFn (SDF).
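>>>
>>> A minimal SDF sketch in the Python SDK, assuming each element is a
>>> file path and read_next_record is a placeholder for your own record
>>> reader:
>>>
>>>     import os
>>>     import apache_beam as beam
>>>     from apache_beam.io.restriction_trackers import (
>>>         OffsetRange, OffsetRestrictionTracker)
>>>
>>>     class FileRestrictionProvider(
>>>         beam.transforms.core.RestrictionProvider):
>>>       def initial_restriction(self, file_name):
>>>         # One restriction spanning the whole file; the runner may
>>>         # split it so several workers read pieces in parallel.
>>>         return OffsetRange(0, os.path.getsize(file_name))
>>>
>>>       def create_tracker(self, restriction):
>>>         return OffsetRestrictionTracker(restriction)
>>>
>>>       def restriction_size(self, file_name, restriction):
>>>         return restriction.size()
>>>
>>>     class ReadFileFn(beam.DoFn):
>>>       def process(
>>>           self,
>>>           file_name,
>>>           tracker=beam.DoFn.RestrictionParam(FileRestrictionProvider())):
>>>         with open(file_name, 'rb') as f:
>>>           f.seek(tracker.current_restriction().start)
>>>           while tracker.try_claim(f.tell()):
>>>             yield read_next_record(f)  # placeholder record reader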
>>>
>>> Brian
>>>
>>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research 
>>> <e...@orielresearch.org> wrote:
>>>>
>>>> Hi,
>>>> I am running out of resources on the worker machines.
>>>> The reasons are:
>>>> 1. Every PCollection element is a reference to a LARGE file that is 
>>>> copied into the worker
>>>> 2. The worker runs calculations on the copied file using a software 
>>>> library that consumes memory / storage / compute resources
>>>>
>>>> I have changed the workers' CPU count and memory size. At some point I 
>>>> run out of resources with this method as well.
>>>> I am looking to limit the number of PCollection elements that are being 
>>>> processed in parallel on each worker at a time.
>>>>
>>>> Many thanks for any advice,
>>>> Best wishes,
>>>> --
>>>> Eila
>>>>
>
>
>
> --
> Eila
>
