Hi Robert,
Thank you. I usually work with the custom worker configuration options.
I will configure a custom machine type with a low number of cores and large
memory and see if that solves my problem.

Thanks so much,
—
Eila
www.orielresearch.com
https://www.meetup.com/Deep-Learning-In-Production 


Sent from my iPhone

> On Jun 2, 2021, at 2:10 PM, Robert Bradshaw <rober...@google.com> wrote:
> 
> If you want to control the total number of elements being processed
> across all workers at a time, you can do this by assigning random keys
> of the form RandomInteger() % TotalDesiredConcurrency followed by a
> GroupByKey.
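> 
> A rough, untested sketch of that idea (here 'file_paths' and
> 'process_element' stand in for your input and your per-file work):
> 
>     import random
>     import apache_beam as beam
> 
>     TOTAL_DESIRED_CONCURRENCY = 100  # tune to your quota/memory budget
> 
>     with beam.Pipeline() as p:
>         _ = (
>             p
>             | beam.Create(file_paths)
>             # A random key in [0, N) caps the number of parallel groups.
>             | beam.Map(lambda e: (random.randrange(TOTAL_DESIRED_CONCURRENCY), e))
>             | beam.GroupByKey()
>             # Each key's elements are processed within one bundle, so at
>             # most N elements' worth of work is in flight at a time.
>             | beam.FlatMap(lambda kv: (process_element(e) for e in kv[1]))
>         )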
> 
> If you want to control the number of elements being processed in
> parallel per VM, you can use the fact that Dataflow assigns one work
> item per core, so an n1-standard-4 would process 4 elements in
> parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
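> 
> For instance, in the Python SDK you could pick a smaller machine type
> via the worker options (the project/region values here are placeholders):
> 
>     from apache_beam.options.pipeline_options import PipelineOptions
> 
>     opts = PipelineOptions(
>         runner='DataflowRunner',
>         project='my-project',    # placeholder
>         region='us-central1',    # placeholder
>         # 2 vCPUs -> roughly 2 elements processed at a time per VM,
>         # with highmem memory per core.
>         worker_machine_type='n1-highmem-2',
>     )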
> 
> You could also control this explicitly by using a global (per worker)
> semaphore in your code. If you do this, you may want to precede your
> rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
> distribution. This should be much easier than trying to coordinate
> multiple parallel pipelines.
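> 
> A minimal sketch of the semaphore idea (assumes a single SDK process
> per worker; 'expensive_processing' is your heavy per-element work):
> 
>     import threading
>     import apache_beam as beam
> 
>     MAX_IN_FLIGHT = 10  # per-worker concurrency cap
> 
>     class RateLimitedDoFn(beam.DoFn):
>         # Class attribute: shared by all threads in this worker process.
>         _sem = threading.Semaphore(MAX_IN_FLIGHT)
> 
>         def process(self, element):
>             with RateLimitedDoFn._sem:
>                 result = expensive_processing(element)
>             yield result
> 
>     limited = (
>         inputs
>         | beam.Reshuffle()  # redistribute work fairly first
>         | beam.ParDo(RateLimitedDoFn())
>     )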
> 
>> On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research
>> <e...@orielresearch.org> wrote:
>> 
>> Thanks Robert.
>> I found the following explanation for the number of threads for 4 cores:
>> You have 4 CPU sockets, each CPU can have up to 12 cores, and each core can 
>> have two threads. Your max thread count is 4 CPUs x 12 cores x 2 threads per 
>> core, so 4 x 12 x 2 = 96.
>> Can I limit the threads using the pipeline options in some way? 10-20 
>> elements per worker will work for me.
>> 
>> My current practice for working around this issue is to limit the number of 
>> elements in each Dataflow pipeline (providing ~10 elements per pipeline).
>> Once around 200 elements have been processed = 20 pipelines with 10 elements 
>> each (Google does not allow more than 25 concurrent Dataflow pipelines per 
>> region), I launch the next 20 pipelines.
>> 
>> This of course loses the benefit of serverless.
>> 
>> Any idea how to work around this?
>> 
>> Best,
>> Eila
>> 
>> 
>>> On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> wrote:
>>> 
>>> Note that workers generally process one element per thread at a time. The 
>>> number of threads defaults to the number of cores of the VM that you're 
>>> using.
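>>> 
>>> There is also a worker option to cap the thread count explicitly; its
>>> exact behavior varies by SDK version, so treat this as a pointer:
>>> 
>>>     from apache_beam.options.pipeline_options import PipelineOptions
>>> 
>>>     opts = PipelineOptions(
>>>         runner='DataflowRunner',
>>>         # Caps threads (and thus concurrent elements) per worker.
>>>         number_of_worker_harness_threads=10,
>>>     )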
>>> 
>>> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> wrote:
>>>> 
>>>> What type of files are you reading? If they can be split and read by 
>>>> multiple workers this might be a good candidate for a Splittable DoFn 
>>>> (SDF).
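>>>> 
>>>> A very rough sketch of an SDF, assuming each input element is a
>>>> (path, size_in_bytes) pair and read_chunk is a helper you'd write:
>>>> 
>>>>     import apache_beam as beam
>>>>     from apache_beam.io.restriction_trackers import (
>>>>         OffsetRange, OffsetRestrictionTracker)
>>>>     from apache_beam.transforms.core import RestrictionProvider
>>>> 
>>>>     CHUNK = 64 * 1024 * 1024  # hypothetical 64 MB chunks
>>>> 
>>>>     class FileChunks(RestrictionProvider):
>>>>         def initial_restriction(self, element):
>>>>             return OffsetRange(0, element[1])
>>>> 
>>>>         def create_tracker(self, restriction):
>>>>             return OffsetRestrictionTracker(restriction)
>>>> 
>>>>         def restriction_size(self, element, restriction):
>>>>             return restriction.size()
>>>> 
>>>>     class ReadInChunks(beam.DoFn):
>>>>         def process(self, element,
>>>>                     tracker=beam.DoFn.RestrictionParam(FileChunks())):
>>>>             path, _ = element
>>>>             pos = tracker.current_restriction().start
>>>>             while tracker.try_claim(pos):
>>>>                 # The runner may split the range, letting several
>>>>                 # workers read different chunks of the same file.
>>>>                 yield read_chunk(path, pos, CHUNK)
>>>>                 pos += CHUNK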
>>>> 
>>>> Brian
>>>> 
>>>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research 
>>>> <e...@orielresearch.org> wrote:
>>>>> 
>>>>> Hi,
>>>>> I am running out of resources on the worker machines.
>>>>> The reasons are:
>>>>> 1. Every PCollection element is a reference to a LARGE file that is 
>>>>> copied onto the worker
>>>>> 2. The worker runs calculations on the copied file using a software 
>>>>> library that consumes memory / storage / compute resources
>>>>> 
>>>>> I have increased the workers' CPU count and memory size. At some point, I 
>>>>> run out of resources with this method as well.
>>>>> I am looking to limit the number of PCollection elements being processed 
>>>>> in parallel on each worker at a time.
>>>>> 
>>>>> Many thanks for any advice,
>>>>> Best wishes,
>>>>> --
>>>>> Eila
>>>>> 
>>>>> Meetup
>> 
>> 
>> 
>> --
>> Eila
>> 
>> Meetup
