Hi Robert,

Thank you. I usually work with the custom worker configuration options; I will configure a machine type with a low number of cores and a large amount of memory and see if that solves my problem.
Thanks so much,
— Eila
www.orielresearch.com
https://www.meetup.com/Deep-Learning-In-Production

Sent from my iPhone

> On Jun 2, 2021, at 2:10 PM, Robert Bradshaw <rober...@google.com> wrote:
>
> If you want to control the total number of elements being processed
> across all workers at a time, you can do this by assigning random keys
> of the form RandomInteger() % TotalDesiredConcurrency followed by a
> GroupByKey.
>
> If you want to control the number of elements being processed in
> parallel per VM, you can use the fact that Dataflow assigns one work
> item per core, so an n1-standard-4 would process 4 elements in
> parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
>
> You could also control this explicitly by using a global (per-worker)
> semaphore in your code. If you do this, you may want to precede your
> rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
> distribution. This should be much easier than trying to coordinate
> multiple parallel pipelines.
>
>> On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research <e...@orielresearch.org> wrote:
>>
>> Thanks Robert.
>> I found the following explanation for the number of threads for 4 cores:
>> "You have 4 CPU sockets, each CPU can have up to 12 cores, and each core
>> can have two threads. Your max thread count is 4 CPUs x 12 cores x 2
>> threads per core, so 4 x 12 x 2 is 96."
>> Can I limit the threads using the pipeline options in some way? 10-20
>> elements per worker would work for me.
>>
>> My current practice to work around this issue is to limit the number of
>> elements in each Dataflow pipeline (providing ~10 elements to each
>> pipeline). Once I have completed around 200 elements of processing = 20
>> pipelines (Google does not allow more than 25 Dataflow pipelines per
>> region) with 10 elements each, I launch the next 20 pipelines.
>>
>> This, of course, misses the benefit of serverless.
>>
>> Any idea how to work around this?
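[Editor's note: Robert's first suggestion above (assign random keys modulo the desired concurrency, then GroupByKey) can be sketched as follows. This is a minimal illustration, not code from the thread; the helper name and the concurrency value of 16 are assumptions. In a Beam pipeline, this keying step would be a `beam.Map` followed by `beam.GroupByKey`.]

```python
import random

# Assumed target: at most 16 distinct keys, hence at most 16 groups
# being processed across all workers at a time after a GroupByKey.
TOTAL_DESIRED_CONCURRENCY = 16

def with_concurrency_key(element):
    """Pair an element with a random key in [0, TOTAL_DESIRED_CONCURRENCY)."""
    return (random.randrange(TOTAL_DESIRED_CONCURRENCY), element)

# Standalone demonstration of the keying step on 100 dummy elements.
keyed = [with_concurrency_key(e) for e in range(100)]
```

After the GroupByKey, each key's group is processed as a unit, which bounds the number of elements in flight at once.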
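[Editor's note: Robert's per-worker semaphore idea could look something like the sketch below. The names and the limit of 10 are assumptions. The key point is that the semaphore lives at module level, so it is shared by all DoFn threads within one worker process.]

```python
import threading

# Assumed per-worker limit: at most 10 elements in flight at once.
MAX_ELEMENTS_IN_FLIGHT = 10

# Module-level, hence per-process (per-worker) semaphore shared by all
# threads running this DoFn's process method on the same VM.
_work_slots = threading.BoundedSemaphore(MAX_ELEMENTS_IN_FLIGHT)

def process_element(element):
    """Process one element while holding a slot in the shared semaphore.

    Threads beyond MAX_ELEMENTS_IN_FLIGHT block here until a slot frees up.
    """
    with _work_slots:
        # Placeholder for the real memory-hungry computation.
        return element * 2
```

As Robert notes, a Reshuffle before the rate-limited step helps the runner redistribute the work that queues up behind the semaphore.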
>>
>> Best,
>> Eila
>>
>>> On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>> Note that workers generally process one element per thread at a time.
>>> The number of threads defaults to the number of cores of the VM that
>>> you're using.
>>>
>>> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> wrote:
>>>>
>>>> What type of files are you reading? If they can be split and read by
>>>> multiple workers, this might be a good candidate for a Splittable DoFn
>>>> (SDF).
>>>>
>>>> Brian
>>>>
>>>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research <e...@orielresearch.org> wrote:
>>>>>
>>>>> Hi,
>>>>> I am running out of resources on the worker machines.
>>>>> The reasons are:
>>>>> 1. Every PCollection element is a reference to a LARGE file that is
>>>>> copied to the worker.
>>>>> 2. The worker performs calculations on the copied file using a software
>>>>> library that consumes memory / storage / compute resources.
>>>>>
>>>>> I have increased the workers' CPU and memory sizes. At some point, I
>>>>> run out of resources with this method as well.
>>>>> I am looking to limit the number of PCollection elements that are
>>>>> processed in parallel on each worker at a time.
>>>>>
>>>>> Many thanks for any advice,
>>>>> Best wishes,
>>>>> --
>>>>> Eila
>>>>> Meetup
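[Editor's note: for the machine-shape approach Eila settles on at the top of the thread (few cores, large memory), the relevant Dataflow pipeline options might be assembled as below. The option names are standard Dataflow options, but the specific values are illustrative assumptions, and support for `number_of_worker_harness_threads` should be verified against your SDK version before relying on it.]

```python
# Sketch: fewer vCPUs (hence fewer parallel elements) with plenty of memory.
# "custom-2-16384" is a GCE custom machine type: 2 vCPUs and 16 GB of RAM,
# so Dataflow would process roughly 2 elements in parallel per worker.
pipeline_args = [
    "--runner=DataflowRunner",
    "--worker_machine_type=custom-2-16384",
    "--max_num_workers=20",
    # Caps worker threads directly (check SDK support for this option):
    "--number_of_worker_harness_threads=2",
]
```

These arguments would be passed to `PipelineOptions` when constructing the pipeline.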
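[Editor's note: Brian's Splittable DoFn suggestion rests on dividing each large file into independently readable byte ranges so multiple workers can each handle a piece. Below is a toy sketch of just that splitting step; the names and chunk size are assumptions, and a real SDF would express this through Beam's restriction/RestrictionTracker machinery rather than a plain helper.]

```python
def split_into_ranges(file_size, chunk_size):
    """Yield (start, end) byte ranges covering [0, file_size)."""
    start = 0
    while start < file_size:
        end = min(start + chunk_size, file_size)
        yield (start, end)
        start = end

# A 1 MB file split into ~256 KB chunks yields four ranges, each of which
# an SDF could hand to a different worker to read independently.
ranges = list(split_into_ranges(file_size=1_000_000, chunk_size=256_000))
```

This only helps if the file format itself supports reading a sub-range in isolation, which is Brian's "if they can be split" caveat.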