On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw <rober...@google.com> wrote:
> If you want to control the total number of elements being processed
> across all workers at a time, you can do this by assigning random keys
> of the form RandomInteger() % TotalDesiredConcurrency followed by a
> GroupByKey.
>
> If you want to control the number of elements being processed in
> parallel per VM, you can use the fact that Dataflow assigns one work
> item per core, so an n1-standard-4 would process 4 elements in
> parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
>
> You could also control this explicitly by using a global (per-worker)
> semaphore in your code. If you do this, you may want to precede your
> rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
> distribution. This should be much easier than trying to coordinate
> multiple parallel pipelines.

Is there a risk here of an OOM error due to a build-up of in-memory
elements from a streaming input? Or do the runners have some concept of
throttling bundles based on the progress of stages further down the
pipeline?

> On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research
> <e...@orielresearch.org> wrote:
> >
> > Thanks Robert.
> > I found the following explanation for the number of threads for 4 cores:
> > You have 4 CPU sockets, each CPU can have up to 12 cores, and each core
> > can have two threads. Your max thread count is 4 CPUs x 12 cores x 2
> > threads per core, so 4 x 12 x 2 is 96.
> > Can I limit the threads using the pipeline options in some way? 10-20
> > elements per worker will work for me.
> >
> > My current practice to work around that issue is to limit the number of
> > elements in each Dataflow pipeline (providing ~10 elements for each
> > pipeline).
> > Once I have completed around 200 elements of processing = 20 pipelines
> > (Google does not allow more than 25 Dataflow pipelines per region) with
> > 10 elements each, I am launching the next 20 pipelines.
> >
> > This is of course missing the benefit of serverless.
> > Any idea how to work around this?
> >
> > Best,
> > Eila
> >
> > On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com>
> > wrote:
> >>
> >> Note that workers generally process one element per thread at a time.
> >> The number of threads defaults to the number of cores of the VM that
> >> you're using.
> >>
> >> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com>
> >> wrote:
> >>>
> >>> What type of files are you reading? If they can be split and read by
> >>> multiple workers, this might be a good candidate for a Splittable
> >>> DoFn (SDF).
> >>>
> >>> Brian
> >>>
> >>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research
> >>> <e...@orielresearch.org> wrote:
> >>>>
> >>>> Hi,
> >>>> I am running out of resources on the worker machines.
> >>>> The reasons are:
> >>>> 1. Every PCollection element is a reference to a LARGE file that is
> >>>> copied into the worker.
> >>>> 2. The worker makes calculations on the copied file using a software
> >>>> library that consumes memory / storage / compute resources.
> >>>>
> >>>> I have changed the workers' CPUs and memory size. At some point, I
> >>>> am running out of resources with this method as well.
> >>>> I am looking to limit the number of PCollection elements that are
> >>>> being processed in parallel on each worker at a time.
> >>>>
> >>>> Many thanks for any advice,
> >>>> Best wishes,
> >>>> --
> >>>> Eila
> >>>>
> >>>> Meetup
> >
> > --
> > Eila
> >
> > Meetup

~Vincent
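[Editor's note: a minimal sketch of the random-key approach Robert describes, using only the standard library so it runs anywhere. In a pipeline the same keying would sit in a `beam.Map` followed by `beam.GroupByKey`; `TOTAL_DESIRED_CONCURRENCY` and `assign_key` are illustrative names, not Beam API.]

```python
import random

# Illustrative cap; "TotalDesiredConcurrency" in Robert's reply.
TOTAL_DESIRED_CONCURRENCY = 20

def assign_key(element):
    # RandomInteger() % TotalDesiredConcurrency, as described above.
    return (random.randrange(TOTAL_DESIRED_CONCURRENCY), element)

# Keying 1000 elements produces at most 20 distinct keys, so a
# downstream GroupByKey yields at most 20 groups, bounding how many
# element batches can be processed across all workers at a time.
keyed = [assign_key(e) for e in range(1000)]
distinct_keys = {k for k, _ in keyed}
print(len(distinct_keys) <= TOTAL_DESIRED_CONCURRENCY)  # True
```

The trade-off is that elements sharing a key are processed sequentially within their group, so total throughput is capped at the number of keys.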
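[Editor's note: the global per-worker semaphore Robert mentions might look like the sketch below. Plain threads stand in for the worker's per-core DoFn threads; `MAX_IN_FLIGHT`, `process`, and the counters are illustrative names. In a real pipeline the semaphore would live at module level in the file defining the DoFn, so every thread on a worker shares it.]

```python
import threading
import time

# Illustrative cap on concurrent elements per worker; pick a value
# that fits the memory footprint of one element's processing.
MAX_IN_FLIGHT = 4

# Module-level, so all processing threads on this worker share it.
_semaphore = threading.BoundedSemaphore(MAX_IN_FLIGHT)

_lock = threading.Lock()
_current = 0  # elements being processed right now
_peak = 0     # highest concurrency observed (for demonstration only)

def process(element):
    """Per-element work; at most MAX_IN_FLIGHT calls run at once."""
    global _current, _peak
    with _semaphore:
        with _lock:
            _current += 1
            _peak = max(_peak, _current)
        try:
            time.sleep(0.01)  # placeholder for the heavy computation
        finally:
            with _lock:
                _current -= 1

# Plain threads stand in for the worker's bundle-processing threads.
threads = [threading.Thread(target=process, args=(i,)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(_peak <= MAX_IN_FLIGHT)  # True
```

Note this only bounds concurrency per worker process; blocked threads still hold their input elements in memory, which is where the Reshuffle (for fair distribution) and Vincent's back-pressure question come in.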