On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw <rober...@google.com> wrote:

> If you want to control the total number of elements being processed
> across all workers at a time, you can do this by assigning random keys
> of the form RandomInteger() % TotalDesiredConcurrency followed by a
> GroupByKey.
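>
> For example, a rough sketch with the Beam Python SDK (TOTAL_CONCURRENCY
> and expensive_process are placeholders for your own value and logic):
>
>     import random
>     import apache_beam as beam
>
>     TOTAL_CONCURRENCY = 100  # max elements in flight across all workers
>
>     def process_group(kv):
>         _key, values = kv
>         for v in values:  # each key's values are processed serially
>             yield expensive_process(v)
>
>     limited = (elements
>                | 'KeyRandomly' >> beam.Map(
>                    lambda e: (random.randrange(TOTAL_CONCURRENCY), e))
>                | 'CapParallelism' >> beam.GroupByKey()
>                | 'Process' >> beam.FlatMap(process_group))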
>
> If you want to control the number of elements being processed in
> parallel per VM, you can use the fact that Dataflow assigns one work
> item per core, so an n1-standard-4 would process 4 elements in
> parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
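>
> For instance (Python SDK; the project/region values below are only
> placeholders), the machine type is just a pipeline option:
>
>     from apache_beam.options.pipeline_options import PipelineOptions
>
>     options = PipelineOptions(
>         runner='DataflowRunner',
>         project='my-project',                # placeholder
>         region='us-central1',                # placeholder
>         worker_machine_type='n1-highmem-2',  # 2 cores -> ~2 elements
>                                              # in parallel per VM
>         max_num_workers=10)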
>
> You could also control this explicitly by using a global (per worker)
> semaphore in your code. If you do this, you may want to precede your
> rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
> distribution. This should be much easier than trying to coordinate
> multiple parallel pipelines.
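>
> A minimal sketch of the semaphore idea (MAX_IN_FLIGHT and
> expensive_process are placeholders; note that "per worker" here really
> means per worker process, since the semaphore is shared only within one
> Python process):
>
>     import threading
>     import apache_beam as beam
>
>     MAX_IN_FLIGHT = 10  # max concurrent elements per worker process
>
>     class RateLimitedFn(beam.DoFn):
>         # Class-level, so shared by all threads in the same process.
>         _sem = threading.BoundedSemaphore(MAX_IN_FLIGHT)
>
>         def process(self, element):
>             with RateLimitedFn._sem:
>                 out = expensive_process(element)
>             yield out
>
>     result = (elements
>               | beam.Reshuffle()  # fair, dynamic work distribution
>               | beam.ParDo(RateLimitedFn()))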
>
>
Is there a risk here of an OOM error due to a build-up of in-memory
elements from a streaming input? Or do the runners have some concept of
throttling bundles based on the progress of stages further down the
pipeline?




> On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research
> <e...@orielresearch.org> wrote:
> >
> > Thanks Robert.
> > I found the following explanation for the number of threads with 4
> > CPUs: "You have 4 CPU sockets; each CPU can have up to 12 cores, and
> > each core can have two threads. Your max thread count is 4 CPUs x 12
> > cores x 2 threads per core, so 4 x 12 x 2 = 96."
> > Can I limit the threads using the pipeline options in some way? 10-20
> > elements per worker would work for me.
> >
> > My current practice to work around this issue is to limit the number
> > of elements in each Dataflow pipeline (providing ~10 elements to each
> > pipeline).
> > Once I have finished processing around 200 elements = 20 pipelines
> > (Google does not allow more than 25 Dataflow pipelines per region)
> > with 10 elements each, I launch the next 20 pipelines.
> >
> > This is of course missing the benefit of serverless.
> >
> > Any idea, how to work around this?
> >
> > Best,
> > Eila
> >
> >
> > On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> wrote:
> >>
> >> Note that workers generally process one element per thread at a
> >> time. The number of threads defaults to the number of cores of the
> >> VM that you're using.
> >>
> >>> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> wrote:
> >>>
> >>> What type of files are you reading? If they can be split and read
> >>> by multiple workers, this might be a good candidate for a
> >>> Splittable DoFn (SDF).
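> >>>
> >>> As a rough sketch of what that could look like in the Python SDK
> >>> (assuming elements are (path, size) tuples and process_chunk is a
> >>> placeholder for the real library call):
> >>>
> >>>     import apache_beam as beam
> >>>     from apache_beam.io.restriction_trackers import (
> >>>         OffsetRange, OffsetRestrictionTracker)
> >>>     from apache_beam.transforms.core import RestrictionProvider
> >>>
> >>>     CHUNK = 64 * 1024 * 1024  # arbitrary 64 MiB chunks
> >>>
> >>>     class FileChunkProvider(RestrictionProvider):
> >>>         def initial_restriction(self, element):
> >>>             _path, size = element
> >>>             return OffsetRange(0, size)
> >>>
> >>>         def create_tracker(self, restriction):
> >>>             return OffsetRestrictionTracker(restriction)
> >>>
> >>>         def restriction_size(self, element, restriction):
> >>>             return restriction.size()
> >>>
> >>>     class ProcessChunksFn(beam.DoFn):
> >>>         def process(
> >>>             self, element,
> >>>             tracker=beam.DoFn.RestrictionParam(FileChunkProvider())):
> >>>             path, _size = element
> >>>             pos = tracker.current_restriction().start
> >>>             while tracker.try_claim(pos):
> >>>                 yield process_chunk(path, pos, CHUNK)
> >>>                 pos += CHUNK
> >>>
> >>> The runner can then split the offset range and hand chunks of one
> >>> large file to different workers.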
> >>>
> >>> Brian
> >>>
> >>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research <e...@orielresearch.org> wrote:
> >>>>
> >>>> Hi,
> >>>> I am running out of resources on the worker machines.
> >>>> The reasons are:
> >>>> 1. Every PCollection element is a reference to a LARGE file that
> >>>> is copied onto the worker
> >>>> 2. The worker makes calculations on the copied file using a
> >>>> software library that consumes memory / storage / compute resources
> >>>>
> >>>> I have changed the workers' CPUs and memory size. At some point, I
> >>>> run out of resources with this method as well.
> >>>> I am looking to limit the number of PCollection elements being
> >>>> processed in parallel on each worker at a time.
> >>>>
> >>>> Many thanks for any advice,
> >>>> Best wishes,
> >>>> --
> >>>> Eila
> >>>>
> >
> >
> >
> > --
> > Eila
> >
>


~Vincent
