On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez
<vincent.marq...@gmail.com> wrote:
>
> On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> If you want to control the total number of elements being processed
>> across all workers at a time, you can do this by assigning random keys
>> of the form RandomInteger() % TotalDesiredConcurrency followed by a
>> GroupByKey.
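A minimal sketch of that keying idea in plain Python (in a real Beam pipeline this would be a `beam.Map` assigning the key followed by `beam.GroupByKey`; the concurrency value and element count here are assumed for illustration):

```python
import random
from collections import defaultdict

TOTAL_DESIRED_CONCURRENCY = 8  # assumed value for illustration


def assign_key(element):
    # Equivalent in spirit to: beam.Map(lambda e: (random.randrange(N), e))
    return (random.randrange(TOTAL_DESIRED_CONCURRENCY), element)


def group_by_key(keyed_elements):
    # Stand-in for beam.GroupByKey: after grouping, each key's values
    # are processed serially, so at most N keys run in parallel.
    groups = defaultdict(list)
    for key, value in keyed_elements:
        groups[key].append(value)
    return groups


groups = group_by_key(assign_key(e) for e in range(100))
# Keys are drawn from [0, TOTAL_DESIRED_CONCURRENCY), so at most that
# many groups can ever be in flight downstream of the GroupByKey.
```

The parallelism cap comes from the key space: with only N distinct keys, the runner can never process more than N groups at once.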
>>
>> If you want to control the number of elements being processed in
>> parallel per VM, you can use the fact that Dataflow assigns one work
>> item per core, so an n1-standard-4 would process 4 elements in
>> parallel, an n1-highmem-2 would process 2 elements in parallel, etc.
>>
>> You could also control this explicitly by using a global (per worker)
>> semaphore in your code. If you do this you may want to precede your
>> rate-limited DoFn with a Reshuffle to ensure fair (and dynamic) work
>> distribution. This should be much easier than trying to coordinate
>> multiple parallel pipelines.
>>
>
> Is there a risk here of an OOM error due to a 'build up' of in-memory
> elements from a streaming input?  Or do the runners have some concept of
> throttling bundles based on the progress of stages further down the pipeline?

For streaming pipelines, hundreds of threads (aka work items) are
allocated for each worker, so limiting the number of concurrent items
per worker is harder there.

>> On Fri, May 28, 2021 at 5:16 AM Eila Oriel Research
>> <e...@orielresearch.org> wrote:
>> >
>> > Thanks Robert.
>> > I found the following explanation for the number of threads for 4 cores:
>> > You have 4 CPU sockets; each CPU can have up to 12 cores, and each core
>> > can have two threads. Your max thread count is 4 CPUs x 12 cores x 2
>> > threads per core, so 4 x 12 x 2 = 96.
>> > Can I limit the threads using the pipeline options in some way? 10-20 
>> > elements per worker will work for me.
>> >
>> > My current practice to work around the issue is to limit the number of
>> > elements in each Dataflow pipeline (providing ~10 elements per
>> > pipeline).
>> > Once I have completed processing around 200 elements = 20 pipelines
>> > (Google does not allow more than 25 Dataflow pipelines per region) with
>> > 10 elements each, I launch the next 20 pipelines.
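The batching workaround described above can be sketched as follows (the pipeline-launch function is a hypothetical placeholder, not a real Dataflow API; element counts are taken from the numbers in this message):

```python
ELEMENTS = list(range(200))       # all work items, per the message above
ELEMENTS_PER_PIPELINE = 10        # ~10 elements per pipeline
MAX_PIPELINES_PER_WAVE = 20       # stays under the 25-per-region quota


def launch_pipeline(batch):
    """Hypothetical placeholder: would build and run one Dataflow job."""
    return len(batch)


batches = [ELEMENTS[i:i + ELEMENTS_PER_PIPELINE]
           for i in range(0, len(ELEMENTS), ELEMENTS_PER_PIPELINE)]
waves = [batches[i:i + MAX_PIPELINES_PER_WAVE]
         for i in range(0, len(batches), MAX_PIPELINES_PER_WAVE)]
# 200 elements -> 20 pipelines of 10 each -> a single wave of 20 jobs;
# each wave would be launched only after the previous one completes.
```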
>> >
>> > This is, of course, missing the benefit of serverless.
>> >
>> > Any idea, how to work around this?
>> >
>> > Best,
>> > Eila
>> >
>> >
>> > On Mon, May 17, 2021 at 1:27 PM Robert Bradshaw <rober...@google.com> 
>> > wrote:
>> >>
>> >> Note that workers generally process one element per thread at a time. The 
>> >> number of threads defaults to the number of cores of the VM that you're 
>> >> using.
>> >>
>> >> On Mon, May 17, 2021 at 10:18 AM Brian Hulette <bhule...@google.com> 
>> >> wrote:
>> >>>
>> >>> What type of files are you reading? If they can be split and read by 
>> >>> multiple workers this might be a good candidate for a Splittable DoFn 
>> >>> (SDF).
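The idea behind a Splittable DoFn is that one large file becomes a restriction (e.g. a byte range) that the runner can split, so several workers each read part of the same file. A rough illustration of the range-splitting step only, not the real Beam SDF API (the chunk size is an assumed value):

```python
def split_restriction(file_size, chunk_size):
    """Split the byte range [0, file_size) into sub-ranges that
    independent workers could each read. Chunk size is an assumption."""
    ranges = []
    start = 0
    while start < file_size:
        end = min(start + chunk_size, file_size)
        ranges.append((start, end))
        start = end
    return ranges


# A 1 GiB file split into 64 MiB chunks yields 16 independent ranges,
# each of which could be claimed and read by a different worker.
chunks = split_restriction(1 << 30, 64 << 20)
```

This only helps when the file format is actually splittable (e.g. line-delimited text, Avro); formats that must be read whole cannot be divided this way.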
>> >>>
>> >>> Brian
>> >>>
>> >>> On Wed, May 12, 2021 at 6:18 AM Eila Oriel Research 
>> >>> <e...@orielresearch.org> wrote:
>> >>>>
>> >>>> Hi,
>> >>>> I am running out of resources on the worker machines.
>> >>>> The reasons are:
>> >>>> 1. Every PCollection element is a reference to a LARGE file that is
>> >>>> copied onto the worker
>> >>>> 2. The worker makes calculations on the copied file using a software
>> >>>> library that consumes memory / storage / compute resources
>> >>>>
>> >>>> I have changed the workers' CPUs and memory size. At some point, I run
>> >>>> out of resources with this method as well.
>> >>>> I am looking to limit the number of PCollection elements that are
>> >>>> being processed in parallel on each worker at a time.
>> >>>>
>> >>>> Many thanks for any advice,
>> >>>> Best wishes,
>> >>>> --
>> >>>> Eila
>> >>>>
>> >
>> >
>> >
>> > --
>> > Eila
>> >
>
>
>
> ~Vincent
