Hi Chad,

We have (Blink) jobs each running with over ten thousand TMs.
In our experience, the main regression caused by large-scale TMs is in the
TM allocation stage in the ResourceManager: it sometimes fails to allocate
enough TMs before the allocation timeout.
Performance does not deteriorate much once the Flink cluster has reached a
stable state.
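
If allocation timeouts are the bottleneck, one common mitigation is to give
the ResourceManager more time to bring up TMs before slot requests fail. A
minimal flink-conf.yaml sketch; the 10-minute value here is an illustrative
assumption, not a recommendation from this thread:

    # Allow slot requests to wait longer before timing out
    # (default is 300000 ms / 5 minutes in Flink 1.x)
    slot.request.timeout: 600000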

The main load, in my mind, increases with the task scale and edge scale of
a submitted job.
The JM can be overwhelmed by frequent and slow GCs caused by task
deployment if the JM memory is not fine-tuned.
The JM can also slow down due to more RPCs hitting the JM main thread and
the increased computational complexity of handling each RPC.
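
To reduce GC pressure on the JM at this scale, the usual levers are a
larger JM heap and explicit GC settings. A minimal flink-conf.yaml sketch;
the sizes and the G1 choice are assumptions for illustration, not values
from this thread:

    # Heap for the JobManager process, so the ExecutionGraph and
    # deployment bookkeeping of a large job fit comfortably
    jobmanager.heap.size: 8192m

    # JVM options applied only to the JobManager; hypothetical GC
    # tuning, adjust to your own workload
    env.java.opts.jobmanager: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200"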

Thanks,
Zhu Zhu

qi luo <luoqi...@gmail.com> 于2019年8月11日周日 下午6:17写道:

> Hi Chad,
>
> In our case, 1~2k TMs with up to ~10k TM slots are used in a single job.
> In general, the JobManager's CPU/memory should be increased with more TMs.
>
> Regards,
> Qi
>
> > On Aug 11, 2019, at 2:03 AM, Chad Dombrova <chad...@gmail.com> wrote:
> >
> > Hi,
> > I'm still on my task management investigation, and I'm curious to know
> how many task managers people are reliably using with Flink.  We're
> currently using AWS | Thinkbox Deadline, and we're able to easily utilize
> over 300 workers, and I've heard from other customers who use several
> thousand, so I'm curious how Flink compares in this regard.  Also, what
> aspects of the system begin to deteriorate at higher scales?
> >
> > thanks in advance!
> >
> > -chad
> >
>
>
