Hi Paul,

Thanks for reporting this.

Indeed, Flink's RM currently performs several HDFS operations in the RPC
main thread when preparing the TM context. When HDFS is slow, these calls
block the main thread, so it cannot process TM heartbeats in time.

Unfortunately, I don't see any out-of-the-box approach that fixes the
problem at the moment, other than increasing the heartbeat timeout
(heartbeat.timeout).

As for a long-term solution, I think there's an easier approach than
limiting the number of contexts created at a time. We can move the creation
of the TM contexts off the RPC main thread. Ideally, we should avoid
performing any heavy operation in the RPC main thread unless it modifies
the RM's internal state. With FLINK-19241, this can be achieved easily by
delegating the work to the IO executor.
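
To illustrate the idea, here is a minimal sketch that uses plain JDK
executors rather than Flink's actual classes. All names in it
(OffloadSketch, createTaskManagerContext, startWorkers, and the two
executors) are hypothetical stand-ins and not the RM's real API. The slow
context creation runs on an IO pool, and only the cheap state update hops
back to a single-threaded "main thread" executor, which therefore stays
free for other work such as heartbeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class OffloadSketch {

    // Hypothetical stand-in for the HDFS-heavy part of preparing a TM context.
    static String createTaskManagerContext(String workerId) {
        try {
            Thread.sleep(200); // simulate slow HDFS calls
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "context-for-" + workerId;
    }

    static List<String> startWorkers(List<String> workerIds) {
        // Assumption: a single thread models the RPC main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Assumption: a small pool models the IO executor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        // Internal state; only ever modified from the main thread.
        List<String> startedContexts = new ArrayList<>();
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (String workerId : workerIds) {
                pending.add(
                        CompletableFuture
                                // Blocking work runs on the IO pool ...
                                .supplyAsync(() -> createTaskManagerContext(workerId), ioExecutor)
                                // ... and the state change runs back on the main thread.
                                .thenAcceptAsync(context -> startedContexts.add(context), mainThread));
            }
            // Wait for all workers; the caller here is not the main thread.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
            return new ArrayList<>(startedContexts);
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        // Completion order depends on scheduling, so the printed order may vary.
        System.out.println(startWorkers(Arrays.asList("tm-1", "tm-2", "tm-3")));
    }
}
```

The point of the split is that the main thread only ever runs the short
state update, so a slow HDFS delays individual contexts but not the
heartbeat handling.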

Thank you~

Xintong Song



On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi,
>
> After FLINK-13184 is implemented (even with Flink 1.11), occasionally there
> would still be jobs with high parallelism getting TM-RM heartbeat timeouts
> when RM is busy creating TM contexts on cluster initialization and HDFS is
> slow at that moment.
>
> Apart from increasing the TM heartbeat timeout, is there any recommended
> out-of-the-box approach that can reduce the chance of getting the timeouts?
>
> In the long run, is it possible to limit the number of taskmanager contexts
> that RM creates at a time, so that the heartbeat triggers can chime in?
>
> Thanks!
>
> Best,
> Paul Lam
>
