Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-12 Thread Xintong Song
No worries :) Thank you~ Xintong Song On Mon, Oct 12, 2020 at 2:48 PM Paul Lam wrote: > Sorry for the misspelled name, Xintong > > Best, > Paul Lam > > On Oct 12, 2020, at 14:46, Paul Lam wrote: > > Hi Xingtong, > > Thanks a lot for the pointer! > > It’s good to see there would be a new IO executor

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-12 Thread Paul Lam
Sorry for the misspelled name, Xintong Best, Paul Lam > On Oct 12, 2020, at 14:46, Paul Lam wrote: > > Hi Xingtong, > > Thanks a lot for the pointer! > > It’s good to see there would be a new IO executor to take care of the TM > contexts. Looking forward to the 1.12 release! > > Best, > Paul Lam > >>

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-12 Thread Paul Lam
Hi Xingtong, Thanks a lot for the pointer! It’s good to see there would be a new IO executor to take care of the TM contexts. Looking forward to the 1.12 release! Best, Paul Lam > On Oct 12, 2020, at 14:18, Xintong Song wrote: > > Hi Paul, > > Thanks for reporting this. > > Indeed, Flink's RM

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-12 Thread Xintong Song
FYI, I just created FLINK-19568 for tracking this issue. Thank you~ Xintong Song [1] https://issues.apache.org/jira/browse/FLINK-19568 On Mon, Oct 12, 2020 at 2:18 PM Xintong Song wrote: > Hi Paul, > > Thanks for reporting this. > > Indeed, Flink's RM currently performs several HDFS

Re: TM heartbeat timeout due to ResourceManager being busy

2020-10-12 Thread Xintong Song
Hi Paul, Thanks for reporting this. Indeed, Flink's RM currently performs several HDFS operations in the RPC main thread when preparing the TM context, which may block the main thread when HDFS is slow. Unfortunately, I don't see any out-of-the-box approach that fixes the problem at the moment,
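
For readers following the thread, the blocking problem described above can be illustrated with a small standalone Java sketch. It is not Flink code and does not show how FLINK-19568 is actually implemented. The class name, thread names, and the slowFileSystemCall helper are all hypothetical stand-ins. It only shows the general idea of running slow file-system work on a separate IO executor, so that a single-threaded "main thread" stays free to answer heartbeats.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OffloadSketch {

    // Hypothetical stand-in for a slow HDFS operation done while preparing a TM context.
    static String slowFileSystemCall(String name) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "context-for-" + name;
    }

    // Returns {heartbeat result, context result}; each string records the thread that did the work.
    static String[] run() throws Exception {
        // Single-threaded executor playing the role of an RPC main thread (assumption).
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "rpc-main"));
        // Separate pool for blocking IO (assumption).
        ExecutorService ioExecutor =
                Executors.newFixedThreadPool(4, r -> new Thread(r, "io-worker"));
        try {
            // The slow call runs on the IO pool; only the cheap follow-up runs on the main thread.
            CompletableFuture<String> context = CompletableFuture
                    .supplyAsync(
                            () -> Thread.currentThread().getName() + ":" + slowFileSystemCall("tm-1"),
                            ioExecutor)
                    .thenApplyAsync(
                            result -> result + " handled-on " + Thread.currentThread().getName(),
                            mainThread);

            // The main thread is not blocked by the slow call, so this returns well within 1 second.
            String heartbeat = CompletableFuture
                    .supplyAsync(
                            () -> "heartbeat-ack on " + Thread.currentThread().getName(),
                            mainThread)
                    .get(1, TimeUnit.SECONDS);

            return new String[] {heartbeat, context.get(5, TimeUnit.SECONDS)};
        } finally {
            mainThread.shutdown();
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String line : run()) {
            System.out.println(line);
        }
    }
}
```

If slowFileSystemCall were instead submitted directly to mainThread, the heartbeat task would queue behind it. That is the failure mode the thread describes when HDFS is slow.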

TM heartbeat timeout due to ResourceManager being busy

2020-10-11 Thread Paul Lam
Hi, After FLINK-13184 is implemented (even with Flink 1.11), occasionally there would still be jobs with high parallelism getting TM-RM heartbeat timeouts when RM is busy creating TM contexts on cluster initialization and HDFS is slow at that moment. Apart from increasing the TM heartbeat