Sorry for the misspelled name, Xintong.

Best,
Paul Lam

> On Oct 12, 2020, at 14:46, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi Xingtong,
>
> Thanks a lot for the pointer!
>
> It’s good to see there will be a new IO executor to take care of the TM
> contexts. Looking forward to the 1.12 release!
>
> Best,
> Paul Lam
>
>> On Oct 12, 2020, at 14:18, Xintong Song <tonysong...@gmail.com> wrote:
>>
>> Hi Paul,
>>
>> Thanks for reporting this.
>>
>> Indeed, Flink's RM currently performs several HDFS operations in the rpc
>> main thread when preparing the TM context, which may block the main thread
>> when HDFS is slow.
>>
>> Unfortunately, I don't see any out-of-the-box approach that fixes the
>> problem at the moment, except for increasing the heartbeat timeout.
>>
>> As for the long-run solution, I think there's an easier approach. We can
>> move the creation of the TM contexts away from the rpc main thread. Ideally,
>> we should avoid performing any heavy operations in the rpc main thread that
>> do not modify the RM's internal state. With FLINK-19241, this can be
>> achieved easily by delegating the work to the io executor.
>>
>> Thank you~
>> Xintong Song
>>
>>
>> On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:
>> Hi,
>>
>> After FLINK-13184 is implemented (even with Flink 1.11), occasionally there
>> would still be jobs with high parallelism getting TM-RM heartbeat timeouts
>> when RM is busy creating TM contexts on cluster initialization and HDFS is
>> slow at that moment.
>>
>> Apart from increasing the TM heartbeat timeout, is there any recommended
>> out-of-the-box approach that can reduce the chance of getting the timeouts?
>>
>> In the long run, is it possible to limit the number of taskmanager contexts
>> that RM creates at a time, so that the heartbeat triggers can chime in?
>>
>> Thanks!
>>
>> Best,
>> Paul Lam
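
For illustration only, the approach Xintong describes above (build the TM context off the rpc main thread, then apply the result back on it) might look roughly like the sketch below. This is not Flink code: the class `TmContextOffloadSketch`, the `TmContext` type, the method names, and the thread names `rpc-main` and `io` are all hypothetical, and the sleep merely stands in for slow HDFS calls. It only shows the general pattern with plain `java.util.concurrent` executors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TmContextOffloadSketch {

    /** Hypothetical stand-in for a TM launch context; records which threads touched it. */
    public static final class TmContext {
        public final String workerId;
        public final String createdOn; // thread that built the context
        public String appliedOn;       // thread that updated the RM state

        TmContext(String workerId, String createdOn) {
            this.workerId = workerId;
            this.createdOn = createdOn;
        }
    }

    // Single thread playing the role of the rpc main thread (assumption, not Flink's class).
    private final ExecutorService rpcMainThread =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "rpc-main"));
    // Small pool playing the role of the io executor.
    private final ExecutorService ioExecutor =
            Executors.newFixedThreadPool(4, r -> new Thread(r, "io"));
    // RM-internal state: only ever modified on the rpc main thread.
    private final List<TmContext> startedWorkers = new ArrayList<>();

    /** Heavy part with no RM state changes; the sleep simulates slow HDFS operations. */
    private static TmContext createContext(String workerId) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new TmContext(workerId, Thread.currentThread().getName());
    }

    /** Builds the context on the io pool, then hops back to the main thread to touch state. */
    CompletableFuture<Void> requestWorker(String workerId) {
        return CompletableFuture
                .supplyAsync(() -> createContext(workerId), ioExecutor)
                .thenAcceptAsync(ctx -> {
                    ctx.appliedOn = Thread.currentThread().getName();
                    startedWorkers.add(ctx);
                }, rpcMainThread);
    }

    /** Requests the given number of workers and waits until all have been applied. */
    public static List<TmContext> run(int workers) {
        TmContextOffloadSketch rm = new TmContextOffloadSketch();
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                pending.add(rm.requestWorker("tm-" + i));
            }
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
            return new ArrayList<>(rm.startedWorkers);
        } finally {
            rm.rpcMainThread.shutdown();
            rm.ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        for (TmContext ctx : run(3)) {
            System.out.println(ctx.workerId + " created on " + ctx.createdOn
                    + ", applied on " + ctx.appliedOn);
        }
    }
}
```

Because the main thread only runs the short state update, anything else queued on it (heartbeat handling, for instance) would not have to wait behind the slow context creation.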