Re: TM heartbeat timeout due to ResourceManager being busy

Paul Lam Sun, 11 Oct 2020 23:48:51 -0700

Sorry for the misspelled name, Xintong

Best,
Paul Lam


> On Oct 12, 2020, at 14:46, Paul Lam <paullin3...@gmail.com> wrote:
> 
> Hi Xingtong,
> 
> Thanks a lot for the pointer!
> 
> It’s good to see that a new IO executor will take care of the TM 
> contexts. Looking forward to the 1.12 release!
> 
> Best,
> Paul Lam
> 
>> On Oct 12, 2020, at 14:18, Xintong Song <tonysong...@gmail.com> wrote:
>> 
>> Hi Paul,
>> 
>> Thanks for reporting this.
>> 
>> Indeed, Flink's RM currently performs several HDFS operations in the RPC 
>> main thread when preparing the TM context, which may block the main thread 
>> when HDFS is slow.
>> 
>> Unfortunately, I don't see any out-of-the-box approach that fixes the problem 
>> at the moment, except for increasing the heartbeat timeout.
>> 
>> As for the long-run solution, I think there's an easier approach. We can 
>> move the creation of the TM contexts off the RPC main thread. Ideally, we 
>> should avoid performing any heavy operations in the RPC main thread unless 
>> they modify the RM's internal state. With FLINK-19241, this can be 
>> achieved easily by delegating the work to the IO executor.
>> 
>> Thank you~
>> Xintong Song
>> 
>> 
>> On Mon, Oct 12, 2020 at 12:44 PM Paul Lam <paullin3...@gmail.com> wrote:
>> Hi,
>> 
>> Even after FLINK-13184 was implemented (that is, even with Flink 1.11), jobs 
>> with high parallelism occasionally still get TM-RM heartbeat timeouts when 
>> the RM is busy creating TM contexts during cluster initialization and HDFS 
>> is slow at that moment.
>> 
>> Apart from increasing the TM heartbeat timeout, is there any recommended 
>> out-of-the-box approach that can reduce the chance of hitting the timeouts?
>> 
>> In the long run, would it be possible to limit the number of TaskManager 
>> contexts that the RM creates at a time, so that the heartbeat triggers can 
>> chime in?
>> 
>> Thanks!
>> 
>> Best,
>> Paul Lam
>
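
The approach Xintong describes, moving TM context creation off the RPC main thread and onto an IO executor, can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not Flink's actual code: the class TmContextOffloadSketch, the helper createTmContext, the pool size of 4, and the 200 ms sleep that stands in for a slow HDFS call are all hypothetical. A single-threaded executor plays the role of the RPC main thread. The slow work runs on the IO executor, and only the cheap state update is handed back to the main thread, so a heartbeat submitted in the meantime is not stuck behind the HDFS calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not Flink code: every name here is made up for illustration.
final class TmContextOffloadSketch {

    static final class Result {
        final List<String> contexts;  // contexts recorded on the "main thread"
        final long heartbeatDelayMs;  // how long the heartbeat waited for the main thread

        Result(List<String> contexts, long heartbeatDelayMs) {
            this.contexts = contexts;
            this.heartbeatDelayMs = heartbeatDelayMs;
        }
    }

    // Stand-in for the slow, HDFS-bound work of preparing one TM context.
    static String createTmContext(int id) {
        try {
            Thread.sleep(200); // assumed latency of a slow HDFS call
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "tm-context-" + id;
    }

    static Result run(int count) {
        // Single thread: plays the role of the RPC main thread.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        // Separate pool: plays the role of the IO executor.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        List<String> started = new ArrayList<>(); // only touched from mainThread
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int id = i;
                pending.add(CompletableFuture
                        // Heavy work that does not modify internal state: IO executor.
                        .supplyAsync(() -> createTmContext(id), ioExecutor)
                        // State update: back on the main thread.
                        .thenAcceptAsync(started::add, mainThread));
            }
            // A "heartbeat" submitted while the contexts are still being prepared.
            final long submittedAt = System.nanoTime();
            long heartbeatDelayMs = CompletableFuture
                    .supplyAsync(() -> (System.nanoTime() - submittedAt) / 1_000_000L,
                            mainThread)
                    .join();
            // Wait until every context has been created and recorded.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
            // Read the state from the main thread, which is the only thread that writes it.
            List<String> snapshot = CompletableFuture
                    .supplyAsync(() -> new ArrayList<String>(started), mainThread)
                    .join();
            return new Result(snapshot, heartbeatDelayMs);
        } finally {
            ioExecutor.shutdown();
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) {
        Result r = run(8);
        System.out.println("contexts created: " + r.contexts.size());
        System.out.println("heartbeat waited (ms): " + r.heartbeatDelayMs);
    }
}
```

If createTmContext were instead called directly on the main-thread executor, the heartbeat task would queue behind all eight simulated HDFS calls, which is the behavior reported at the start of this thread.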
