Hi,
Thanks for driving this @Till Rohrmann <trohrm...@apache.org>. I'd give a
+1 to reducing the heartbeat timeout and interval, though I'm not sure
whether 15s and 3s would be enough either.

IMO, besides the standalone cluster, which relies entirely on Flink's own
heartbeat mechanism, reducing the heartbeat settings can also help the JM
detect faster those TaskExecutors that are in abnormal conditions and cannot
respond to heartbeat requests, e.g., continuous Full GC, where the
TaskExecutor process is still alive and the deployment system may not notice
anything wrong. Since there are cases that benefit from this change, I think
it could be done as long as it doesn't break the experience in other
scenarios.

If we can identify what blocks the main threads from processing heartbeats,
or what increases the GC costs, we can try to get rid of those causes to make
the heartbeat response time more predictable, or give some advice to users
whose jobs may run into these issues. For example, as far as I know the JM of
a large-scale job is busier and may not be able to process heartbeats in
time, so we could advise users running jobs with more than 5000 tasks to
enlarge their heartbeat interval to 10s and the timeout to 50s. The numbers
are written casually.
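
Just to make that concrete, such advice would simply map to the following
entries in flink-conf.yaml (the values are placeholders as said above, not a
recommendation):

heartbeat.interval: 10s
heartbeat.timeout: 50s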

As for the issue in FLINK-23216, I think it should be fixed and may not be
a main concern for this case.

On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks for sharing these insights.
>
> I think it is no longer true that the ResourceManager notifies the
> JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
>
> Given the GC pauses, would you then be ok with decreasing the heartbeat
> timeout to 20 seconds? This should give enough time to do the GC and then
> still send/receive a heartbeat request.
>
> I also wanted to add that we are about to get rid of one big cause of
> blocking I/O operations from the main thread. With FLINK-22483 [2] we will
> get rid of Filesystem accesses to retrieve completed checkpoints. This
> leaves us with one additional file system access from the main thread which
> is the one completing a pending checkpoint. I think it should be possible
> to get rid of this access because, as Stephan said, it only writes
> information to disk that has already been written before. Maybe solving
> these two issues could ease concerns about long periods of unresponsiveness
> in Flink.
>
> [1] https://issues.apache.org/jira/browse/FLINK-23216
> [2] https://issues.apache.org/jira/browse/FLINK-22483
>
> Cheers,
> Till
>
> On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
>> Thanks @Till Rohrmann <trohrm...@apache.org>  for starting this
>> discussion
>>
>> Firstly, I tried to understand the benefit of a shorter heartbeat timeout.
>> IIUC, it will make the JobManager aware of a lost TaskManager faster.
>> However, it seems that only the standalone cluster could benefit from this.
>> For Yarn and native Kubernetes deployments, the Flink ResourceManager
>> should get the TaskManager-lost event in a very short time.
>>
>> * Yarn: about 8 seconds (3s for Yarn NM -> Yarn RM, 5s for Yarn RM ->
>> Flink RM)
>> * Native Kubernetes: less than 1 second, since the Flink RM has a watch on
>> all the TaskManager pods
>>
>> Secondly, I am not very confident about decreasing the timeout to 15s. I
>> have quickly checked the TaskManager GC logs from the past week of our
>> internal Flink workloads and found more than 100 Full GC pauses of around
>> 10 seconds, though none longer than 15s. We are using the CMS GC for the
>> old generation.
>>
>>
>> Best,
>> Yang
>>
>> Till Rohrmann <trohrm...@apache.org> wrote on Sat, Jul 17, 2021 at 1:05 AM:
>>
>>> Hi everyone,
>>>
>>> Since Flink 1.5 we have the same heartbeat timeout and interval default
>>> values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
>>> 10s. These values were mainly chosen to compensate for lengthy GC pauses
>>> and blocking operations that were executed in the main threads of Flink's
>>> components. Since then, there have been quite a few advancements wrt the
>>> JVM's GCs and we also got rid of a lot of blocking calls that were executed in
>>> the main thread. Moreover, a long heartbeat.timeout causes long recovery
>>> times in case of a TaskManager loss because the system can only properly
>>> recover after the dead TaskManager has been removed from the scheduler.
>>> Hence, I wanted to propose to change the timeout and interval to:
>>>
>>> heartbeat.timeout: 15s
>>> heartbeat.interval: 3s
>>>
>>> Since there is no perfect solution that fits all use cases, I would
>>> really
>>> like to hear from you what you think about it and how you configure these
>>> heartbeat options. Based on your experience we might actually come up
>>> with
>>> better default values that allow us to be resilient but also to detect
>>> failed components fast. FLIP-185 can be found here [1].
>>>
>>> [1] https://cwiki.apache.org/confluence/x/GAoBCw
>>>
>>> Cheers,
>>> Till
>>>
>>
