Hello,

From a user perspective: we have some (rare) use cases where we use “coarse-grain”
datasets, with big beans and tasks that perform lengthy operations (such as ML
training). In these cases we had to increase the timeout to huge values
(heartbeat.timeout: 500000) so that our app is not killed.
I’m aware this is not the way Flink was meant to be used, but it’s a convenient
way to distribute our workload on data nodes without having to use another
concurrency framework (such as M/R) that would require rewriting our sources
and sinks.
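
For reference, that workaround is roughly a one-line override in our
flink-conf.yaml (the value is the one quoted above; the interval can stay at
its default):

heartbeat.timeout: 500000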

In some other (most common) cases, our tasks do some R/W accesses to RAM-cached
repositories backed by a key-value store such as Kudu (or HBase). While most of
those calls are very fast, when the system is under heavy load they can
sometimes block for more than a few seconds, and having our app killed because
of a short timeout is not an option.

That’s why I’m not in favor of very short timeouts: in my experience it really
depends on what the user code does in the tasks. (I understand that, in theory,
user code is not a JVM-blocking activity such as a GC and therefore should have
no impact on heartbeats, but from experience, it really does.)

Cheers,
Arnaud


From: Gen Luo <luogen...@gmail.com>
Sent: Thursday, July 22, 2021 05:46
To: Till Rohrmann <trohrm...@apache.org>
Cc: Yang Wang <danrtsey...@gmail.com>; dev <d...@flink.apache.org>; user
<user@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default
values

Hi,
Thanks for driving this @Till Rohrmann<mailto:trohrm...@apache.org>. I would
give +1 on reducing the heartbeat timeout and interval, though I'm not sure
whether 15s and 3s would be enough either.

IMO, besides the standalone cluster, where the heartbeat mechanism in Flink is
fully relied upon, reducing the heartbeat timeout can also help the JM detect
more quickly those TaskExecutors in abnormal conditions that cannot respond to
heartbeat requests, e.g. continuous Full GC, where the TaskExecutor process is
still alive and the deployment system may not notice anything wrong. Since
there are cases that can benefit from this change, I think it could be done as
long as it doesn't break the experience in other scenarios.

If we can identify what blocks the main threads from processing heartbeats, or
what enlarges the GC costs, we can try to eliminate those causes to get a more
predictable heartbeat response time, or give some advice to users whose jobs
may encounter these issues. For example, as far as I know the JM of a
large-scale job will be busier and may not be able to process heartbeats in
time, so we could advise users running jobs with more than 5000 tasks to
enlarge their heartbeat interval to 10s and the timeout to 50s (the numbers
here are just rough placeholders).
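
Just to make that concrete (keeping in mind that the numbers are placeholders),
such advice would translate to something like:

heartbeat.interval: 10s
heartbeat.timeout: 50s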

As for the issue in FLINK-23216, I think it should be fixed and may not be a 
main concern for this case.

On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann 
<trohrm...@apache.org<mailto:trohrm...@apache.org>> wrote:
Thanks for sharing these insights.

I think it is no longer true that the ResourceManager notifies the JobMaster 
about lost TaskExecutors. See FLINK-23216 [1] for more details.

Given the GC pauses, would you then be ok with decreasing the heartbeat timeout 
to 20 seconds? This should give enough time to do the GC and then still 
send/receive a heartbeat request.
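
Concretely, and assuming the interval stays at the proposed 3s, that would mean:

heartbeat.timeout: 20s
heartbeat.interval: 3s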

I also wanted to add that we are about to get rid of one big cause of blocking
I/O operations from the main thread. With FLINK-22483 [2] we will get rid of
filesystem accesses to retrieve completed checkpoints. This leaves us with one
additional file system access from the main thread, which is the one completing
a pending checkpoint. I think it should be possible to get rid of this access
because, as Stephan said, it only writes information to disk that has already
been written before. Maybe solving these two issues could ease concerns about
long periods of unresponsiveness in Flink.

[1] https://issues.apache.org/jira/browse/FLINK-23216
[2] https://issues.apache.org/jira/browse/FLINK-22483

Cheers,
Till

On Wed, Jul 21, 2021 at 4:58 AM Yang Wang 
<danrtsey...@gmail.com<mailto:danrtsey...@gmail.com>> wrote:
Thanks @Till Rohrmann<mailto:trohrm...@apache.org> for starting this discussion.

Firstly, I try to understand the benefit of a shorter heartbeat timeout. IIUC,
it will make the JobManager aware of a lost TaskManager faster. However, it
seems that only the standalone cluster could benefit from this. For Yarn and
native Kubernetes deployments, the Flink ResourceManager should get the
TaskManager-lost event within a very short time:

* Yarn: about 8 seconds (3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM)
* Native Kubernetes: less than 1 second, since the Flink RM has a watch on all
the TaskManager pods

Secondly, I am not very confident about decreasing the timeout to 15s. I
quickly checked the TaskManager GC logs from the past week of our internal
Flink workloads and found more than 100 Full GC pauses of around 10 seconds,
though none longer than 15s. We are using the CMS GC for the old generation.
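
In case others want to run the same check, here is a sketch of the JVM options
(JDK 8 style flags, assuming a CMS setup like ours; the log path is just a
placeholder) that would produce such GC logs, set via flink-conf.yaml:

env.java.opts.taskmanager: -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/taskmanager-gc.log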


Best,
Yang

Till Rohrmann <trohrm...@apache.org<mailto:trohrm...@apache.org>> wrote on
Saturday, July 17, 2021 at 1:05 AM:
Hi everyone,

Since Flink 1.5 we have had the same heartbeat timeout and interval default
values, defined as heartbeat.timeout: 50s and heartbeat.interval:
10s. These values were mainly chosen to compensate for lengthy GC pauses
and blocking operations that were executed in the main threads of Flink's
components. Since then, there have been quite some advancements w.r.t. the
JVM's GCs and we also got rid of a lot of blocking calls that were executed in
the main thread. Moreover, a long heartbeat.timeout causes long recovery
times in case of a TaskManager loss because the system can only properly
recover after the dead TaskManager has been removed from the scheduler.
Hence, I wanted to propose changing the timeout and interval to:

heartbeat.timeout: 15s
heartbeat.interval: 3s

Since there is no perfect solution that fits all use cases, I would really
like to hear what you think about this and how you configure these
heartbeat options. Based on your experience we might actually come up with
better default values that allow us to be resilient but also to detect
failed components fast. FLIP-185 can be found here [1].

[1] https://cwiki.apache.org/confluence/x/GAoBCw

Cheers,
Till

