Thanks for sharing your thoughts, Chesnay; I overall agree with you. We
can't give a default value that suits all jobs, but we can figure out
whether the current default value is too large for most jobs, and that
should be the guideline for this topic; configurability covers the
remaining cases.

But maybe we should list the benefits of the change again. The motivation
presented in the FLIP is mostly about why we can make the change, but says
less about why we should, considering that FLINK-23216 will be fixed.

By the way, I'd like to confirm whether the behavior of jobs using the
default values will change when they upgrade their Flink version and resume
from savepoints. If so, we have to give users a big warning if we
finally decide to change this, since the change is silent but can be
critical in some cases.

@Till
> I am, however, not so sure whether we can give a hard threshold
like 5000 tasks
A hard threshold indeed can't apply to all cases. What I meant to suggest is
a configuration validation phase, where we check whether the configuration
is potentially unsuitable for the job, or whether some options cannot be
used together with others. Users would be warned, and if anything goes
wrong, the warning would serve as one of the debugging hints. This is a bit
like Chesnay's suggestion, but at compile time (see the sketch below).
Large-scale jobs with too-short heartbeat intervals are one such case, but I
agree it's hard to decide when the warning should be shown.
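
To make this more concrete, here is a rough, purely illustrative sketch of
what such a validation pass might look like. None of these class or method
names exist in Flink today, and the 5000-task threshold is only a
placeholder borrowed from the example above:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HeartbeatConfigValidator {

    private static final Logger LOG =
            LoggerFactory.getLogger(HeartbeatConfigValidator.class);

    // Purely illustrative threshold; as discussed, no single hard number fits all jobs.
    private static final int LARGE_JOB_TASK_COUNT = 5000;

    // Hypothetical hook that could run at job compilation/submission time and
    // log warnings for option combinations that look risky for this job.
    public static void validate(
            long heartbeatIntervalMs, long heartbeatTimeoutMs, int totalTasks) {
        if (heartbeatTimeoutMs <= heartbeatIntervalMs) {
            LOG.warn(
                    "heartbeat.timeout ({} ms) is not larger than heartbeat.interval ({} ms); "
                            + "heartbeats may time out spuriously.",
                    heartbeatTimeoutMs,
                    heartbeatIntervalMs);
        }
        if (totalTasks > LARGE_JOB_TASK_COUNT && heartbeatIntervalMs < 10_000L) {
            LOG.warn(
                    "Job has {} tasks; with heartbeat.interval = {} ms the JobManager may not "
                            + "be able to process heartbeats in time. Consider enlarging the "
                            + "interval and timeout.",
                    totalTasks,
                    heartbeatIntervalMs);
        }
    }
}

Whether such a check belongs in the client or in the JobManager, and what the
exact rules should be, is of course open for discussion.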

On Thu, Jul 22, 2021 at 11:09 PM Chesnay Schepler <ches...@apache.org>
wrote:

> I'm wondering if this discussion isn't going in the wrong direction.
> It is clear that we cannot support all use cases with the defaults, so
> let's not try; we won't find such defaults.
> And I would argue that is also not their purpose; they are configurable
> for a reason.
> I would say the defaults should provide a good experience to users
> starting out with Flink.
>
> Because users with heavy workloads don't deploy Flink overnight. They
> approach a production deployment step-by-step, for the very purpose of
> working out kinks in the configuration/stability.
> If in this process the default heartbeat configuration ends up being too
> harsh, *then that is not a problem*.
> *If* there is sufficient information presented to the user on how to
> transition to a working setup, that is.
> So far the production checklist has served that purpose to some extent,
> but maybe we should also add a separate page on working under bigger loads.
>
> A far greater issue in my opinion is that users don't get warnings that
> something is about to go wrong.
> The heartbeat system right now is fairly binary. It works fine while the
> configuration is suitable, until it no longer is and everything goes up in
> flames.
>
> If we were to log warnings when Flink gets close to hitting the heartbeat
> timeout, or even expose metrics for heartbeat round-trip times or similar,
> I think we could alleviate many of the concerns that people have.
>
> Or we just provide 2 configs with the distribution, one for starting out,
> one for more serious workloads. ¯\_(ツ)_/¯
> On 22/07/2021 13:22, 刘建刚 wrote:
>
> Thanks, Till. There are many reasons to reduce the heartbeat interval and
> timeout, but I am not sure what values are suitable. In our case, GC
> time and job size can be relevant factors. Since most Flink jobs are pipelined
> and a full failover can take some time, we should tolerate some stop-the-world
> pauses. Also, I think FLINK-23216 should be solved so that lost
> containers are detected and reacted to quickly. For my part, I suggest
> reducing the values gradually.
>
> On Thu, Jul 22, 2021 at 5:33 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Thanks for your inputs Gen and Arnaud.
>>
>> I do agree with you, Gen, that we need better guidance for our users on
>> when to change the heartbeat configuration. I think this should happen in
>> any case. I am, however, not so sure whether we can give a hard threshold
>> like 5000 tasks, for example, because as Arnaud said it strongly depends
>> on
>> the workload. Maybe we can explain it based on symptoms a user might
>> experience and what to do then.
>>
>> Concerning your workloads, Arnaud, I'd be interested to learn a bit more.
>> The user code runs in its own thread. This means that its operation won't
>> block the main thread/heartbeat. The only thing that can happen is that
>> the
>> user code starves the heartbeat in terms of CPU cycles or causes a lot of
>> GC pauses. If you are observing the former problem, then we might think
>> about changing the priorities of the respective threads. This should then
>> improve Flink's stability for these workloads and a shorter heartbeat
>> timeout should be possible.
>>
>> Also for the RAM-cached repositories, what exactly is causing the
>> heartbeat
>> to time out? Is it because you have a lot of GC or that the heartbeat
>> thread does not get enough CPU cycles?
>>
>> Cheers,
>> Till
>>
>> On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <al...@bouyguestelecom.fr>
>> wrote:
>>
>> > Hello,
>> >
>> >
>> >
>> > From a user perspective: we have some (rare) use cases where we use
>> > “coarse-grained” datasets, with big beans and tasks that do lengthy
>> > operations (such as ML training). In these cases we had to increase the
>> > timeout to huge values (heartbeat.timeout: 500000) so that our app is not
>> > killed.
>> >
>> > I’m aware this is not the way Flink was meant to be used, but it’s a
>> > convenient way to distribute our workload on datanodes without having to
>> > use another concurrency framework (such as M/R) that would require the
>> > recoding of sources and sinks.
>> >
>> >
>> >
>> > In some other (more common) cases, our tasks do some R/W accesses to
>> > RAM-cached repositories backed by a key-value store such as Kudu (or
>> > HBase). While most of those calls are very fast, when the system is under
>> > heavy load they may sometimes block for more than a few seconds, and
>> > having our app killed because of a short timeout is not an option.
>> >
>> >
>> >
>> > That’s why I’m not in favor of very short timeouts… because in my
>> > experience it really depends on what the user code does in the tasks. (I
>> > understand that normally, since user code is not a JVM-blocking activity
>> > such as a GC, it should have no impact on heartbeats, but in my
>> > experience it really does.)
>> >
>> >
>> >
>> > Cheers,
>> >
>> > Arnaud
>> >
>> >
>> >
>> >
>> >
>> > *From:* Gen Luo <luogen...@gmail.com>
>> > *Sent:* Thursday, July 22, 2021 05:46
>> > *To:* Till Rohrmann <trohrm...@apache.org>
>> > *Cc:* Yang Wang <danrtsey...@gmail.com>; dev <dev@flink.apache.org>;
>> > user <u...@flink.apache.org>
>> > *Subject:* Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval
>> > default values
>> >
>> >
>> >
>> > Hi,
>> >
>> > Thanks for driving this, @Till Rohrmann <trohrm...@apache.org>. I would
>> > give +1 on reducing the heartbeat timeout and interval, though I'm not
>> > sure whether 15s and 3s would be enough either.
>> >
>> >
>> >
>> > IMO, besides the standalone cluster, which relies entirely on Flink's
>> > heartbeat mechanism, reducing the heartbeat can also help the JM detect
>> > faster those TaskExecutors in abnormal conditions that cannot respond to
>> > heartbeat requests, e.g., continuous full GC, where the TaskExecutor
>> > process is still alive and the deployment system may not notice anything
>> > wrong. Since there are cases that can benefit from this change, I think
>> > it could be done as long as it doesn't break the experience in other
>> > scenarios.
>> >
>> >
>> >
>> > If we can identify what blocks the main threads from processing
>> > heartbeats, or what increases GC costs, we can try to eliminate those
>> > causes to make heartbeat response times more predictable, or give some
>> > advice to users whose jobs may run into these issues. For example, as far
>> > as I know the JM of a large-scale job is busier and may not be able to
>> > process heartbeats in time, so we could advise users running jobs larger
>> > than 5000 tasks to enlarge their heartbeat interval to 10s and the
>> > timeout to 50s. The numbers here are just rough examples.
>> >
>> >
>> >
>> > As for the issue in FLINK-23216, I think it should be fixed, and it may
>> > not be a main concern for this case.
>> >
>> >
>> >
>> > On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <trohrm...@apache.org>
>> > wrote:
>> >
>> > Thanks for sharing these insights.
>> >
>> >
>> >
>> > I think it is no longer true that the ResourceManager notifies the
>> > JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more
>> details.
>> >
>> >
>> >
>> > Given the GC pauses, would you then be ok with decreasing the heartbeat
>> > timeout to 20 seconds? This should give enough time to do the GC and
>> then
>> > still send/receive a heartbeat request.
>> >
>> >
>> >
>> > I also wanted to add that we are about to get rid of one big cause of
>> > blocking I/O operations from the main thread. With FLINK-22483 [2] we
>> will
>> > get rid of Filesystem accesses to retrieve completed checkpoints. This
>> > leaves us with one additional file system access from the main thread
>> which
>> > is the one completing a pending checkpoint. I think it should be
>> possible
>> > to get rid of this access because, as Stephan said, it only writes
>> > information to disk that has already been written before. Maybe solving
>> > these two issues could ease concerns about long periods of
>> > unresponsiveness in Flink.
>> >
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-23216
>> >
>> > [2] https://issues.apache.org/jira/browse/FLINK-22483
>> >
>> >
>> >
>> > Cheers,
>> >
>> > Till
>> >
>> >
>> >
>> > On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <danrtsey...@gmail.com>
>> wrote:
>> >
>> > Thanks @Till Rohrmann <trohrm...@apache.org>  for starting this
>> discussion
>> >
>> >
>> >
>> > Firstly, I tried to understand the benefit of a shorter heartbeat
>> > timeout. IIUC, it will make the JobManager aware of lost TaskManagers
>> > faster. However, it seems that only the standalone cluster could benefit
>> > from this. For Yarn and native Kubernetes deployments, the Flink
>> > ResourceManager should get the TaskManager-lost event in a very short
>> > time:
>> >
>> >
>> >
>> > * Yarn: about 8 seconds (3s for Yarn NM -> Yarn RM, 5s for Yarn RM ->
>> > Flink RM)
>> >
>> > * Kubernetes: less than 1 second (the Flink RM has a watch on all the
>> > TaskManager pods)
>> >
>> >
>> >
>> > Secondly, I am not very confident about decreasing the timeout to 15s. I
>> > quickly checked the TaskManager GC logs from the past week of our
>> > internal Flink workloads and found more than 100 full GCs of around 10
>> > seconds, but none longer than 15s. We are using the CMS GC for the old
>> > generation.
>> >
>> >
>> >
>> >
>> >
>> > Best,
>> >
>> > Yang
>> >
>> >
>> >
>> > On Sat, Jul 17, 2021 at 1:05 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>> >
>> > Hi everyone,
>> >
>> > Since Flink 1.5 we have had the same heartbeat timeout and interval
>> > default values, defined as heartbeat.timeout: 50s and heartbeat.interval:
>> > 10s. These values were mainly chosen to compensate for lengthy GC pauses
>> > and blocking operations that were executed in the main threads of Flink's
>> > components. Since then, there have been quite a few advancements in the
>> > JVM's GCs, and we also got rid of a lot of blocking calls executed in
>> > the main thread. Moreover, a long heartbeat.timeout causes long recovery
>> > times in case of a TaskManager loss because the system can only properly
>> > recover after the dead TaskManager has been removed from the scheduler.
>> > Hence, I wanted to propose to change the timeout and interval to:
>> >
>> > heartbeat.timeout: 15s
>> > heartbeat.interval: 3s
>> >
>> > Since there is no perfect solution that fits all use cases, I would
>> really
>> > like to hear from you what you think about it and how you configure
>> these
>> > heartbeat options. Based on your experience we might actually come up
>> with
>> > better default values that allow us to be resilient but also to detect
>> > failed components fast. FLIP-185 can be found here [1].
>> >
>> > [1] https://cwiki.apache.org/confluence/x/GAoBCw
>> >
>> > Cheers,
>> > Till
>> >
>> >
>> >
>>
>
>
