Hi Team, please unsubscribe me from all these emails.
On Fri, 23 Jul 2021 at 2:19 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:

> Hello,
>
> It’s hard to say what caused the timeout to trigger – I agree with you that it should not have stopped the heartbeat thread, but it did. The easy fix was to increase it until we no longer saw our app self-killed. The task was running a CPU-intensive computation (with a few threads created at some points… somehow breaking the “slot number” contract).
> For the RAM cache, I believe that the heartbeat timeout may also be hit because of a busy network.
>
> Cheers,
> Arnaud
>
> From: Till Rohrmann <trohrm...@apache.org>
> Sent: Thursday, 22 July 2021 11:33
> To: LINZ, Arnaud <al...@bouyguestelecom.fr>
> Cc: Gen Luo <luogen...@gmail.com>; Yang Wang <danrtsey...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
>
> Thanks for your input, Gen and Arnaud.
>
> I do agree with you, Gen, that we need better guidance for our users on when to change the heartbeat configuration. I think this should happen in any case. I am, however, not so sure whether we can give a hard threshold such as 5000 tasks because, as Arnaud said, it strongly depends on the workload. Maybe we can instead describe the symptoms a user might experience and what to do about them.
>
> Concerning your workloads, Arnaud, I'd be interested to learn a bit more. The user code runs in its own thread, so its operations won't block the main thread and its heartbeats. The only thing that can happen is that the user code starves the heartbeat thread of CPU cycles or causes a lot of GC pauses. If you are observing the former, then we might think about changing the priorities of the respective threads. This should improve Flink's stability for these workloads, and a shorter heartbeat timeout should then be possible.
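The thread-priority idea mentioned above can be sketched in plain Java. This is only an illustration of the mechanism, not Flink's actual runtime code; the class and method names are invented:

```java
// Sketch of the "raise the heartbeat thread's priority" idea discussed
// above. NOT Flink's actual code; names are invented for illustration.
public class HeartbeatPrioritySketch {

    // Create a daemon thread for heartbeats and hint the scheduler to
    // prefer it over CPU-bound user-code threads.
    public static Thread newHeartbeatThread(Runnable sendHeartbeat) {
        Thread t = new Thread(sendHeartbeat, "heartbeat-sender");
        t.setDaemon(true);
        // A priority is only a hint: the JVM maps it to an OS priority
        // (or ignores it), and it does not help against GC pauses,
        // which stop all threads regardless of priority.
        t.setPriority(Thread.MAX_PRIORITY);
        return t;
    }

    public static void main(String[] args) {
        Thread hb = newHeartbeatThread(() -> { /* send heartbeat RPC */ });
        System.out.println(hb.getPriority()); // prints 10 (MAX_PRIORITY)
    }
}
```

Note the caveat in the comments: priorities mitigate CPU starvation by user code, but do nothing for the GC-pause case, which is why the two causes are distinguished in the paragraph above.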
> Also, for the RAM-cached repositories: what exactly is causing the heartbeat to time out? Is it because you have a lot of GC, or because the heartbeat thread does not get enough CPU cycles?
>
> Cheers,
> Till
>
> On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
> Hello,
>
> From a user perspective: we have some (rare) use cases where we work with “coarse-grained” datasets, with big beans and tasks that perform lengthy operations (such as ML training). In these cases we had to increase the timeout to huge values (heartbeat.timeout: 500000) so that our app is not killed.
> I’m aware this is not the way Flink was meant to be used, but it’s a convenient way to distribute our workload on datanodes without having to use another concurrency framework (such as M/R) that would require recoding the sources and sinks.
>
> In other (more common) cases, our tasks do R/W accesses to RAM-cached repositories backed by a key-value store such as Kudu (or HBase). While most of those calls are very fast, when the system is under heavy load they may sometimes block for more than a few seconds, and having our app killed because of a short timeout is not an option.
>
> That’s why I’m not in favor of very short timeouts… because in my experience it really depends on what the user code does in the tasks.
> (I understand that normally, since user code is not a JVM-blocking activity such as a GC, it should have no impact on heartbeats, but in my experience it really does.)
>
> Cheers,
> Arnaud
>
> From: Gen Luo <luogen...@gmail.com>
> Sent: Thursday, 22 July 2021 05:46
> To: Till Rohrmann <trohrm...@apache.org>
> Cc: Yang Wang <danrtsey...@gmail.com>; dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values
>
> Hi,
> Thanks for driving this, @Till Rohrmann. I would give a +1 on reducing the heartbeat timeout and interval, though I'm not sure whether 15s and 3s would be enough either.
>
> IMO, apart from the standalone cluster, where Flink relies entirely on its own heartbeat mechanism, reducing the heartbeat timeout can also help the JM detect faster those TaskExecutors that are in an abnormal condition and cannot respond to heartbeat requests, e.g. continuous full GC, where the TaskExecutor process is still alive and the deployment system may not notice the problem. Since there are cases that can benefit from this change, I think it could be done as long as it doesn't degrade the experience in other scenarios.
>
> If we can identify what blocks the main threads from processing heartbeats, or what increases the GC cost, we can try to remove those causes to get a more predictable heartbeat response time, or give some advice to users whose jobs may run into these issues. For example, as far as I know, the JM of a large-scale job is busier and may not be able to process heartbeats in time, so we could advise users running jobs with more than 5000 tasks to enlarge their heartbeat interval to 10s and timeout to 50s. (These numbers are written casually.)
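Advice of that kind would amount to a flink-conf.yaml override along these lines. The values below simply mirror the casual example from the message and are not validated guidance; note also that older Flink versions expect these options in milliseconds (e.g. `heartbeat.timeout: 50000`):

```yaml
# flink-conf.yaml -- illustrative override for a very large job
# (e.g. more than 5000 tasks); tune per workload.
heartbeat.interval: 10 s
heartbeat.timeout: 50 s
```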
> As for the issue in FLINK-23216, I think it should be fixed, but it may not be a main concern for this case.
>
> On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <trohrm...@apache.org> wrote:
> Thanks for sharing these insights.
>
> I think it is no longer true that the ResourceManager notifies the JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
>
> Given the GC pauses, would you then be OK with decreasing the heartbeat timeout to 20 seconds? This should give enough time to do the GC and still send/receive a heartbeat request.
>
> I also wanted to add that we are about to remove one big cause of blocking I/O operations from the main thread. With FLINK-22483 [2] we will get rid of the filesystem accesses used to retrieve completed checkpoints. This leaves us with one additional filesystem access from the main thread, the one that completes a pending checkpoint. I think it should be possible to get rid of this access because, as Stephan said, it only writes information to disk that has already been written before. Maybe solving these two issues could ease concerns about long pauses of unresponsiveness in Flink.
>
> [1] https://issues.apache.org/jira/browse/FLINK-23216
> [2] https://issues.apache.org/jira/browse/FLINK-22483
>
> Cheers,
> Till
>
> On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <danrtsey...@gmail.com> wrote:
> Thanks, @Till Rohrmann, for starting this discussion.
>
> Firstly, I tried to understand the benefit of a shorter heartbeat timeout. IIUC, it will make the JobManager aware of lost TaskManagers faster. However, it seems that only the standalone cluster could benefit from this. For YARN and native Kubernetes deployments, the Flink ResourceManager should get the TaskManager-lost event within a very short time:
> * About 8 seconds on YARN: 3s for YARN NM -> YARN RM, plus 5s for YARN RM -> Flink RM
> * Less than 1 second on native Kubernetes: the Flink RM has a watch on all the TaskManager pods
>
> Secondly, I am not very confident about decreasing the timeout to 15s. I quickly checked the TaskManager GC logs of the past week of our internal Flink workloads and found more than 100 full GCs of around 10 seconds, but none longer than 15s. We are using the CMS GC for the old generation.
>
> Best,
> Yang
>
> On Sat, 17 Jul 2021 at 1:05 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi everyone,
>
> Since Flink 1.5 we have had the same heartbeat timeout and interval default values, defined as heartbeat.timeout: 50s and heartbeat.interval: 10s. These values were mainly chosen to compensate for lengthy GC pauses and blocking operations that were executed in the main threads of Flink's components. Since then, there have been quite some advancements w.r.t. the JVM's GCs, and we have also removed a lot of blocking calls from the main thread. Moreover, a long heartbeat.timeout causes long recovery times in case of a TaskManager loss, because the system can only properly recover after the dead TaskManager has been removed from the scheduler. Hence, I wanted to propose changing the timeout and interval to:
>
> heartbeat.timeout: 15s
> heartbeat.interval: 3s
>
> Since there is no perfect solution that fits all use cases, I would really like to hear what you think about it and how you configure these heartbeat options. Based on your experience we might actually come up with better default values that allow us to be resilient but also to detect failed components fast. FLIP-185 can be found here [1].
> [1] https://cwiki.apache.org/confluence/x/GAoBCw
>
> Cheers,
> Till

--
Thanks & Regards,
Bhagi