Hi Gyula,

Could you share the logs in the ML? Or is there a Jira issue I missed?
Matthias

On Wed, May 17, 2023 at 9:33 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey Devs!
>
> I am bumping this thread to see if someone has any ideas how to go about
> solving this.
>
> Yang Wang earlier had this comment but I am not sure how to proceed:
>
> "From the logs you have provided, I find a potential bug in the current
> leader retrieval. In DefaultLeaderRetrievalService, if the leader
> information does not change, we will not notify the listener. It is indeed
> correct in almost all scenarios and could save some heavy follow-up
> operations. But in the current case, it might be the root cause. For TM1,
> we added 00000000000000000000000000000002 for job leader monitoring at
> 2023-01-18 05:31:23,848. However, we never get the next expected log
> “Resolved JobManager address, beginning registration”. That is just because
> the leader information does not change. So TM1 got stuck waiting for the
> leader and never registered with the JM. Finally, the job failed with not
> enough slots."
>
> I wonder if someone could confirm the current behaviour.
>
> Thanks
> Gyula
>
> On Mon, Jan 23, 2023 at 4:06 PM Tamir Sagi <tamir.s...@niceactimize.com>
> wrote:
>
>> Hey Gyula,
>>
>> We encountered similar issues recently. Our Flink streaming application
>> clusters (v1.15.2) are running in AWS EKS.
>>
>> 1. TM gets disconnected sporadically and never returns:
>>
>> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with
>> id aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.
>>     at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)
>>     at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)
>>
>> heartbeat.timeout is set to 15 minutes.
>>
>> There are some heartbeat updates on the Flink web UI.
>>
>> There are not enough logs about it and no indication of OOM whatsoever
>> within k8s. However, we increased the TMs' memory, and the issue seems to
>> be resolved for now (yet, it might hide a bigger issue).
>>
>> The 2nd issue is regarding 'NoResourceAvailableException' with the
>> following error message:
>> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout (enclosed log files).
>>
>> I also found this unresolved ticket [1] with a suggestion by @Yang Wang
>> <danrtsey...@gmail.com> which seems to be working so far.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-25649
>>
>> Any thoughts?
>>
>> Thanks,
>> Tamir.
>>
>> ------------------------------
>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>> *Sent:* Sunday, January 22, 2023 12:43 AM
>> *To:* user <u...@flink.apache.org>
>> *Subject:* Job stuck in CREATED state with scheduling failures
>>
>> *EXTERNAL EMAIL*
>>
>> Hi Devs!
>>
>> We noticed a very strange failure scenario a few times recently with the
>> Native Kubernetes integration.
>>
>> The issue is triggered by a heartbeat timeout (a temporary network
>> problem). We observe the following behaviour:
>>
>> ===================================
>> 3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):
>>
>> 1. Temporary network problem
>> - Heartbeat failure: TM1 loses the JM connection and the JM loses the TM1
>> connection.
>> - Both the JM and TM1 trigger the job failure on their sides and cancel
>> the tasks.
>> - The JM releases TM1's slots.
>>
>> 2. While failing/cancelling the job, the network connection recovers and
>> TM1 reconnects to the JM:
>> *TM1: Resolved JobManager address, beginning registration*
>>
>> 3. The JM tries to resubmit the job using TM1 + TM2, but the scheduler
>> keeps failing as it cannot seem to allocate all the resources:
>>
>> *NoResourceAvailableException: Slot request bulk is not fulfillable!
>> Could not allocate the required slot within slot request timeout*
>>
>> On TM1 we see the following logs repeating (multiple times every few
>> seconds until the slot request times out after 5 minutes):
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{...}, allocationId: ..., jobId: ...).*
>>
>> While all this is happening on TM1, we don't see any allocation-related
>> INFO logs on TM2.
>> ===================================
>>
>> It seems like something weird happens when TM1 reconnects after the
>> heartbeat loss. I feel that the JM should probably shut down the TM and
>> create a new one. But instead it gets stuck.
>>
>> Any ideas what could be happening here?
>>
>> Thanks
>> Gyula
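[Editor's note] As a rough illustration of the behaviour Yang Wang describes above (a leader retrieval service that remembers the last leader it reported and skips notifying its listener when nothing changed), here is a minimal, self-contained Java sketch. It is not Flink's actual DefaultLeaderRetrievalService; the class, interface, method names and addresses below are invented for illustration only.

import java.util.Objects;

// Simplified sketch, NOT Flink's real implementation: a retrieval service
// that de-duplicates leader notifications.
public class LeaderRetrievalSketch {

    interface Listener {
        void notifyLeaderAddress(String leaderAddress, String leaderSessionId);
    }

    private final Listener listener;
    private String lastAddress;
    private String lastSessionId;

    LeaderRetrievalSketch(Listener listener) {
        this.listener = listener;
    }

    // Called whenever leader information is observed in the HA backend.
    void onLeaderInformation(String address, String sessionId) {
        if (Objects.equals(address, lastAddress) && Objects.equals(sessionId, lastSessionId)) {
            // Leader unchanged: skip the (potentially heavy) listener callback.
            // If a re-registered component missed the earlier callback, it now
            // waits for a notification that never comes -- the suspected TM1 case.
            return;
        }
        lastAddress = address;
        lastSessionId = sessionId;
        listener.notifyLeaderAddress(address, sessionId);
    }

    public static void main(String[] args) {
        LeaderRetrievalSketch service = new LeaderRetrievalSketch(
                (addr, id) -> System.out.println("Resolved leader " + addr + " (" + id + ")"));
        service.onLeaderInformation("akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2", "2");
        // The same leader is reported again after the job is re-added for
        // monitoring; no second callback is delivered.
        service.onLeaderInformation("akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2", "2");
    }
}

The de-duplication itself is a sensible optimisation; the open question raised above is whether a listener that (re)registers after a reconnect can end up missing the only notification it would ever get.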
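[Editor's note] On Tamir's heartbeat.timeout remark: heartbeat.interval and heartbeat.timeout are ordinary Flink configuration options given in milliseconds, normally set in flink-conf.yaml or via -D dynamic properties. A small hedged sketch of setting them programmatically follows; the 15-minute timeout mirrors Tamir's setup and is not a recommendation.

import org.apache.flink.configuration.Configuration;

public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        // Equivalent to the flink-conf.yaml entries:
        //   heartbeat.interval: 10000
        //   heartbeat.timeout: 900000
        Configuration conf = new Configuration();
        conf.setString("heartbeat.interval", "10000");  // default: 10 s
        conf.setString("heartbeat.timeout", "900000");  // 15 minutes, as described above
        System.out.println(conf);
    }
}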