Hi Gyula,

Could you share the logs in the ML? Or is there a Jira issue I missed?
Matthias

On Wed, May 17, 2023 at 9:33 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey Devs!
>
> I am bumping this thread to see if someone has any ideas how to go about
> solving this.
>
> Yang Wang earlier had this comment but I am not sure how to proceed:
>
> "From the logs you have provided, I find a potential bug in the current
> leader retrieval. In DefaultLeaderRetrievalService, if the leader
> information does not change, we will not notify the listener. It is indeed
> correct in almost all scenarios and could save some heavy follow-up
> operations. But in the current case, it might be the root cause. For TM1,
> we added 00000000000000000000000000000002 for job leader monitoring at
> 2023-01-18 05:31:23,848. However, we never get the next expected log
> “Resolved JobManager address, beginning registration”. That is just because
> the leader information does not change. So TM1 got stuck waiting for the
> leader and never registered with the JM. Finally, the job failed with not
> enough slots."
>
> I wonder if someone could confirm the current behaviour.
>
> Thanks
> Gyula
>
> On Mon, Jan 23, 2023 at 4:06 PM Tamir Sagi <tamir.s...@niceactimize.com>
> wrote:
>
>> Hey Gyula,
>>
>> We encountered similar issues recently. Our Flink streaming application
>> clusters (v1.15.2) are running in AWS EKS.
>>
>> 1. TM gets disconnected sporadically and never returns:
>>
>> org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with
>> id aml-rule-eval-stream-taskmanager-1-1 is no longer reachable.
>>     at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyTargetUnreachable(JobMaster.java:1387)
>>     at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.reportHeartbeatRpcFailure(HeartbeatMonitorImpl.java:123)
>>
>> heartbeat.timeout is set to 15 minutes.
>>
>> There are some heartbeat updates on the Flink web UI.
>>
>> There are not enough logs about it and no indication of OOM whatsoever
>> within k8s. However, we increased the TMs' memory, and the issue seems to
>> be resolved for now (yet, it might hide a bigger issue).
>>
>> The 2nd issue is regarding 'NoResourceAvailableException' with the
>> following error message:
>> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout (enclosed log files).
>>
>> I also found this unresolved ticket [1] with a suggestion by @Yang Wang
>> <danrtsey...@gmail.com> which seems to be working so far.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-25649
>>
>> Any thoughts?
>>
>> Thanks,
>> Tamir.
>>
>> ------------------------------
>> *From:* Gyula Fóra <gyula.f...@gmail.com>
>> *Sent:* Sunday, January 22, 2023 12:43 AM
>> *To:* user <u...@flink.apache.org>
>> *Subject:* Job stuck in CREATED state with scheduling failures
>>
>> *EXTERNAL EMAIL*
>>
>> Hi Devs!
>>
>> We noticed a very strange failure scenario a few times recently with the
>> Native Kubernetes integration.
>>
>> The issue is triggered by a heartbeat timeout (a temporary network
>> problem). We observe the following behaviour:
>>
>> ===================================
>> 3 pods (1 JM, 2 TMs), Flink 1.15 (Kubernetes Native Integration):
>>
>> 1. Temporary network problem
>> - Heartbeat failure: TM1 loses the JM connection and the JM loses the TM1
>> connection.
>> - Both the JM and TM1 trigger the job failure on their sides and cancel
>> the tasks.
>> - The JM releases TM1's slots.
>>
>> 2. While failing/cancelling the job, the network connection recovers and
>> TM1 reconnects to the JM:
>> *TM1: Resolved JobManager address, beginning registration*
>>
>> 3. The JM tries to resubmit the job using TM1 + TM2, but the scheduler
>> keeps failing as it cannot seem to allocate all the resources:
>>
>> *NoResourceAvailableException: Slot request bulk is not fulfillable!
>> Could not allocate the required slot within slot request timeout*
>>
>> On TM1 we see the following logs repeating (multiple times every few
>> seconds until the slot request times out after 5 minutes):
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Receive slot request ... for job ... from resource manager with leader
>> id ...*
>> *Allocated slot for ...*
>> *Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{...}, allocationId: ..., jobId: ...).*
>>
>> While all this is happening on TM1, we don't see any allocation-related
>> INFO logs on TM2.
>> ===================================
>>
>> It seems like something weird happens when TM1 reconnects after the
>> heartbeat loss. I feel that the JM should probably shut down the TM and
>> create a new one. But instead it gets stuck.
>>
>> Any ideas what could be happening here?
>>
>> Thanks
>> Gyula
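[Editor's note] As a rough illustration of the behaviour Yang Wang describes above (a leader retrieval service that remembers the last leader it reported and skips notifying its listener when nothing changed), here is a minimal, self-contained Java sketch. It is not Flink's actual DefaultLeaderRetrievalService; the class, interface, method names and addresses below are invented for illustration only.

import java.util.Objects;

// Simplified sketch, NOT Flink's real implementation: a retrieval service
// that de-duplicates leader notifications.
public class LeaderRetrievalSketch {

    interface Listener {
        void notifyLeaderAddress(String leaderAddress, String leaderSessionId);
    }

    private final Listener listener;
    private String lastAddress;
    private String lastSessionId;

    LeaderRetrievalSketch(Listener listener) {
        this.listener = listener;
    }

    // Called whenever leader information is observed in the HA backend.
    void onLeaderInformation(String address, String sessionId) {
        if (Objects.equals(address, lastAddress) && Objects.equals(sessionId, lastSessionId)) {
            // Leader unchanged: skip the (potentially heavy) listener callback.
            // If a re-registered component missed the earlier callback, it now
            // waits for a notification that never comes -- the suspected TM1 case.
            return;
        }
        lastAddress = address;
        lastSessionId = sessionId;
        listener.notifyLeaderAddress(address, sessionId);
    }

    public static void main(String[] args) {
        LeaderRetrievalSketch service = new LeaderRetrievalSketch(
                (addr, id) -> System.out.println("Resolved leader " + addr + " (" + id + ")"));
        service.onLeaderInformation("akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2", "2");
        // The same leader is reported again after the job is re-added for
        // monitoring; no second callback is delivered.
        service.onLeaderInformation("akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2", "2");
    }
}

The de-duplication itself is a sensible optimisation; the open question raised above is whether a listener that (re)registers after a reconnect can end up missing the only notification it would ever get.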
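[Editor's note] On Tamir's heartbeat.timeout remark: heartbeat.interval and heartbeat.timeout are ordinary Flink configuration options given in milliseconds, normally set in flink-conf.yaml or via -D dynamic properties. A small hedged sketch of setting them programmatically follows; the 15-minute timeout mirrors Tamir's setup and is not a recommendation.

import org.apache.flink.configuration.Configuration;

public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        // Equivalent to the flink-conf.yaml entries:
        //   heartbeat.interval: 10000
        //   heartbeat.timeout: 900000
        Configuration conf = new Configuration();
        conf.setString("heartbeat.interval", "10000");  // default: 10 s
        conf.setString("heartbeat.timeout", "900000");  // 15 minutes, as described above
        System.out.println(conf);
    }
}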