subject:"\[jira\] \[Commented\] \(FLINK\-19401\) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster"

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-22 Thread Arvid Heise (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218851#comment-17218851
 ] 

Arvid Heise commented on FLINK-19401:
-

Merged into release-1.11 as bff79f5efffccd1793e09a5e08d0ceb9fe90cf2
Merged into release-1.10 as 88c06e38d4eff06ac09e4141b988f9a561f286a4
Closing as resolved.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-21 Thread Arvid Heise (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218541#comment-17218541
 ] 

Arvid Heise commented on FLINK-19401:
-

Added to master as be448ecbd97ff9f79daac9ec7b6ec022d8c4e18d.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Assignee: Roman Khachatryan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-06 Thread Roman Khachatryan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208502#comment-17208502
 ] 

Roman Khachatryan commented on FLINK-19401:
---

Thanks for the clarification [~stevenz3wu].

 

I looked at the code again and I think [~trohrmann] is right. The order of 
heartbeat timeouts is because TMs don't request heartbeats, only respond.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-05 Thread Steven Zhen Wu (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208437#comment-17208437
 ] 

Steven Zhen Wu commented on FLINK-19401:


[~roman_khachatryan] I don't know if repeated checkpoint recovery is the 
root/main cause or not. [~trohrmann] identified two problems during his 
investigation. This is one of the identified problem. 

Regarding the logs related to Titusm please ignore them. They are just noise. 
We haven't cleaned up our logs yet.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-05 Thread Roman Khachatryan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208335#comment-17208335
 ] 

Roman Khachatryan commented on FLINK-19401:
---

I do see that the same data is downloaded over and over, but I'm not sure 
whether it's the root cause.
 From the logs I see that heartbeat from *on JM* from TMs are expired first, 
and heartbeat *on TMs* expire two minutes later:
{code:java}
flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-16 
| sort | uniq -c
 66 2020-09-18 23:57
 11 2020-09-19 01:06
flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-16 
| sort | uniq -c
509 2020-09-18 23:59
  2 2020-09-19 00:03
  2 2020-09-19 00:32
  2 2020-09-19 00:37
  2 2020-09-19 01:06
{code}
(77 vs 517 is due to parallelism level I think)

 

Furthermore, the first 8 recoveries start after failures in Kafka: NPE in 
NetworkClient and "SSLProtocolException: Handshake message sequence violation, 
2".

Then they happen after Kafka failures and the aforementioned TM heartbeat 
timeouts on JM.

 

So it looks like that the root cause is misconfiguration and/or network.

WDYT [~trohrmann], [~stevenz3wu] ?

 

Steven, can you please also explain what does this mean:
{code:java}
$ grep 72cc604b-dcd5-4c75-a466-0321d7c51c3e

2020-09-18 23:55:43,804 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60067 ms.
2020-09-18 23:56:11,256 Deploying Source: ee_clevent_presentation -> 
ee_clevent_presentationRaw -> cleventPresentation -> Flat Map (510/640) 
(attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,256 Deploying Source: k2-logtracevpc -> k2-logtracevpcRaw 
-> (Filter -> cybertronLogReplicated, Filter -> 
cybertronLogReplicatedCrossRegion, gpsRequestPivotAudit, 
gpsRequestPivotAuditCrossRegion) (510/640) (attempt #1) to 
1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
(dataPort=39833)
2020-09-18 23:56:11,261 Deploying Source: k2-ee_clevent -> k2-ee_cleventRaw -> 
(cleventAddToPlaylistCommand -> Flat Map, cleventThumbRating -> Flat Map) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> 
(PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (510/640) 
(attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: 
cybertron_joiner_client_input_0, Sink: cybertron_joiner_client_input_1) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: 
cybertron_joiner_server_input_0, Sink: cybertron_joiner_server_input_1) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:43,872 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
to TM termination pool because TM is not registered.
2020-09-18 23:57:43,940 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60068 ms.
2020-09-18 23:58:44,008 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
to TM termination pool because TM is not registered.
2020-09-18 23:59:11,794 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> 
(PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (488/640) 
(attempt #3) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:27,101 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> 
(PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (478/640) 
(attempt #2) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:44,075 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60066 ms.
2020-09-19 00:00:44,144 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
to TM termination pool because TM is not registered.
2020-09-19 00:01:44,634 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60489 ms.
2020-09-19 00:02:44,710 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
to TM termination pool because TM is not registered.
2020-09-19 00:03:44,799 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60088 ms.
2020-09-19 00:04:44,871 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
to TM termination pool because TM is not registered.
2020-09-19 00:05:44,938 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-09-30 Thread Till Rohrmann (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204827#comment-17204827
 ] 

Till Rohrmann commented on FLINK-19401:
---

No, the 3 operator chains are recovered together. However, we do this operation 
for every subtask index which means that we read the checkpoints 
{{parallelism}} times.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-09-30 Thread Steven Zhen Wu (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204732#comment-17204732
 ] 

Steven Zhen Wu commented on FLINK-19401:


[~trohrmann] just to clarify, do you mean operator chain caused the repeated 
checkpoint recoveries? i.e.  if an operator chain has 3 operators, we are 
repeating checkpoint recoveries 3 times no matter the 3 operators has state or 
not.

In this job, only two Kafka sources and the co-process functions have state. It 
has 3 operator chains as Till described (two operator chains for source, one 
operator chain for co-process+sink).

 

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-09-30 Thread Till Rohrmann (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204503#comment-17204503
 ] 

Till Rohrmann commented on FLINK-19401:
---

In the logs it looks as if you also have an embarrassingly parallel part: 
Several sources which are chained with a flat map and then two filters which 
are chained with a process function and sinks. I believe that this part causes 
the repeated checkpoint recoveries.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-09-29 Thread Steven Zhen Wu (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204031#comment-17204031
 ] 

Steven Zhen Wu commented on FLINK-19401:


[~trohrmann] thanks a lot for looking into the problem. Just a small 
clarification on the job, it is a stream join job with keyBy and co-process 
functions.

> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
>

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

9 matches

Site Navigation

Mail list logo

Footer information