[jira] [Updated] (FLINK-19401) Job stuck in restart loop due to "Could not find registered job manager"

2020-09-29 Thread Till Rohrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann updated FLINK-19401:
--
Component/s: (was: Runtime / Coordination)
 Runtime / Checkpointing

> Job stuck in restart loop due to "Could not find registered job manager"
> 
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1
>Reporter: Steven Zhen Wu
>Priority: Major
>
> Flink job sometimes got into a restart loop for many hours and can't recover 
> until redeployed. We had some issue with Kafka that initially caused the job 
> to restart. 
> Below is the first of the many exceptions for "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
> at 
> org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
> at 
> org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104)
> at 
> java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
> at 
> 

[jira] [Updated] (FLINK-19401) Job stuck in restart loop due to "Could not find registered job manager"

2020-09-24 Thread Steven Zhen Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Zhen Wu updated FLINK-19401:
---
Description: 
Flink job sometimes got into a restart loop for many hours and can't recover 
until redeployed. We had some issue with Kafka that initially caused the job to 
restart. 

Below is the first of the many exceptions for "ResourceManagerException: Could 
not find registered job manager" error.
{code}
2020-09-19 00:03:31,614 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
[flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
[SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
rofile{UNKNOWN} from resource manager.
2020-09-19 00:03:31,615 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
[flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
[SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
ResourceProfile{UNKNOWN} from resource manager.
2020-09-19 00:03:31,615 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
[flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
[SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
ResourceProfile{UNKNOWN} from resource manager.
2020-09-19 00:03:31,615 INFO  
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
[flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
[SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
ResourceProfile{UNKNOWN} from resource manager.
2020-09-19 00:03:31,620 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   
[flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
to FAILED.
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No 
pooled slot available and request to ResourceManager for new slot failed
at 
org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at 
java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
at 
java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
at 
org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
at 
org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
at 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
at 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
at 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160)
at 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143)
at 
org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113)
at 
org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115)
at 
org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104)
at 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
at 
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
at 
org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.allocateSlotsFor(DefaultExecutionSlotAllocator.java:102)
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlots(DefaultScheduler.java:342)
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlotsAndDeploy(DefaultScheduler.java:311)
at 
org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.allocateSlotsAndDeploy(EagerSchedulingStrategy.java:76)
at 
org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.restartTasks(EagerSchedulingStrategy.java:57)
at