[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218851#comment-17218851 ] Arvid Heise commented on FLINK-19401: - Merged into release-1.11 as bff79f5efffccd1793e09a5e08d0ceb9fe90cf2 Merged into release-1.10 as 88c06e38d4eff06ac09e4141b988f9a561f286a4 Closing as resolved. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Assignee: Roman Khachatryan >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218541#comment-17218541 ] Arvid Heise commented on FLINK-19401: - Added to master as be448ecbd97ff9f79daac9ec7b6ec022d8c4e18d. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Assignee: Roman Khachatryan >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208502#comment-17208502 ] Roman Khachatryan commented on FLINK-19401: --- Thanks for the clarification [~stevenz3wu]. I looked at the code again and I think [~trohrmann] is right. The order of heartbeat timeouts is because TMs don't request heartbeats, only respond. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208437#comment-17208437 ] Steven Zhen Wu commented on FLINK-19401: [~roman_khachatryan] I don't know if repeated checkpoint recovery is the root/main cause or not. [~trohrmann] identified two problems during his investigation. This is one of the identified problem. Regarding the logs related to Titusm please ignore them. They are just noise. We haven't cleaned up our logs yet. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208335#comment-17208335 ] Roman Khachatryan commented on FLINK-19401: --- I do see that the same data is downloaded over and over, but I'm not sure whether it's the root cause. From the logs I see that heartbeat from *on JM* from TMs are expired first, and heartbeat *on TMs* expire two minutes later: {code:java} flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-16 | sort | uniq -c 66 2020-09-18 23:57 11 2020-09-19 01:06 flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-16 | sort | uniq -c 509 2020-09-18 23:59 2 2020-09-19 00:03 2 2020-09-19 00:32 2 2020-09-19 00:37 2 2020-09-19 01:06 {code} (77 vs 517 is due to parallelism level I think) Furthermore, the first 8 recoveries start after failures in Kafka: NPE in NetworkClient and "SSLProtocolException: Handshake message sequence violation, 2". Then they happen after Kafka failures and the aforementioned TM heartbeat timeouts on JM. So it looks like that the root cause is misconfiguration and/or network. WDYT [~trohrmann], [~stevenz3wu] ? Steven, can you please also explain what does this mean: {code:java} $ grep 72cc604b-dcd5-4c75-a466-0321d7c51c3e 2020-09-18 23:55:43,804 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60067 ms. 2020-09-18 23:56:11,256 Deploying Source: ee_clevent_presentation -> ee_clevent_presentationRaw -> cleventPresentation -> Flat Map (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:11,256 Deploying Source: k2-logtracevpc -> k2-logtracevpcRaw -> (Filter -> cybertronLogReplicated, Filter -> cybertronLogReplicatedCrossRegion, gpsRequestPivotAudit, gpsRequestPivotAuditCrossRegion) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:11,261 Deploying Source: k2-ee_clevent -> k2-ee_cleventRaw -> (cleventAddToPlaylistCommand -> Flat Map, cleventThumbRating -> Flat Map) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:11,261 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: cybertron_joiner_client_input_0, Sink: cybertron_joiner_client_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:11,261 Deploying Filter -> Process -> (Sink: cybertron_joiner_server_input_0, Sink: cybertron_joiner_server_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:56:43,872 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered. 2020-09-18 23:57:43,940 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60068 ms. 2020-09-18 23:58:44,008 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered. 2020-09-18 23:59:11,794 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (488/640) (attempt #3) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:59:27,101 Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (478/640) (attempt #2) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833) 2020-09-18 23:59:44,075 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60066 ms. 2020-09-19 00:00:44,144 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered. 2020-09-19 00:01:44,634 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60489 ms. 2020-09-19 00:02:44,710 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered. 2020-09-19 00:03:44,799 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60088 ms. 2020-09-19 00:04:44,871 Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered. 2020-09-19 00:05:44,938 Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204827#comment-17204827 ] Till Rohrmann commented on FLINK-19401: --- No, the 3 operator chains are recovered together. However, we do this operation for every subtask index which means that we read the checkpoints {{parallelism}} times. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204732#comment-17204732 ] Steven Zhen Wu commented on FLINK-19401: [~trohrmann] just to clarify, do you mean operator chain caused the repeated checkpoint recoveries? i.e. if an operator chain has 3 operators, we are repeating checkpoint recoveries 3 times no matter the 3 operators has state or not. In this job, only two Kafka sources and the co-process functions have state. It has 3 operator chains as Till described (two operator chains for source, one operator chain for co-process+sink). > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at >
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204503#comment-17204503 ] Till Rohrmann commented on FLINK-19401: --- In the logs it looks as if you also have an embarrassingly parallel part: Several sources which are chained with a flat map and then two filters which are chained with a process function and sinks. I believe that this part causes the repeated checkpoint recoveries. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at
[jira] [Commented] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204031#comment-17204031 ] Steven Zhen Wu commented on FLINK-19401: [~trohrmann] thanks a lot for looking into the problem. Just a small clarification on the job, it is a stream join job with keyBy and co-process functions. > Job stuck in restart loop due to excessive checkpoint recoveries which block > the JobMaster > -- > > Key: FLINK-19401 > URL: https://issues.apache.org/jira/browse/FLINK-19401 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.1, 1.11.2 >Reporter: Steven Zhen Wu >Priority: Critical > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > Flink job sometimes got into a restart loop for many hours and can't recover > until redeployed. We had some issue with Kafka that initially caused the job > to restart. > Below is the first of the many exceptions for "ResourceManagerException: > Could not find registered job manager" error. > {code} > 2020-09-19 00:03:31,614 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP > rofile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,615 INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl > [flink-akka.actor.default-dispatcher-35973] - Requesting new slot > [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile > ResourceProfile{UNKNOWN} from resource manager. > 2020-09-19 00:03:31,620 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph > [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> > k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> > Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED > to FAILED. > java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > No pooled slot available and request to ResourceManager for new slot failed > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) > at > java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) > at > org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) > at > org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) > at > org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) > at >