[ https://issues.apache.org/jira/browse/FLINK-24919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452237#comment-17452237 ]

Dawid Wysakowicz edited comment on FLINK-24919 at 12/2/21, 4:02 PM:
--------------------------------------------------------------------

Fixed in:
* master
** 5bf49f1dc9215e86758d002bc5a2ab82e738d3fa
* 1.14.1
** d26c0e511e9f37671b52c23df4c09e7aa3719d5a
* 1.13.4
** 704941c883727e9cf8ca3dd7ee6e6f23056a527e


was (Author: dawidwys):
    Fixed in:
* master
** 5bf49f1dc9215e86758d002bc5a2ab82e738d3fa

> UnalignedCheckpointITCase hangs on Azure
> ----------------------------------------
>
>                 Key: FLINK-24919
>                 URL: https://issues.apache.org/jira/browse/FLINK-24919
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0
>            Reporter: Piotr Nowojski
>            Assignee: Anton Kalashnikov
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> Extracted from FLINK-23466
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26304&view=logs&j=a57e0635-3fad-5b08-57c7-a4142d7d6fa9&t=2ef0effc-1da1-50e5-c2bd-aab434b1c5b7&l=13067
> Nov 10 16:13:03 Starting org.apache.flink.test.checkpointing.UnalignedCheckpointITCase#execute[pipeline with mixed channels, p = 20, timeout = 0, buffersPerChannel = 1].
> From the log, we can see this case hangs. I guess this seems a new issue which is different from the one reported in this ticket.
> From the stack, it seems there is something wrong with the checkpoint coordinator; the following thread locked 0x0000000087db4fb8:
> {code:java}
> 2021-11-10T17:14:21.0899474Z Nov 10 17:14:21 "jobmanager-io-thread-2" #12984 daemon prio=5 os_prio=0 tid=0x00007f12e000b800 nid=0x3fb6 runnable [0x00007f0fcd6d4000]
> 2021-11-10T17:14:21.0899924Z Nov 10 17:14:21    java.lang.Thread.State: RUNNABLE
> 2021-11-10T17:14:21.0900300Z Nov 10 17:14:21     at java.util.HashMap$TreeNode.balanceDeletion(HashMap.java:2338)
> 2021-11-10T17:14:21.0900745Z Nov 10 17:14:21     at java.util.HashMap$TreeNode.removeTreeNode(HashMap.java:2112)
> 2021-11-10T17:14:21.0901146Z Nov 10 17:14:21     at java.util.HashMap.removeNode(HashMap.java:840)
> 2021-11-10T17:14:21.0901577Z Nov 10 17:14:21     at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:301)
> 2021-11-10T17:14:21.0902002Z Nov 10 17:14:21     at java.util.HashMap.putVal(HashMap.java:664)
> 2021-11-10T17:14:21.0902531Z Nov 10 17:14:21     at java.util.HashMap.putMapEntries(HashMap.java:515)
> 2021-11-10T17:14:21.0902931Z Nov 10 17:14:21     at java.util.HashMap.putAll(HashMap.java:785)
> 2021-11-10T17:14:21.0903429Z Nov 10 17:14:21     at org.apache.flink.runtime.checkpoint.ExecutionAttemptMappingProvider.getVertex(ExecutionAttemptMappingProvider.java:60)
> 2021-11-10T17:14:21.0904060Z Nov 10 17:14:21     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.reportStats(CheckpointCoordinator.java:1867)
> 2021-11-10T17:14:21.0904686Z Nov 10 17:14:21     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1152)
> 2021-11-10T17:14:21.0905372Z Nov 10 17:14:21     - locked <0x0000000087db4fb8> (a java.lang.Object)
> 2021-11-10T17:14:21.0905895Z Nov 10 17:14:21     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
> 2021-11-10T17:14:21.0906493Z Nov 10 17:14:21     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1368/705813936.accept(Unknown Source)
> 2021-11-10T17:14:21.0907086Z Nov 10 17:14:21     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
> 2021-11-10T17:14:21.0907698Z Nov 10 17:14:21     at org.apache.flink.runtime.scheduler.ExecutionGraphHandler$$Lambda$1369/1447418658.run(Unknown Source)
> 2021-11-10T17:14:21.0908210Z Nov 10 17:14:21     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2021-11-10T17:14:21.0908735Z Nov 10 17:14:21     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2021-11-10T17:14:21.0909333Z Nov 10 17:14:21     at java.lang.Thread.run(Thread.java:748)
> {code}
> But another thread is waiting for the lock. I am not familiar with this logic and not sure whether this is the right state. Could anyone who is familiar with this logic take a look?
>
> BTW, concurrent access to a HashMap may cause an infinite loop. I see in the stack that multiple threads are accessing a HashMap, though I am not sure whether they are the same instance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
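
The HashMap remark in the quoted description can be illustrated with a small, self-contained sketch. It is not the change committed for this ticket and does not reproduce ExecutionAttemptMappingProvider; the class and method names (BoundedCacheSketch, newBoundedCache, hammer) are hypothetical. It assumes a size-bounded LinkedHashMap, which is what the afterNodeInsertion -> removeNode frames in the trace suggest, and shows one generic way to make such a map safe when several threads insert at once: route every access through a single lock via Collections.synchronizedMap.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; this is not Flink's ExecutionAttemptMappingProvider. */
class BoundedCacheSketch {

    /** Builds a map that evicts its eldest entry once it holds more than maxSize entries. */
    static <K, V> Map<K, V> newBoundedCache(int maxSize) {
        Map<K, V> bounded = new LinkedHashMap<K, V>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Invoked by LinkedHashMap.afterNodeInsertion after each insertion.
                return size() > maxSize;
            }
        };
        // Every call on the returned view synchronizes on one shared mutex,
        // so insertions and evictions never interleave.
        return Collections.synchronizedMap(bounded);
    }

    /** Inserts distinct keys from several threads and returns the final cache size. */
    static int hammer(int threads, int insertsPerThread, int maxSize) {
        Map<Integer, Integer> cache = newBoundedCache(maxSize);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t * insertsPerThread;
            pool.submit(() -> {
                for (int i = 0; i < insertsPerThread; i++) {
                    cache.put(offset + i, i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        // 8 writers, 10,000 distinct keys each, at most 64 entries retained.
        System.out.println(hammer(8, 10_000, 64));
    }
}
```

Without the synchronized wrapper, the same workload mutates the map's internal buckets and linked list from several threads at once, which the JDK documents as unsupported for HashMap and LinkedHashMap. Whether that is what happened in this ticket is exactly the open question raised in the description.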