[ https://issues.apache.org/jira/browse/FLINK-20099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230469#comment-17230469 ]
Jiayi Liao commented on FLINK-20099: ------------------------------------ Is this a duplicate issue of https://issues.apache.org/jira/browse/FLINK-16753? > HeapStateBackend checkpoint error hidden under cryptic message > -------------------------------------------------------------- > > Key: FLINK-20099 > URL: https://issues.apache.org/jira/browse/FLINK-20099 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / State Backends > Affects Versions: 1.11.2 > Reporter: Nico Kruber > Priority: Major > Labels: usability > Attachments: Screenshot_20201112_001331.png > > > When the memory state back-end hits a certain size, it fails to permit > checkpoints. Even though a very detailed exception is thrown at its source, > this is neither logged nor shown in the UI: > * Logs just contain: > {code:java} > 00:06:41.462 [jobmanager-future-thread-14] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline > checkpoint 2 by task 8eb303cd3196310cb2671212f4ed013c of job > c9b7a410bd3143864ca23ba89595d878 at 6a73bcf2-46b6-4735-a616-fdf09ff1471c @ > localhost (dataPort=-1). > {code} > * UI: (also see the attached Screenshot_20201112_001331.png) > {code:java} > Failure Message: The job has failed. > {code} > -> this isn't even true: the job is still running fine! > > Debugging into {{PendingCheckpoint#abort()}} reveals that the causing > exception is actually still in there but the detailed information from it is > just never used. > For reference, this is what is available there and should be logged or shown: > {code:java} > java.lang.Exception: Could not materialize checkpoint 2 for operator > aggregates -> (Sink: sink-agg-365, Sink: sink-agg-180, Sink: sink-agg-45, > Sink: sink-agg-30) (4/4). > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:191) > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:138) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Size > of the state is larger than the maximum permitted memory-backed state. > Size=6122737 , maxSize=5242880 . Consider using a different state backend, > like the File System State backend. > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:479) > at > org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:50) > at > org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:102) > ... 3 more > Caused by: java.io.IOException: Size of the state is larger than the maximum > permitted memory-backed state. Size=6122737 , maxSize=5242880 . Consider > using a different state backend, like the File System State backend. > at > org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory.checkSize(MemCheckpointStreamFactory.java:64) > at > org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetBytes(MemCheckpointStreamFactory.java:145) > at > org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetHandle(MemCheckpointStreamFactory.java:126) > at > org.apache.flink.runtime.state.CheckpointStreamWithResultProvider$PrimaryStreamOnly.closeAndFinalizeCheckpointStreamResult(CheckpointStreamWithResultProvider.java:77) > at > org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:199) > at > org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:158) > at > org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:75) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:476) > ... 5 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)