Zhipeng Zhang created FLINK-31903:
-------------------------------------

             Summary: Caching records fails in BroadcastUtils#withBroadcastStream
                 Key: FLINK-31903
                 URL: https://issues.apache.org/jira/browse/FLINK-31903
             Project: Flink
          Issue Type: Bug
          Components: Library / Machine Learning
    Affects Versions: ml-2.3.0
            Reporter: Zhipeng Zhang
When caching more than 1,000,000 records using BroadcastUtils#withBroadcastStream, checkpointing repeatedly fails and the job aborts with the following exception:
{code:java}
Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:206)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:191)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:124)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2078)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1038)
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
It seems that the bug comes from caching too many records when AbstractBroadcastWrapperOperator#snapshot is called.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
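The report does not include a reproduction. A minimal sketch of a job that should hit this path is below; the BroadcastUtils.withBroadcastStream signature is from Flink ML, but the job shape, the broadcast-input name "bc", and the class name WithBroadcastRepro are assumptions for illustration (running it requires the Flink and Flink ML runtime on the classpath):

```java
import java.util.Collections;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.ml.common.broadcast.BroadcastUtils;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WithBroadcastRepro {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing must be enabled for the tolerable-failure threshold to trigger.
        env.enableCheckpointing(1000);

        // A large non-broadcast input: the wrapper operator caches its records
        // while waiting for the broadcast variables to become ready, and the
        // failure reportedly appears once more than 1,000,000 records are cached.
        DataStream<Long> largeInput = env.fromSequence(0L, 1_000_000L);
        DataStream<Long> broadcastInput = env.fromSequence(0L, 10L);

        DataStream<Long> result =
                BroadcastUtils.withBroadcastStream(
                        Collections.singletonList(largeInput),
                        Collections.singletonMap("bc", broadcastInput),
                        inputs -> {
                            @SuppressWarnings("unchecked")
                            DataStream<Long> data = (DataStream<Long>) inputs.get(0);
                            // Identity map: the exception comes from the wrapper's
                            // snapshot path, not from user code.
                            return data.map(
                                    new MapFunction<Long, Long>() {
                                        @Override
                                        public Long map(Long value) {
                                            return value;
                                        }
                                    });
                        });

        result.executeAndCollect().forEachRemaining(x -> {});
    }
}
```

With checkpointing enabled, the snapshot taken while the cached records are being spilled is what fails, which matches the trace pointing at AbstractBroadcastWrapperOperator#snapshot.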