Hi Gabor.
Thanks for chiming in. I think it is failing but I could be mistaken. There are
no errors in the log, everything looks fine. However, when I inspect the
_metadata file, I can see references to other files which are not present at
the given locations. Here is an example.
Flink.log (newest entries first):
2025-06-10 15:25:40.983
2025-06-10 13:25:40,983 INFO
org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Time taken
for Delete operation is: 0 ms with threads: 0
2025-06-10 15:25:40.983
2025-06-10 13:25:40,983 WARN
org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor [] - Disabling
threads for Delete operation as thread count 0 is <= 1
2025-06-10 15:25:40.936
2025-06-10 13:25:40,936 INFO
org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Marking
checkpoint 1425 as completed for source Source: Kafka source.
2025-06-10 15:25:40.936
2025-06-10 13:25:40,936 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 1425 for job 3acd203bc1b74b65803d14c9cad2df32 (3397 bytes,
checkpointDuration=134 ms, finalizationTime=188 ms).
2025-06-10 15:25:40.669
2025-06-10 13:25:40,669 INFO
org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream [] -
Cannot create recoverable writer due to Recoverable writers on AzureBlob are
only supported for ABFS, will use the ordinary writer.
2025-06-10 15:25:40.628
2025-06-10 13:25:40,628 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1425 (type=CheckpointType{name='Checkpoint',
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1749561940614 for job
3acd203bc1b74b65803d14c9cad2df32.
References in the _metadata file:
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/c628d0ed-bbdd-4edd-bfa5-c53c60da5d43
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/6eab4448-6080-4fef-8503-7342dc407b9c
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/685aa9e3-0260-4240-b5de-249f8d9a2683
wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints/3acd203bc1b74b65803d14c9cad2df32/chk-1425/9ebd7108-0073-4dd0-b047-56b69e21179b
So, if I understand things correctly, those 4 files should be in the
chk-1425 folder, but it contains only the _metadata file. This really is
all there is in the logs. The Task Manager is emitting some warnings about
metric name collisions, but those should be irrelevant.
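The comparison described above (references in _metadata versus files actually present in the chk-1425 folder) can be sketched as a small script. This is only a heuristic, not a parser for Flink's metadata format: it scans the raw bytes for wasbs:// URIs with a regular expression, so it may occasionally capture a stray trailing character. The function names and the idea of passing in a pre-collected listing of blob names are assumptions made for illustration.

```python
import re

# Heuristic: match wasb:// or wasbs:// URIs embedded in the binary _metadata
# content. The character class is an assumption about what a path may contain.
URI_PATTERN = re.compile(rb"wasbs?://[A-Za-z0-9._~:/@%${}+\-]+")


def referenced_files(metadata_bytes: bytes) -> list[str]:
    """Return the distinct storage URIs found in raw _metadata bytes, in order."""
    seen: list[str] = []
    for match in URI_PATTERN.finditer(metadata_bytes):
        uri = match.group().decode("ascii")
        if uri not in seen:
            seen.append(uri)
    return seen


def missing_files(metadata_bytes: bytes, present_names: set[str]) -> list[str]:
    """Return referenced URIs whose last path segment is not in present_names.

    present_names is a hypothetical input: the file names actually found in
    the checkpoint folder, collected by whatever listing tool is available.
    """
    return [
        uri
        for uri in referenced_files(metadata_bytes)
        if uri.rsplit("/", 1)[-1] not in present_names
    ]
```

For example, reading the file with `open("_metadata", "rb").read()` and passing `{"_metadata"}` as the listing would return every referenced file, which matches the situation described here.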
Am I raising a false alarm here? Would you need to inspect the _metadata file
as well? Or can I do a better job of analyzing it myself?
Nikola.
From: Gabor Somogyi <[email protected]>
Date: Tuesday, June 10, 2025 at 10:52 AM
To: Nikola Milutinovic <[email protected]>
Cc: Flink Users <[email protected]>
Subject: Re: Savepoints and Checkpoints missing files
Hi Nikola,
Fails how? A stack trace or error message would be helpful.
G
On Tue, Jun 10, 2025 at 10:48 AM Nikola Milutinovic
<[email protected]> wrote:
Hello.
We are running Flink 1.20.1 on Kubernetes (AKS). We have observed a consistent
problem: both checkpoints and savepoints save only the “_metadata” file and
nothing else. Sometimes this is OK, because all the data is in that one file. But
sometimes “_metadata” holds references to other files, which are not present.
I understand that if the state is smaller than a configured threshold, it is
stored inline in that one file, and if it is larger, it is written to
additional files. Our state is generally minuscule, so it should always fit
into _metadata, but sometimes I inspect the _metadata file and see
references to those additional files. Trying to restore from such a
savepoint or checkpoint always fails.
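The size rule described above can be written down as a toy model. This is a simplified sketch, not Flink's actual code path: it assumes the limit in question is the option Flink 1.x documents as `state.storage.fs.memory-threshold`, with a default of 20 KB, and that state chunks below it are stored inline. Both the constant and the function name are assumptions for illustration and should be checked against the documentation for the version in use.

```python
# Assumed default of `state.storage.fs.memory-threshold` (20 KB). This value
# is an assumption; verify it for the Flink version in use.
INLINE_THRESHOLD_BYTES = 20 * 1024


def placement(state_chunk_bytes: int, threshold: int = INLINE_THRESHOLD_BYTES) -> str:
    """Say where a state chunk of this size would go in the simplified model."""
    if state_chunk_bytes < threshold:
        return "inline in _metadata"
    return "separate file"
```

Under this model a chunk of a few kilobytes would be expected to stay inline, which is why references to separate files look surprising for a very small state.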
Does anyone know of a reason for this behavior?
This is our configuration (relevant parts, I have substituted our account with
a variable):
high-availability.type: kubernetes
high-availability.cluster-id: flink-cluster-session-cluster
high-availability.storageDir: wasbs://flink-storage@${account}.blob.core.windows.net/data
high-availability.jobmanager.port: 6123
state.backend.type: rocksdb
execution.checkpointing.num-retained: 3
execution.checkpointing.savepoint-dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-savepoints
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.incremental: true
execution.checkpointing.interval: 60000
execution.checkpointing.timeout: 300000
$internal.flink.version: v1_20
execution.checkpointing.storage: filesystem
execution.checkpointing.dir: wasbs://flink-storage@${account}.blob.core.windows.net/flink-checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
execution.checkpointing.min-pause: 5000
execution.target: kubernetes-session
fs.azure.account.keyprovider.${account}.blob.core.windows.net: org.apache.flink.fs.azurefs.EnvironmentVariableKeyProvider
env.java.opts.all: --add-exports=java.base/sun.net.util=ALL-UNNAMED
--add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.parser=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.tree=ALL-UNNAMED
--add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.text=ALL-UNNAMED
--add-opens=java.base/java.time=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED
Nikola.