chenwyi2 opened a new issue, #8806:
URL: https://github.com/apache/iceberg/issues/8806
### Apache Iceberg version
1.2.1
### Query engine
Flink
### Please describe the bug 🐞
Recently a job failed with "Failed to open input stream for file:
xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro".
The situation: a task failed during checkpoint 25893. When the job restarted,
it reset the checkpoint ID to 25893 and restored from Savepoint 25892.
However, temporary manifests are deleted once a commit succeeds, so the
manifests for checkpoint 25892 had already been deleted by then. How can we
deal with this?
The detailed log is:
`2023-10-09 16:39:57,724 INFO org.apache.iceberg.hive.HiveTableOperations [] - Committed to table icebergCatalog.xxx with the new metadata location xxx/metadata/300237-907ef004-3085-439f-b606-fc2b106bcb54.metadata.json
2023-10-09 16:39:57,747 INFO org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/300136-01612e4a-add1-4a3f-b7e7-1ee25e063e04.metadata.json' to trash
2023-10-09 16:39:57,747 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Successfully committed to table icebergCatalog.xxx in 3142 ms
2023-10-09 16:39:57,747 INFO org.apache.iceberg.SnapshotProducer [] - Committed snapshot 8753072822283034565 (MergeAppend)
2023-10-09 16:39:57,788 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed append to table: icebergCatalog.xxx, branch: main, checkpointId 25892 in 7394 ms
2023-10-09 16:39:58,011 INFO org.apache.hadoop.fs.TrashPolicyDefault [] - Moved: 'xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro' to trash
2023-10-09 16:39:58,011 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - deleted manifest : xxx/metadata/3e3a37a06993c2a0134beb41c1ceb66e-49884f57af809d38cc85f0c7211a0bc1-00000-0-25892-00048.avro
`
The job then failed for other reasons:
`2023-10-09 16:41:59,902 INFO org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Checkpoint 25893 has been notified as aborted, would not trigger any checkpoint.`
On restart:
`2023-10-09 16:43:41,742 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job 3e878a638ceb45633f31e8813c521740 to 25893.
2023-10-09 16:43:41,742 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 3e878a638ceb45633f31e8813c521740 from Savepoint 25892 @ 0 for 3e878a638ceb45633f31e8813c521740 located at xxx`
But the manifest referenced by that restored state had already been deleted.
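
To make the sequence concrete, here is a minimal toy model of what I think is happening. It is only a sketch under my own assumptions, not Iceberg's or Flink's actual code: `ToyCommitter`, its methods, and the `metadata/manifest-<id>.avro` path are hypothetical names made up for illustration. It shows state persisted at checkpoint 25892 still pointing at a temporary manifest that is deleted right after the commit, so a later restore from that state cannot open the file.

```python
class ToyCommitter:
    """Toy stand-in for a sink committer; not Iceberg's actual implementation."""

    def __init__(self):
        self.files = set()   # stands in for the file system
        self.pending = {}    # checkpoint id -> temporary manifest path

    def snapshot_state(self, checkpoint_id):
        # Write a temporary manifest and record its path in the checkpointed state.
        path = f"metadata/manifest-{checkpoint_id}.avro"  # hypothetical path
        self.files.add(path)
        self.pending[checkpoint_id] = path
        return dict(self.pending)  # copy that the checkpoint persists

    def notify_checkpoint_complete(self, checkpoint_id):
        # Commit succeeded: the temporary manifest is no longer needed, so delete it.
        path = self.pending.pop(checkpoint_id)
        self.files.discard(path)

    def restore(self, persisted_state):
        # Restoring replays the persisted state, which still references the manifest.
        for checkpoint_id, path in sorted(persisted_state.items()):
            if path not in self.files:
                raise FileNotFoundError(
                    f"Failed to open input stream for file: {path}"
                )


def reproduce():
    committer = ToyCommitter()
    persisted = committer.snapshot_state(25892)   # state saved with checkpoint 25892
    committer.notify_checkpoint_complete(25892)   # commit, then manifest deleted
    try:
        committer.restore(persisted)              # restart restores from 25892
    except FileNotFoundError as err:
        return str(err)
    return None
```

In this toy model, `reproduce()` returns the "Failed to open input stream" message, which matches the shape of the failure above. Whether the real committer should skip checkpoints that are already committed on restore, or tolerate the missing manifest, is exactly the question I am asking.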