openinx commented on issue #2808: URL: https://github.com/apache/iceberg/issues/2808#issuecomment-913629145
> I think Txn1, the commit that hit the socket timeout, is eventually committed successfully, but the job restores before Txn1 is committed, and the restored job then commits normally. That leaves two snapshots with the same max-committed-checkpointid, which could explain why there are two identical txn commits in the metadata.
>
> I can think of two candidate ways to resolve this consistency issue:
>
> 1. Just quit the Flink streaming job when encountering CommitStateUnknownException and let people check whether it's OK to restart the Flink job.
> 2. Catch the CommitStateUnknownException in [commitOperation](https://github.com/apache/iceberg/blob/e20088449daec9ed431754044b520b3ac5fa3eaa/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L308), and repeatedly check the Iceberg table to see whether the stale txn has been committed. If the retries are exhausted and the check times out (I mean the txn was never committed successfully), then we start the failover. With this approach we will need an empirically chosen timeout to decide when it's OK to stop checking the hive-metastore and start the Flink job failover....
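
The second option above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `IcebergFilesCommitter` code: the local `CommitStateUnknownException` class is a stand-in for the Iceberg exception of the same name, and `commitWithStateCheck`, `Outcome`, `maxChecks`, and `pauseBetweenChecks` are hypothetical names introduced here. It also assumes that a table-side max committed checkpoint id greater than or equal to the current checkpoint id means the stale txn did land.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

class CommitStateCheckSketch {

    /** Stand-in for Iceberg's CommitStateUnknownException (assumption: same role, simplified). */
    public static class CommitStateUnknownException extends RuntimeException {
        public CommitStateUnknownException(String message) {
            super(message);
        }
    }

    /** Hypothetical result type: either the txn is known to be committed, or we give up and fail over. */
    public enum Outcome { COMMITTED, FAILOVER }

    /**
     * Runs the commit. If its state is unknown, polls the table's max committed
     * checkpoint id up to maxChecks times before asking for a failover.
     */
    public static Outcome commitWithStateCheck(
            Runnable commitOperation,
            LongSupplier maxCommittedCheckpointId, // hypothetical: re-reads the id from the refreshed table
            long checkpointId,
            int maxChecks,
            Duration pauseBetweenChecks) {
        try {
            commitOperation.run();
            return Outcome.COMMITTED;
        } catch (CommitStateUnknownException e) {
            for (int attempt = 0; attempt < maxChecks; attempt++) {
                // Assumption: id >= checkpointId means the stale txn was committed after all.
                if (maxCommittedCheckpointId.getAsLong() >= checkpointId) {
                    return Outcome.COMMITTED;
                }
                try {
                    Thread.sleep(pauseBetweenChecks.toMillis());
                } catch (InterruptedException interrupted) {
                    // Preserve the interrupt flag and stop checking.
                    Thread.currentThread().interrupt();
                    return Outcome.FAILOVER;
                }
            }
            // Checks exhausted without seeing the txn: start the failover.
            return Outcome.FAILOVER;
        }
    }
}
```

The product `maxChecks * pauseBetweenChecks` plays the role of the empirically chosen timeout mentioned above. Setting `maxChecks` to 0 degenerates into the first option, where the job fails as soon as the commit state is unknown.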
