openinx commented on issue #2808:
URL: https://github.com/apache/iceberg/issues/2808#issuecomment-913629145


   > I think that after the socket timeout Txn1 is eventually committed 
successfully, but the job restores before Txn1 is committed, and the restored 
job commits normally. Then there will be two snapshots with the same 
max-committed-checkpointid.
   
   This could explain why there are two identical txn commits in the metadata.  
I think the candidate ways to resolve this consistency issue are: 
   
   1. Just quit the Flink streaming job when it encounters a 
CommitStateUnknownException, and let people check whether it's OK to restart 
the Flink job. 
   2. Catch the CommitStateUnknownException in 
[commitOperation](https://github.com/apache/iceberg/blob/e20088449daec9ed431754044b520b3ac5fa3eaa/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L308),
  and repeatedly check the Iceberg table to see whether the stale txn has been 
committed.   If the checks time out without seeing the commit (I mean the txn 
was never committed successfully), then we start the failover.   With this 
approach we will need an empirical timeout to decide when it's OK to stop 
checking the hive-metastore and start the Flink job failover.
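
   A rough sketch of what option 2 could look like, only to make the control 
flow concrete. Everything here is an assumption and not the actual 
IcebergFilesCommitter code: `CommitRecoverySketch`, `commitWithVerification`, 
the local `CommitStateUnknownException` stand-in, and the 
`maxCommittedCheckpointId` supplier are hypothetical names. In the real 
committer the supplier would presumably refresh the table and read the max 
committed checkpoint id from the latest snapshot summary.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class CommitRecoverySketch {

    /** Local stand-in for Iceberg's CommitStateUnknownException (assumption). */
    static class CommitStateUnknownException extends RuntimeException {
        CommitStateUnknownException(String message) {
            super(message);
        }
    }

    /**
     * Runs the commit. If its outcome is unknown, polls the table state until
     * the checkpoint shows up as committed or the timeout elapses. Rethrows
     * the original exception on timeout so the job can fail over.
     */
    static void commitWithVerification(
            Runnable commit,
            LongSupplier maxCommittedCheckpointId, // hypothetical: re-reads table metadata
            long checkpointId,
            Duration timeout,
            Duration pollInterval) {
        try {
            commit.run();
        } catch (CommitStateUnknownException e) {
            long deadline = System.nanoTime() + timeout.toNanos();
            while (true) {
                if (maxCommittedCheckpointId.getAsLong() >= checkpointId) {
                    return; // the stale txn did land, so do not commit again
                }
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // still unknown after the timeout: start failover
                }
                try {
                    Thread.sleep(pollInterval.toMillis());
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicLong committed = new AtomicLong(4);
        // Simulate a commit that reaches the metastore but times out on the client.
        commitWithVerification(
                () -> {
                    committed.set(5);
                    throw new CommitStateUnknownException("socket timeout");
                },
                committed::get,
                5L,
                Duration.ofSeconds(1),
                Duration.ofMillis(50));
        System.out.println("checkpoint 5 confirmed as committed");
    }
}
```

   The timeout is the empirical value mentioned above: too short and we fail 
over while the txn is still in flight, too long and the job stalls on a 
commit that never happened.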


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


