kbendick opened a new pull request #3110: URL: https://github.com/apache/iceberg/pull/3110
This test has a race condition: one of the two disjoint DAGs can finish and close its tasks before the other has finished. Once the task(s) belonging to the terminated DAG are no longer present to participate in checkpointing, every checkpoint attempt is aborted, which leads to an infinite loop of attempting to re-checkpoint.

Here are some of the logs (visible when passing `-i` to Gradle for info-level logs):

```
2021-09-13T08:19:47.7896411Z > Task :iceberg-flink:test
2021-09-13T08:19:47.7899950Z [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
2021-09-13T08:19:47.7905489Z [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
2021-09-13T08:19:47.7914766Z [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
2021-09-13T08:19:47.7920502Z [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: rightCustomSource -> rightIcebergSink-rightIcebergSink -> rightIcebergSink-IcebergStreamWriter (1/1) of job 437e46445e777ca2231677f60f87496a is not in state RUNNING but FINISHED instead. Aborting checkpoint.
```

Link to another PR where I attempted to debug this, with some relevant discussion: https://github.com/apache/iceberg/pull/3106

This (temporarily) closes this issue: https://github.com/apache/iceberg/issues/3091, though we should still fix the `BoundedTestSource` (this edge case might be fixed come Flink 1.14). More details and discussion are in the issue (particularly the linked FLIP).
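To make the failure mode concrete, here is a minimal toy model in plain Java. It does not use Flink's API: `CheckpointRaceSketch`, `TaskState`, `triggerCheckpoint`, and `countCompleted` are hypothetical names invented for illustration. The only behavior it mirrors is the one the logs show, namely that a checkpoint is aborted whenever a triggering task is not in state RUNNING.

```java
import java.util.List;

// Toy model only: none of these names are Flink or Iceberg APIs.
public class CheckpointRaceSketch {
    enum TaskState { RUNNING, FINISHED }

    // Mirrors the logged rule: abort unless every triggering task is RUNNING.
    static boolean triggerCheckpoint(List<TaskState> triggeringTasks) {
        for (TaskState state : triggeringTasks) {
            if (state != TaskState.RUNNING) {
                return false; // corresponds to "Aborting checkpoint."
            }
        }
        return true;
    }

    // Counts how many of `attempts` checkpoint attempts would complete.
    static int countCompleted(List<TaskState> triggeringTasks, int attempts) {
        int completed = 0;
        for (int i = 0; i < attempts; i++) {
            if (triggerCheckpoint(triggeringTasks)) {
                completed++;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        // Both disjoint DAGs still running: checkpoints can complete.
        List<TaskState> bothRunning = List.of(TaskState.RUNNING, TaskState.RUNNING);
        // The right-hand DAG finished first: every attempt is aborted.
        List<TaskState> rightFinished = List.of(TaskState.RUNNING, TaskState.FINISHED);

        System.out.println("both running:   " + countCompleted(bothRunning, 4));
        System.out.println("right finished: " + countCompleted(rightFinished, 4));
    }
}
```

In this model, no number of retries helps once one source has finished. If the remaining source only makes progress after a completed checkpoint (an assumption about how the test source behaves, not something shown in the logs), the test can never terminate.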
