XComp opened a new pull request #19102:
URL: https://github.com/apache/flink/pull/19102


   ## What is the purpose of the change
   
   Flink 1.14 and earlier would just print a warning in case of cleanup 
failures. The initial 1.15 retryable cleanup functionality failed the cluster 
fatally if the cleanup failed. This can happen if the user decides to limit the 
number of attempts (by default, Flink will try to clean up infinitely). This 
would cause issues in session mode: other jobs would stop as well, with the 
whole Flink cluster failing fatally. That's not what we want.
   
   In contrast, for job and application mode, failing fatally would be an 
option to notify the user about the failure, since Flink is left in an 
inconsistent state. However, in HA mode, this would lead to a restart of the 
worker and the cleanup being picked up again. The cleanup logic would thus be 
triggered again, which might not be what the user intended when limiting the 
cleanup retries.
   
   Therefore, logging a warning is the more reasonable thing to do in job and 
application mode as well.
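
   For illustration, the intended behavior (retry up to the configured limit, 
then log a warning instead of failing fatally) could be sketched roughly as 
below. This is not the actual Dispatcher code from this PR: 
`CleanupRetrySketch`, `cleanupWithRetries` and `Outcome` are hypothetical names, 
the log is modeled as a plain list, and the delay between attempts is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch only; none of these names exist in Flink.
class CleanupRetrySketch {

    /** Possible results of a bounded cleanup run. */
    enum Outcome { CLEANED_UP, GAVE_UP_WITH_WARNING }

    /**
     * Runs the cleanup attempt up to maxAttempts times. If every attempt fails,
     * a warning is recorded and the method returns normally, so other jobs
     * sharing the cluster are not affected.
     */
    static Outcome cleanupWithRetries(
            BooleanSupplier cleanupAttempt, int maxAttempts, List<String> log) {
        // A long counter avoids overflow when maxAttempts is Integer.MAX_VALUE,
        // which stands in for "retry infinitely" in this sketch.
        for (long attempt = 1; attempt <= maxAttempts; attempt++) {
            if (cleanupAttempt.getAsBoolean()) {
                return Outcome.CLEANED_UP;
            }
            // A real retry strategy would wait here before the next attempt.
        }
        log.add("WARN: cleanup failed after " + maxAttempts
                + " attempts; job artifacts may need manual cleanup");
        return Outcome.GAVE_UP_WITH_WARNING;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // A cleanup that never succeeds, limited to three attempts.
        Outcome outcome = cleanupWithRetries(() -> false, 3, log);
        System.out.println(outcome);
        log.forEach(System.out::println);
    }
}
```

   The point of the sketch is that the method returns normally after the last 
failed attempt. Leaving artifacts behind then becomes an explicit choice made 
by limiting the attempts, not something that takes down the cluster.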
   
   ## Brief change log
   
   * Added log message to Dispatcher after cleanup
   * Updated the default value for `fixed-delay` to also retry infinitely, so 
that users have to explicitly change the configuration parameter if they 
really want to leave jobs in an inconsistent state
   * Updated the documentation of the configuration parameters and the 
retryable cleanup accordingly
   
   ## Verifying this change
   
   * Adapted `DispatcherCleanupITCase`: the test would now fail if the cluster 
failed fatally, because the `TestingFatalErrorHandlerResource` is verified 
after the test succeeds.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? docs
   



