kylemeow opened a new pull request #11639: [FLINK-16626][runtime] Prevent REST 
handler from being closed more than once
URL: https://github.com/apache/flink/pull/11639
 
 
   ## What is the purpose of the change
   
   In the Flink 1.10.0 release, job cancellation is unreliable: users 
frequently see a *java.util.concurrent.TimeoutException* on the client side 
because the REST endpoint closes prematurely, before the response has been 
sent. 
   
   Discussion with the community and further investigation showed that there 
are two issues to address:
   1. AbstractHandler and its subclasses can be closed more than once (whether 
intentionally or unintentionally). This can lead to unexpected behavior such 
as exceptions, especially when interacting with external systems, or to an 
unintended deregistration from the Phaser in the handler instance, which 
causes early shutdown of the cluster. 
   2. In the WebMonitorEndpoint class, the same jobCancelTerminationHandler 
instance is registered twice. During handler closure, the *closeAsync* method 
is therefore called twice on that instance, so the cluster enters the 
internalShutdown process prematurely, leaving unfinished responses behind.
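   
The effect of the second close can be illustrated with a small, simplified 
model. This is only a sketch under stated assumptions: `DoubleCloseDemo` and 
`RequestTracker` are hypothetical names, and the code is not taken from 
AbstractHandler. It assumes the handler holds one Phaser party for itself and 
one per in-flight request, and that shutdown proceeds once the Phaser 
terminates.

```java
import java.util.concurrent.Phaser;

// Hypothetical, simplified model of the double-close problem; not Flink code.
class DoubleCloseDemo {

    /** Tracks in-flight requests: one Phaser party per request plus one for the handler. */
    static class RequestTracker {
        // The handler itself is the first registered party.
        private final Phaser phaser = new Phaser(1);

        void requestStarted() {
            phaser.register();
        }

        void requestFinished() {
            phaser.arriveAndDeregister();
        }

        /** Unguarded close: every call deregisters one party. */
        void close() {
            phaser.arriveAndDeregister();
        }

        /** True once no registered parties remain, i.e. shutdown may proceed. */
        boolean isTerminated() {
            return phaser.isTerminated();
        }
    }

    public static void main(String[] args) {
        RequestTracker tracker = new RequestTracker();
        tracker.requestStarted(); // a response is still being written

        tracker.close(); // first close: only the handler's own party leaves
        System.out.println("after first close:  terminated=" + tracker.isTerminated());

        tracker.close(); // second close: takes the place of the unfinished request
        System.out.println("after second close: terminated=" + tracker.isTerminated());
    }
}
```

In this model the first close leaves the Phaser running because the request 
is still registered, while the second close deregisters the last remaining 
party and terminates it even though the request never finished.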
   
   ## Brief change log
   
   - Added an AtomicBoolean field to prevent the closeAsync method of a 
handler instance from taking effect more than once.
   - Added a new legacyJobCancelTerminationHandler so that the existing 
jobCancelTerminationHandler instance is no longer reused.
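   
The close-once guard from the first item can be sketched roughly as follows. 
This is a minimal illustration, not the code in this pull request: 
`CloseOnceDemo`, `GuardedHandler`, and `cleanupRuns` are hypothetical names, 
and the real handler's shutdown logic is replaced by a counter.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of an idempotent closeAsync; not Flink's AbstractHandler.
class CloseOnceDemo {

    static class GuardedHandler {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();
        // Counts how often the shutdown logic actually ran (for illustration only).
        private final AtomicInteger cleanupRuns = new AtomicInteger();

        CompletableFuture<Void> closeAsync() {
            // compareAndSet succeeds for exactly one caller, even under concurrency.
            if (closed.compareAndSet(false, true)) {
                cleanupRuns.incrementAndGet(); // stand-in for the real shutdown work
                terminationFuture.complete(null);
            }
            // Every caller, first or repeated, gets the same future.
            return terminationFuture;
        }

        int cleanupRuns() {
            return cleanupRuns.get();
        }
    }

    public static void main(String[] args) {
        GuardedHandler handler = new GuardedHandler();
        CompletableFuture<Void> first = handler.closeAsync();
        CompletableFuture<Void> second = handler.closeAsync();
        System.out.println("cleanup runs: " + handler.cleanupRuns());
        System.out.println("same future:  " + (first == second));
    }
}
```

Returning the same future from repeated calls lets every caller still wait 
for termination, while the shutdown work itself runs only once.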
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   YARNJobCancellationITCase
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   

