huwh opened a new pull request, #21565:
URL: https://github.com/apache/flink/pull/21565
## What is the purpose of the change
Currently, if Kubernetes/Yarn does not have enough resources to fulfill
Flink's resource requirement, there will be pending pod/container requests on
Kubernetes/Yarn. These pending resource requirements are never cleared until
either fulfilled or the Flink cluster is shutdown.
However, sometimes Flink no longer needs the pending resources. E.g., the
slot request is then fulfilled by another slots that become available, or the
job failed due to slot request timeout (in a session cluster). In such cases,
Flink does not remove the resource request until the resource is allocated,
then it discovers that it no longer needs the allocated resource and release
them. This would affect the underlying Kubernetes/Yarn cluster, especially when
the cluster is under heavy workload.
It would be good for Flink to cancel pod/container requests as earlier as
possible if it can discover that some of the pending workers are no longer
needed.
## Brief change log
- *SlotManager will remove unused
PendingTaskManagerSlot/PendingTaskManager and then declare the resources needed
to ResourceAllocator*
- *ActiveResourceManager will cancel pending workers by complete
requestWorkerFuture by RequestCancelledException*
- *(YARN/Kubernetes)ResourceManagerDriver will stop/release the pending
workers when get the RequestCancelledException*
## Verifying this change
This change added tests and can be verified as follows:
*(example:)*
- *Added unit test for (YARN/Kubernetes)ResourceManagerDriver releasing
pending worker*
- *Added unit test for ActiveResouceManager dealing with
declareResourceNeeded*
- *Manually verified the change by running 2 registered TaskManagers, 3
pending TaskManagers, a streaming program with set slot.request.timeout to
10000, and the job will failed after slot request timeout, and the pending
TaskManagers will be cancelled. Verified both in YARN/Kubernetes.*
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: SlotManager,Kubernetes/Yarn
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented?
-
https://docs.google.com/document/d/1lcmf3MKmcmf9tsPc1whaZHMYKurGoGtqOXEep9ngP2k/edit
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]