gaoyunhaii opened a new pull request #8841: [FLINK-12765][coordinator] 
Bookkeeping of available resources of allocated slots in SlotPool
URL: https://github.com/apache/flink/pull/8841
 
 
   ## What is the purpose of the change
   
   This PR introduces bookkeeping logic for shared slots and co-located slots, 
as part of fine-grained resource management for Flink. In the current design, 
a task always requests resources according to its own needs, and the returned 
slot may be larger than the requested resource. The surplus can then be used 
for slot sharing and co-location groups.
   
   For slot sharing, if the slot has enough resources for all the slot 
requests, they are fulfilled directly; otherwise the over-allocated requests 
retry and apply for a new slot. In addition, when checking already resolved 
allocated slots, the remaining resource is used for matching instead of the 
total resource.
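
   The slot sharing behaviour described above can be illustrated with a 
minimal sketch. It is not the code in this PR: the class `SharedSlotSketch`, 
its `Result` type, the single `double` standing in for a resource profile, and 
the choice to serve requests greedily in arrival order are all assumptions 
made for illustration. The `retryable()` rule follows item 3 of the change log.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the SlotPool or scheduler code touched by this PR. */
final class SharedSlotSketch {

    /** Outcome of matching pending requests against a resolved slot. */
    static final class Result {
        final List<Integer> fulfilled = new ArrayList<>();     // indices served by the slot
        final List<Integer> overAllocated = new ArrayList<>(); // indices that did not fit
        double remaining;                                      // resource still unused

        /** Assumed rule: a failure may be retried iff some request was fulfilled. */
        boolean retryable() {
            return !fulfilled.isEmpty();
        }
    }

    /**
     * Matches requests against the slot in arrival order (an assumption),
     * always comparing against the remaining resource, not the total.
     */
    static Result resolve(double slotCapacity, List<Double> requests) {
        Result result = new Result();
        result.remaining = slotCapacity;
        for (int i = 0; i < requests.size(); i++) {
            double need = requests.get(i);
            if (need <= result.remaining) {
                result.remaining -= need;
                result.fulfilled.add(i);
            } else {
                // Over-allocated: the caller would apply for a new slot.
                result.overAllocated.add(i);
            }
        }
        return result;
    }
}
```

   For example, a slot of capacity 2.0 offered to requests of 1.0, 1.5 and 0.5 
would serve the first and third request and leave the second one to retry on a 
new slot.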
   
   For a co-location group, if the resource of the allocated slot is not 
enough for all the co-located tasks, the allocation fails with no retry. More 
concretely, if the requests already exceed the allocated resource when the 
slot is offered by the RM, all of them fail directly without retry. If they do 
not exceed the allocated resource at that point, they are marked as 
successful. However, if later co-located requests find that not enough 
resource is left, these new requests fail without retry. Since all the 
co-located tasks belong to the same region, all of them will fail eventually. 
This implementation avoids postponing the requests until all requests of the 
co-located group have been seen, so it introduces no drawbacks for requests 
without resource requirements.
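
   The two phases of the co-location case (before and after the slot is 
offered) can be sketched as a small state machine. This is only an 
illustration under assumptions: `CoLocatedSlotSketch`, its `Status` enum, and 
the scalar resource value are hypothetical names and simplifications, not 
classes from this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of co-located requests sharing one underlying slot. */
final class CoLocatedSlotSketch {

    enum Status { PENDING, FULFILLED, FAILED_NO_RETRY }

    private final List<Status> statuses = new ArrayList<>();
    private double requested = 0.0;   // sum of accepted or pending requests
    private double capacity = 0.0;    // known only once the slot is offered
    private boolean resolved = false;

    /** Registers a co-located request and returns its id. */
    int request(double need) {
        int id = statuses.size();
        if (!resolved) {
            // Slot not offered yet: accept optimistically, decide on offer.
            requested += need;
            statuses.add(Status.PENDING);
        } else if (need <= capacity - requested) {
            requested += need;
            statuses.add(Status.FULFILLED);
        } else {
            // Not enough resource left in the resolved slot: fail, no retry.
            statuses.add(Status.FAILED_NO_RETRY);
        }
        return id;
    }

    /** Called when the underlying slot is offered; settles all pending requests. */
    void offerSlot(double slotCapacity) {
        resolved = true;
        capacity = slotCapacity;
        Status outcome = requested <= capacity ? Status.FULFILLED : Status.FAILED_NO_RETRY;
        for (int i = 0; i < statuses.size(); i++) {
            if (statuses.get(i) == Status.PENDING) {
                statuses.set(i, outcome);
            }
        }
    }

    Status status(int id) {
        return statuses.get(id);
    }
}
```

   Note that no request waits for the rest of its group: pending requests are 
settled as soon as the slot is offered, and later ones are decided immediately.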
   
   ## Brief change log
   
   1. Introduce statistics of the requested resources in the hierarchical 
structure of MultiTaskSlot/SingleTaskSlot to keep track of the resources 
already requested.
   2. Modify the interface of `SlotSelectionStrategy` to also pass the 
remaining resource of the underlying slot. The implementation of the strategies 
should use the remaining resource instead of the total resource.
   3. Add resource checking logic for when the underlying slot is resolved. 
The over-allocated requests are marked as failed. The failure can be retried 
iff some requests are fulfilled by the underlying slot.
   4. Add retry logic for over-allocated requests in `SchedulerImpl` if the 
exception is marked as retryable.
   5. For co-located requests, add a check of whether the remaining resource 
can fulfill the requests if the underlying slot is already resolved.
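
   As a rough illustration of item 1, the requested resource of a node in the 
slot hierarchy can be seen as the sum over its children. The sketch below uses 
hypothetical classes (`SlotTreeSketch`, `Node`, `SingleNode`, `MultiNode`) and 
a scalar resource value; it recomputes the sum on demand, which is an 
assumption and may differ from how the PR maintains its statistics.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of resource bookkeeping in a slot hierarchy. */
final class SlotTreeSketch {

    /** A node that can report how much resource has been requested below it. */
    abstract static class Node {
        abstract double requested();
    }

    /** Leaf: a single task's request. */
    static final class SingleNode extends Node {
        private final double need;

        SingleNode(double need) {
            this.need = need;
        }

        @Override
        double requested() {
            return need;
        }
    }

    /** Inner node: groups leaves and nested inner nodes. */
    static final class MultiNode extends Node {
        private final List<Node> children = new ArrayList<>();

        <T extends Node> T add(T child) {
            children.add(child);
            return child;
        }

        void release(Node child) {
            children.remove(child);
        }

        /** Sum of everything requested in this subtree. */
        @Override
        double requested() {
            double sum = 0.0;
            for (Node child : children) {
                sum += child.requested();
            }
            return sum;
        }

        /** Remaining resource if the underlying slot has the given capacity. */
        double remaining(double slotCapacity) {
            return slotCapacity - requested();
        }
    }
}
```

   A selection strategy in the spirit of item 2 would then compare a new 
request against `remaining(...)` of the root rather than against the slot's 
total capacity.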
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added tests that validate the calculation of the requested resource for 
the hierarchical structure of MultiTaskSlot/SingleTaskSlot.
     - Added tests that validate the routine to fail the over-allocated 
requests when the underlying slot is resolved.
     - Added tests that validate the retry logic after failing due to 
over-allocation.
     - Added tests that validate the failure of the co-located requests if the 
slot is not large enough for all the co-located tasks.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? 
[Doc](https://docs.google.com/document/d/1UR3vYsLOPXMGVyXHXYg3b5yNZcvSTvot4wela8knVAY/edit?usp=sharing)
   

