[jira] [Commented] (FLINK-4348) Implement slot allocation protocol with TaskExecutor

ASF GitHub Bot (JIRA) Thu, 29 Sep 2016 08:38:46 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15533113#comment-15533113
 ]


ASF GitHub Bot commented on FLINK-4348:
---------------------------------------

GitHub user mxm opened a pull request:

    https://github.com/apache/flink/pull/2571

    [FLINK-4348] Simplify logic of SlotManager

    This pull request is split up into two commits:
    
    1. It removes some code from the `SlotManager` to make it simpler. 
      - It makes use of `handleSlotRequestFailedAtTaskManager` and simplifies 
it because we can assume that a Slot is only registered once if we talk to the 
same instance of a TaskExecutor. Further, we can omit to check the free slots 
because we previously removed the slot from the free slots.
      - `updateSlotStatus` only deals with new Task slots. All other updates 
are performed by the `ResourceManager`. In case the ResourceManager creashes, 
it will re-register all slot statuses.
    
    2. It fences `TaskExecutor` messages using an `InstanceID` which is 
required to make 1 work correctly. New messages have been introduced to achieve 
that.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mxm/flink flip-6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2571
    
----
commit 1d297403e5e085d944d80b498c4a269a794f8e2f
Author: Maximilian Michels <[email protected]>
Date:   2016-09-29T13:08:32Z

    [FLINK-4347] change assumptions in SlotManager allocation
    
    - Makes use of `handleSlotRequestFailedAtTaskManager` and simplifies it
    because we can assume that a Slot is only registered once if we talk to
    the same instance of a TaskExecutor. Further, we can omit to check the
    free slots because we previously removed the slot from the free slots.
    
    - `updateSlotStatus` only deals with new Task slots. All other updates
    are performed by the `ResourceManager`. In case the ResourceManager
    creashes, it will re-register all slot statuses.

commit c6768524c869ecdb84f0a8ae48837afc329c9714
Author: Maximilian Michels <[email protected]>
Date:   2016-09-29T14:03:39Z

    [FLINK-4348] discard message from old TaskExecutorGateways
    
    This fences old message using the InstanceID of the TaskExecutor.

----


> Implement slot allocation protocol with TaskExecutor
> ----------------------------------------------------
>
>                 Key: FLINK-4348
>                 URL: https://issues.apache.org/jira/browse/FLINK-4348
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: Kurt Young
>            Assignee: Maximilian Michels
>
> When slotManager finds a proper slot in the free pool for a slot request,  
> slotManager marks the slot as occupied, then tells the taskExecutor to give 
> the slot to the specified JobMaster. 
> when a slot request is sent to taskExecutor, it should contain following 
> parameters: AllocationID, JobID,  slotID, resourceManagerLeaderSessionID. 
> There exists 3 following possibilities of the response from taskExecutor, we 
> will discuss when each possibility happens and how to handle.
> 1. Ack request which means the taskExecutor gives the slot to the specified 
> jobMaster as expected.   
> 2. Decline request if the slot is already occupied by other AllocationID.  
> 3. Timeout which could caused by lost of request message or response message 
> or slow network transfer. 
> On the first occasion, ResourceManager need to do nothing. However, under the 
> second and third occasion, ResourceManager need to notify slotManager, 
> slotManager will verify and clear all the previous allocate information for 
> this slot request firstly, then try to find a proper slot for the slot 
> request again. This may cause some duplicate allocation, e.g. the slot 
> request to TaskManager is successful but the response is lost somehow, so we 
> may request a slot in another TaskManager, this causes two slots assigned to 
> one request, but it can be taken care of by rejecting registration at 
> JobMaster.
> There are still some question need to discuss in a step further.
> 1. Who send slotRequest to taskExecutor, SlotManager or ResourceManager? I 
> think it's better that SlotManager delegates the rpc call to ResourceManager 
> when SlotManager need to communicate with outside world.  ResourceManager 
> know which taskExecutor to send the request based on ResourceID. Besides this 
> RPC call which used to request slot to taskExecutor should not be a 
> RpcMethod,  because we hope only SlotManager has permission to call the 
> method, but the other component, for example JobMaster and TaskExecutor, 
> cannot call this method directly.
> 2. If JobMaster reject the slot offer from a TaskExecutor, the TaskExecutor 
> should notify the free slot to ResourceManager immediately, or wait for next 
> heartbeat sync. The advantage of first way is the resourceManager’s view 
> could be updated faster. The advantage of second way is save a RPC method in 
> ResourceManager.
> 3. There are two communication type. First, the slot request could be sent as 
> an ask operation where the response is returned as a future. Second, 
> resourceManager send the slot request in fire and forget way, the response 
> could be returned by an RPC call. I prefer the first one because it is more 
> simple and could save a RPC method in ResourceManager (for callback in the 
> second way).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-4348) Implement slot allocation protocol with TaskExecutor

Reply via email to