Github user mxm commented on the issue:

    https://github.com/apache/flink/pull/2571
  
    Thanks for the feedback, @beyond1920 and @KurtYoung.
    
    You're right, the changes don't allow slots to be released by a 
TaskExecutor. We can change that by explicitly reporting to the RM when a slot 
becomes free. This may also decrease latency when a task finishes and new 
ones are waiting to be deployed.
    
    >b. When we handleSlotRequestFailedAtTaskManager, we will make this slot 
free again. If the slot is occupied by some other task now, we will 
continuously failed for all allocation on this slot. ( this can be fixed by 3)
    
    How can that happen? The slot will not appear free while it is allocated at 
the TaskExecutor. When allocation fails, the slot is marked as free and the 
request is retried immediately. It must succeed eventually if the initial 
decision to allocate the slot was correct. However, we need to explicitly check 
whether a TaskExecutor has deregistered, to make sure old TaskExecutors don't 
send failures which trigger allocation of slots that were already removed (due 
to TaskExecutor deregistration). That should be fixed with this PR.
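
    As a rough sketch of that check: the class below is a simplified, 
hypothetical stand-in for the slot bookkeeping, not the actual code in this PR. 
Only the method name handleSlotRequestFailedAtTaskManager comes from the 
discussion; the string IDs, the other methods, and the retry queue are 
assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical, simplified slot bookkeeping; not Flink's actual SlotManager. */
class SlotManagerSketch {
    // slotId -> id of the registered TaskExecutor that owns the slot
    private final Map<String, String> slotOwner = new HashMap<>();
    private final Set<String> freeSlots = new HashSet<>();
    // Requests waiting to be retried (assumption: retry is modeled as a queue)
    private final Queue<String> pendingRequests = new ArrayDeque<>();

    void registerSlot(String executorId, String slotId) {
        slotOwner.put(slotId, executorId);
        freeSlots.add(slotId);
    }

    void deregisterTaskExecutor(String executorId) {
        // Drop every slot owned by this executor, then forget its free slots.
        slotOwner.values().removeIf(executorId::equals);
        freeSlots.retainAll(slotOwner.keySet());
    }

    /** Returns true if the failure was processed, false if it was fenced off. */
    boolean handleSlotRequestFailedAtTaskManager(
            String requestId, String executorId, String slotId) {
        // Fencing: ignore failures from executors that no longer own the slot,
        // e.g. because they have deregistered in the meantime.
        if (!executorId.equals(slotOwner.get(slotId))) {
            return false;
        }
        freeSlots.add(slotId);          // mark the slot as free again
        pendingRequests.add(requestId); // and schedule the request for retry
        return true;
    }

    boolean isFree(String slotId) {
        return freeSlots.contains(slotId);
    }

    int pendingRequestCount() {
        return pendingRequests.size();
    }
}
```

    With this shape, a late failure from a deregistered TaskExecutor neither 
resurrects its removed slot nor enqueues another retry.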
    
    > 1. We can remove the update status part entirely, since it can only do 
new slot registration now, we can just move it to the task executor first 
registration.
    
    Very good suggestion. Let's move the initial registration and 
reconciliation of slots to the registration message.
    
    To wrap up, let's change the following:
    
    1. Move the slot registration and allocation report to the registration of 
the TaskExecutor
    2. Let the TaskExecutor immediately notify the ResourceManager once a slot 
becomes free
    3. Change the fencing in handleSlotRequestFailedAtTaskManager to protect 
against TaskExecutors which are not registered anymore.
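
    For illustration, points 1 and 2 could look roughly like the sketch below. 
All names here (ResourceManagerSketch, registerTaskExecutor, notifySlotFree, 
the boolean slot report) are hypothetical and only meant to show the message 
flow, not Flink's actual RPC interfaces.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed protocol; not Flink's actual API. */
class ResourceManagerSketch {
    enum SlotState { FREE, ALLOCATED }

    // executorId -> (slotId -> state as last reported by the TaskExecutor)
    private final Map<String, Map<String, SlotState>> slotsByExecutor = new HashMap<>();

    /** Point 1: the registration message carries the full slot report. */
    void registerTaskExecutor(String executorId, Map<String, Boolean> slotReport) {
        Map<String, SlotState> slots = new HashMap<>();
        slotReport.forEach((slotId, allocated) ->
                slots.put(slotId, allocated ? SlotState.ALLOCATED : SlotState.FREE));
        // Replaces any stale view the ResourceManager had of this executor.
        slotsByExecutor.put(executorId, slots);
    }

    /** Point 2: the TaskExecutor reports a slot as soon as it becomes free. */
    boolean notifySlotFree(String executorId, String slotId) {
        Map<String, SlotState> slots = slotsByExecutor.get(executorId);
        // Messages from unknown executors or about unknown slots are ignored.
        if (slots == null || !slots.containsKey(slotId)) {
            return false;
        }
        slots.put(slotId, SlotState.FREE);
        return true;
    }

    void deregisterTaskExecutor(String executorId) {
        slotsByExecutor.remove(executorId);
    }

    long freeSlotCount() {
        return slotsByExecutor.values().stream()
                .flatMap(slots -> slots.values().stream())
                .filter(state -> state == SlotState.FREE)
                .count();
    }
}
```

    The point of the sketch is that the ResourceManager learns the complete 
slot state once at registration and afterwards only receives incremental 
"slot free" notifications.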
    
    Let me know if that would work for you.
    


