Andrey Zagrebin created FLINK-19832:
---------------------------------------

             Summary: Improve handling of immediately failed physical slot in 
SlotSharingExecutionSlotAllocator
                 Key: FLINK-19832
                 URL: https://issues.apache.org/jira/browse/FLINK-19832
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.12.0
            Reporter: Andrey Zagrebin
            Assignee: Andrey Zagrebin


Improve handling of immediately failed physical slot in 
SlotSharingExecutionSlotAllocator

If the physical slot future fails immediately for a new SharedSlot in 
SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot but we continue to 
add logical slots to this SharedSlot, the logical slot eventually fails as well 
and gets removed from the {{SharedSlot}}, which is then released (state RELEASED). 
Subsequent attempts to add logical slots in the loop of 
{{allocateLogicalSlotsFromSharedSlots}} then fail the scheduling on the 
ALLOCATED state check, because the SharedSlot is already RELEASED.

The subsequent bulk timeout check will also not find the SharedSlot and will 
fail with an NPE.

Hence, a SharedSlot whose physical slot future failed immediately should 
not be kept in the SlotSharingExecutionSlotAllocator, and the logical slot 
requests depending on it can be returned as failed right away. The bulk timeout 
check does not need to be started, because if some physical (and hence logical) 
slot requests fail, the whole bulk will be canceled by the scheduler.
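
The proposed handling can be sketched roughly as follows. This is a minimal, 
self-contained illustration and not the actual Flink code: the class 
SharedSlotAllocatorSketch, its simplified SharedSlot, the String-based ids and 
the physicalSlotRequester function are all hypothetical stand-ins. It only 
shows the idea of not registering a SharedSlot whose physical slot future has 
already failed, while still returning an already failed logical slot future.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Illustrative stand-in only; not the real SlotSharingExecutionSlotAllocator. */
final class SharedSlotAllocatorSketch {

    /** Hypothetical, heavily simplified SharedSlot: it only wraps the physical slot future. */
    static final class SharedSlot {
        private final CompletableFuture<String> physicalSlotFuture;

        SharedSlot(CompletableFuture<String> physicalSlotFuture) {
            this.physicalSlotFuture = physicalSlotFuture;
        }

        /** The logical slot future completes, or fails, together with the physical one. */
        CompletableFuture<String> allocateLogicalSlot(String executionVertexId) {
            return physicalSlotFuture.thenApply(slot -> slot + "/" + executionVertexId);
        }
    }

    // Shared slots currently tracked, keyed by a hypothetical slot sharing group id.
    private final Map<String, SharedSlot> sharedSlots = new HashMap<>();

    /**
     * Returns a logical slot future for the given group. A new SharedSlot is
     * registered only if its physical slot future has not already failed.
     */
    CompletableFuture<String> allocateLogicalSlot(
            String groupId,
            String executionVertexId,
            Function<String, CompletableFuture<String>> physicalSlotRequester) {
        SharedSlot sharedSlot = sharedSlots.get(groupId);
        if (sharedSlot == null) {
            CompletableFuture<String> physicalSlotFuture = physicalSlotRequester.apply(groupId);
            sharedSlot = new SharedSlot(physicalSlotFuture);
            if (!physicalSlotFuture.isCompletedExceptionally()) {
                sharedSlots.put(groupId, sharedSlot);
            }
            // Otherwise the SharedSlot is not kept; the logical slot future
            // returned below is already completed exceptionally.
        }
        return sharedSlot.allocateLogicalSlot(executionVertexId);
    }

    boolean hasSharedSlot(String groupId) {
        return sharedSlots.containsKey(groupId);
    }
}
```

In this sketch, a caller that passes a requester returning an already failed 
future gets back a failed logical slot future, and hasSharedSlot reports false 
for that group, so no later state check or timeout check can run against a 
released SharedSlot.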

If the last assumption does not hold for future scheduling, this bulk failure 
might need additional explicit cancellation of pending requests. We expect to 
refactor it for the declarative scheduling anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
