Zhu Zhu created FLINK-24005:
-------------------------------

             Summary: Resource requirements declaration may be incorrect if 
JobMaster disconnects with a TaskManager with available slots in the SlotPool
                 Key: FLINK-24005
                 URL: https://issues.apache.org/jira/browse/FLINK-24005
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.13.2, 1.12.5, 1.14.0
            Reporter: Zhu Zhu


When a TaskManager disconnects with JobMaster, it will trigger the 
`DeclarativeSlotPoolService#decreaseResourceRequirementsBy()` for all the slots 
that are registered to the JobMaster from the TaskManager. If the slots are 
still available, i.e. not assigned to any task,  the 
`decreaseResourceRequirementsBy` may lead to incorrect resource requirements 
declaration.

For example, there is one job with 3 source tasks only. It requires 3 slots and 
declares for 3 slots. Initially all the tasks are running. Suddenly one task 
failed and waits for some delay before restarting. The previous slot is 
returned to the SlotPool. Now the job requires 2 slots and declares for 2 
slots. At this moment, the TaskManager of that returned slot get lost. After 
the triggered `decreaseResourceRequirementsBy`, the job only declares for 1 
slot. Finally, when the failed task starts to re-schedule, the job will declare 
for 2 slots while it actually needs 3 slots.

The attached log of a real job and logs of the added test in 
https://github.com/zhuzhurk/flink/commit/59ca0ac5fa9c77b97c6e8a43dcc53ca8a0ad6c37
 can demonstrate this case.
Note that the real job is configured with a large 
"restart-strategy.fixed-delay.delay" and and large "slot.idle.timeout". So 
possibly in production it is a rare case.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to