XComp commented on code in PR #23977:
URL: https://github.com/apache/flink/pull/23977#discussion_r1441442856


##########
flink-runtime/src/test/java/org/apache/flink/runtime/minicluster/MiniClusterITCase.java:
##########
@@ -127,6 +129,11 @@ private void runHandleJobsWhenNotEnoughSlots(final JobGraph jobGraph) throws Exc
         configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 100L);
         // this triggers the failure for the adaptive scheduler
         configuration.set(JobManagerOptions.RESOURCE_WAIT_TIMEOUT, Duration.ofMillis(100));
+        // disable the resource requirements check from slot manager the make sure the scheduler
+        // cannot acquire slots until timeout.
+        configuration.set(

Review Comment:
   Thanks for your investigation of the test instability. I verified what you mentioned in [your FLINK-33414 comment](https://issues.apache.org/jira/browse/FLINK-33414?focusedCommentId=17799308&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17799308). We do indeed produce different stack traces for the two cases that could cause the failure that is tested here:
   * [PhysicalSlotRequestBulkCheckerImpl#schedulePendingRequestBulkWithTimestampCheck](https://github.com/apache/flink/blob/72bff2a2d0072602e4e625476bf5480dc50dc76c/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/PhysicalSlotRequestBulkCheckerImpl.java#L85) handles timeouts by cancelling the bulk with a `NoResourceAvailableException` that has a `TimeoutException` as its cause.
   * In contrast, [FineGrainedSlotManager#checkResourceRequirements:676](https://github.com/apache/flink/blob/7299da4cf688a2d87fd918b6327a0573bc88cbd8/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/FineGrainedSlotManager.java#L676) triggers the code path that leads to a `NoResourceAvailableException` being thrown without any cause in [DeclarativeSlotPoolBridge#failPendingRequests](https://github.com/apache/flink/blob/72bff2a2d0072602e4e625476bf5480dc50dc76c/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/DeclarativeSlotPoolBridge.java#L413).
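
   To make the difference between the two cases concrete, here is a minimal, Flink-free sketch of how a test could tell them apart by walking the cause chain. `FailureCauseSketch`, `StandInNoResourceAvailableException`, `hasInChain` and `classify` are hypothetical names used only for illustration; the stand-in exception merely mimics the shape of Flink's `NoResourceAvailableException` and is not the real class.

   ```java
   import java.util.concurrent.TimeoutException;

   final class FailureCauseSketch {

       // Hypothetical stand-in for Flink's NoResourceAvailableException so that
       // the sketch compiles without Flink on the classpath.
       static final class StandInNoResourceAvailableException extends Exception {
           StandInNoResourceAvailableException(String message) {
               super(message);
           }

           StandInNoResourceAvailableException(String message, Throwable cause) {
               super(message, cause);
           }
       }

       // Returns true if the throwable itself or any of its causes is of the given type.
       static boolean hasInChain(Throwable failure, Class<? extends Throwable> type) {
           for (Throwable current = failure; current != null; current = current.getCause()) {
               if (type.isInstance(current)) {
                   return true;
               }
           }
           return false;
       }

       // Maps a failure to the code path that presumably produced it:
       // a TimeoutException in the chain points at the slot request timeout,
       // its absence points at the requirements check.
       static String classify(Throwable failure) {
           if (!hasInChain(failure, StandInNoResourceAvailableException.class)) {
               return "other";
           }
           return hasInChain(failure, TimeoutException.class)
                   ? "slot-request-timeout"
                   : "requirements-check";
       }

       public static void main(String[] args) {
           Throwable viaTimeout =
                   new StandInNoResourceAvailableException(
                           "bulk cancelled", new TimeoutException("slot request timed out"));
           Throwable viaRequirementsCheck =
                   new StandInNoResourceAvailableException("requirements cannot be fulfilled");

           System.out.println(classify(viaTimeout));
           System.out.println(classify(viaRequirementsCheck));
       }
   }
   ```

   An assertion on the absence of a `TimeoutException` in the chain would pin the test to the second code path instead of accepting either of them.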
   
   FLINK-32846 revealed this test scenario ambiguity. I played around with the 
configuration of this test a bit. I would assume that we could consistently 
reproduce the latter scenario by increasing the timeout from 100ms to something 
very large and keeping the requirements check delay low:
   ```java
   // 10s timeout for the sake of testing this change (Long.MAX_VALUE would be a
   // better value)
   configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 10000);
   configuration.set(JobManagerOptions.RESOURCE_WAIT_TIMEOUT, Duration.ofMillis(10000));
   configuration.set(
           ResourceManagerOptions.REQUIREMENTS_CHECK_DELAY, Duration.ofMillis(50));
   ```
   This seems to work (the root cause is always the `NoResourceAvailableException`). But I'm puzzled by the runtime of the test, which still seems to correlate with the slot request timeout. Does this indicate that the early requirements check isn't working properly? I would have expected the test to finish earlier (due to the requirements check happening after 50ms). :thinking:


