github-actions[bot] commented on code in PR #65049:
URL: https://github.com/apache/doris/pull/65049#discussion_r3503208692


##########
fe/fe-core/src/main/java/org/apache/doris/datasource/kafka/KafkaUtil.java:
##########
@@ -240,6 +240,7 @@ private static InternalService.PProxyResult getInfoRequest(InternalService.PProx
                     Backend backend = Env.getCurrentSystemInfo().getBackend(beId);
                     if (backend != null && backend.isLoadAvailable()
                             && !backend.isDecommissioned()
+                            && !backend.isDecommissioning()
                             && !failedBeIds.contains(beId)

Review Comment:
   The new primary predicate skips decommissioning BEs, but the fallback below 
still bypasses it when `backendIds` is empty. Lines 255-260 repopulate 
candidates from `routineLoadManager.getBlacklist()` with only `backend != 
null`, so a BE that was blacklisted by an earlier metadata retry and later 
enters decommissioning can still be selected here and receive 
`BackendServiceProxy.getInfo` once all primary candidates are filtered out. 
`KinesisUtil` has the same fallback at lines 112-115. Please apply the same 
load-available/non-decommissioned predicate to the blacklist fallback, or share 
candidate construction, before sending metadata RPCs.
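
A minimal sketch of the "share candidate construction" option, not the actual Doris code. `BackendState`, `isEligible`, and `selectCandidates` are hypothetical names, and the `BackendState` fields only mirror `isLoadAvailable()`, `isDecommissioned()`, and `isDecommissioning()` from the diff. The point is that the primary path and the blacklist fallback go through one predicate, so a blacklisted BE that later starts decommissioning is never handed a metadata RPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MetadataCandidateSketch {
    /** Hypothetical stand-in for the backend flags referenced in the diff. */
    static final class BackendState {
        final boolean loadAvailable;
        final boolean decommissioned;
        final boolean decommissioning;

        BackendState(boolean loadAvailable, boolean decommissioned, boolean decommissioning) {
            this.loadAvailable = loadAvailable;
            this.decommissioned = decommissioned;
            this.decommissioning = decommissioning;
        }
    }

    /** Single eligibility predicate used by both the primary path and the fallback. */
    static boolean isEligible(BackendState backend) {
        return backend != null
                && backend.loadAvailable
                && !backend.decommissioned
                && !backend.decommissioning;
    }

    /**
     * Builds the candidate list. The blacklist is consulted only when the
     * primary pass yields nothing, and it is filtered by the same predicate.
     */
    static List<Long> selectCandidates(List<Long> primaryBeIds,
                                       Set<Long> failedBeIds,
                                       List<Long> blacklistedBeIds,
                                       Map<Long, BackendState> backends) {
        List<Long> candidates = new ArrayList<>();
        for (Long beId : primaryBeIds) {
            if (isEligible(backends.get(beId)) && !failedBeIds.contains(beId)) {
                candidates.add(beId);
            }
        }
        if (candidates.isEmpty()) {
            // Fallback: previously only checked backend != null.
            for (Long beId : blacklistedBeIds) {
                if (isEligible(backends.get(beId))) {
                    candidates.add(beId);
                }
            }
        }
        return candidates;
    }
}
```

Whether the fallback should also skip `failedBeIds` is left open here; the sketch only applies the load-available/non-decommissioned check requested above.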



##########
fe/fe-core/src/main/java/org/apache/doris/load/routineload/RoutineLoadManager.java:
##########
@@ -563,7 +565,7 @@ public long getAvailableBeForTask(long jobId, long previousBeId) throws UserExce
             }
             // 4. on the basis of selecting the maximum idle slot be,
             //    try to reuse the object cache as much as possible
-            if (previousBeIdleTaskNum == maxIdleSlotNum) {
+            if (previousBeAvailable && previousBeIdleTaskNum == maxIdleSlotNum) {
                 return previousBeId;

Review Comment:
   This still allows a saturated previous BE to be reused when the only idle 
capacity comes from a BE that the new availability filter excludes. The 
scheduler gate uses `getClusterIdleSlotNum()`, whose slot map is still built 
from `getAllBackendIds(true)` and therefore counts alive decommissioning BEs. 
After this PR, `getAvailableBackendIds()` excludes that draining BE, so with 
one saturated eligible previous BE and one idle decommissioning BE, the loop 
below leaves `maxIdleSlotNum == 0` and `resultBeId == -1`, but this tie check 
returns the previous BE anyway and `allocateTaskToBe()` submits another task to 
that saturated backend. Please keep routine-load slot accounting aligned with 
the new eligible-backend predicate, and only reuse the previous BE when it has 
a positive idle slot.
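
A minimal sketch of the requested guard, assuming the idle-slot map has already been restricted to eligible (non-decommissioning) backends. `PreviousBeReuseSketch` and `pickBackend` are hypothetical names, and this is not the real `getAvailableBeForTask()`. It shows that a `0 == 0` tie no longer returns the saturated previous BE.

```java
import java.util.Map;

final class PreviousBeReuseSketch {
    /**
     * Picks a backend from a map of eligible BE id -> idle slot count.
     * Returns -1 when no eligible backend has a positive idle slot.
     */
    static long pickBackend(Map<Long, Integer> idleSlotsByEligibleBe, long previousBeId) {
        long resultBeId = -1L;
        int maxIdleSlotNum = 0;
        for (Map.Entry<Long, Integer> entry : idleSlotsByEligibleBe.entrySet()) {
            if (entry.getValue() > maxIdleSlotNum) {
                maxIdleSlotNum = entry.getValue();
                resultBeId = entry.getKey();
            }
        }
        // Reuse the previous BE only if it is eligible, has at least one idle
        // slot, and ties for the maximum; a saturated BE never matches.
        Integer previousIdle = idleSlotsByEligibleBe.get(previousBeId);
        if (previousIdle != null && previousIdle > 0 && previousIdle == maxIdleSlotNum) {
            return previousBeId;
        }
        return resultBeId;
    }
}
```

In the scenario described above (one saturated eligible previous BE, one idle decommissioning BE that is absent from the map), this returns `-1` instead of the previous BE. The scheduler gate would still need slot accounting built from the same eligible set so that it does not schedule a task that no eligible backend can take.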



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

