Ethanlm commented on a change in pull request #3222: STORM-3596 use send assignment status in blacklist scheduling URL: https://github.com/apache/storm/pull/3222#discussion_r405254340
########## File path: storm-server/src/main/java/org/apache/storm/scheduler/blacklist/BlacklistScheduler.java ########## @@ -129,7 +142,34 @@ public void schedule(Topologies topologies, Cluster cluster) { return underlyingScheduler.config(); } - private void badSupervisors(Map<String, SupervisorDetails> supervisors) { + private Map<String, Integer> getNodeAssignmentFailures() { Review comment: In the original backlist scheduler, when we decide to blacklist a node or not, we have `blacklist.scheduler.tolerance.time.secs` and `blacklist.scheduler.tolerance.count`. If the node has heartbeat failure more than the `count` in the `tolerance.time.secs` period, it will be blacklisted. But here with getNodeAssignmentFailures, since it's using a `RotatingMap` with 3 buckets and it rotates every 3600s (1hr), which means it keeps 2~3 hrs of data in the RotatingMap. So `getNodeAssignmentFailures` here means if sending nodeAssignment to a node failed more than `blacklist.scheduler.assignment.failure.tolerance.count` times, it will be added to the blacklist. This is not consistent with heartbeatFailures and can be confusing. It is some kind of failure tolerance but doesn't follow the same logic and doesn't not use the same `blacklist.scheduler.tolerance.time.secs`. I think we should follow the same logic to determine whether blacklisting a node or not. (The resume part is the same, which is good. It uses the same logic as the blacklist caused by heartbeat failure.) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services