[ https://issues.apache.org/jira/browse/STORM-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated STORM-2083:
----------------------------------
    Labels: blacklist pull-request-available scheduling  (was: blacklist scheduling)

> Blacklist Scheduler
> -------------------
>
>                 Key: STORM-2083
>                 URL: https://issues.apache.org/jira/browse/STORM-2083
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Howard Lee
>              Labels: blacklist, pull-request-available, scheduling
>          Time Spent: 15h 10m
>  Remaining Estimate: 0h
>
> My company experienced a production fault in which a critical switch caused
> an unstable network for a set of machines, with a packet loss rate of
> 30%-50%. In such a fault, the supervisors and workers on the machines are
> not definitively dead, which would be easy to handle. Instead, they are
> still alive but very unstable, and they occasionally lose their heartbeat
> to nimbus. In such circumstances, nimbus will still assign jobs to these
> machines but will soon find them invalid again, resulting in very slow
> convergence to a stable state.
> To deal with such unstable cases, we intend to implement a blacklist
> scheduler, which will temporarily add the unstable nodes (supervisors,
> slots) to a blacklist and resume them later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)