[ 
https://issues.apache.org/jira/browse/STORM-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466809#comment-15466809
 ] 

Howard Lee commented on STORM-2083:
-----------------------------------

Storm scheduling already has a blacklist, which the isolation scheduler uses. 
We have decided to reuse it. The blacklist scheduler will only add unstable 
nodes to that blacklist and leave the actual scheduling to the underlying 
scheduler (the default scheduler for now). 
Some configs:
1.      *blacklist.scheduler.tolerance.time.secs*: The length, in seconds, of 
the window over which the blacklist scheduler tracks bad slots or supervisors. 
Default: 5 min.
2.      *blacklist.scheduler.tolerance.count*: The number of hits within the 
tolerance time that triggers blacklisting. Default: 3.
3.      *blacklist.scheduler.resume.time.secs*: The number of seconds after 
which blacklisted slots or supervisors are resumed. Default: 30 min.
4.      *blacklist.scheduler.reporter*: The class the blacklist scheduler uses 
to report blacklist changes. We do not want Storm to blacklist nodes silently; 
additions to the blacklist can be reported by email, for example. Default: 
org.apache.storm.scheduler.blacklist.reporters.LogReporter
5.      *blacklist.scheduler.strategy*: The class that specifies the eviction 
strategy used by the blacklist scheduler. Default: 
org.apache.storm.scheduler.blacklist.strategies.DefaultBlacklistStrategy

The blacklist scheduler maintains a cache of supervisors. On each scheduling 
round it compares the incoming supervisors with the cache, adds the new ones, 
and removes those that have not appeared at all during the tolerance time. (We 
can assume those have already been removed from the cluster; if not, they will 
be added back to the cache as soon as they appear again.)
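
The cache update described above could be sketched roughly as follows. This is 
only an illustration, not the code in the patch: the class name 
SupervisorCache, the update method, and the choice to count the tolerance time 
in scheduling rounds are all assumptions, as is treating a cached supervisor 
that is absent from the current round as "bad".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical cache of known supervisors (illustrative names only). */
class SupervisorCache {
    private final int toleranceRounds;
    // supervisor id -> number of consecutive rounds it has been missing
    private final Map<String, Integer> missingRounds = new HashMap<>();

    SupervisorCache(int toleranceRounds) {
        this.toleranceRounds = toleranceRounds;
    }

    /**
     * Merges this round's supervisors into the cache and returns the cached
     * supervisors that did not show up (assumed here to be the "bad" ones).
     */
    Set<String> update(Set<String> incoming) {
        for (String id : incoming) {
            missingRounds.put(id, 0); // add new supervisors, reset known ones
        }
        Set<String> missing = new HashSet<>();
        Iterator<Map.Entry<String, Integer>> it = missingRounds.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            if (incoming.contains(e.getKey())) {
                continue;
            }
            int rounds = e.getValue() + 1;
            if (rounds >= toleranceRounds) {
                it.remove(); // absent for the whole tolerance time: evict
            } else {
                e.setValue(rounds);
                missing.add(e.getKey());
            }
        }
        return missing;
    }

    int size() {
        return missingRounds.size();
    }
}
```

A supervisor that reappears after eviction is simply inserted again by the 
first loop, which matches the "added back to cache" behaviour described above.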
The blacklist scheduler also maintains a circular buffer with a fixed length of 
_tolerance.time / monitor.freq_ as a sliding window. On every scheduling round, 
the bad slots or supervisors are added to the sliding window. (We implemented 
the circular buffer ourselves instead of using the Disruptor RingBuffer inside 
Storm, which I think is not meant for sliding windows. This is open for 
discussion.)
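
To make the window concrete, here is a minimal sketch of a fixed-length window 
over scheduling rounds. It is not the circular buffer from the patch: the class 
name BadNodeWindow and its methods are hypothetical, and java.util.ArrayDeque 
(itself backed by a circular array) stands in for a hand-written buffer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical window holding the bad nodes of the most recent rounds. */
class BadNodeWindow {
    private final int capacity;
    private final Deque<Set<String>> rounds = new ArrayDeque<>();

    BadNodeWindow(int toleranceTimeSecs, int monitorFreqSecs) {
        // Window length = tolerance.time / monitor.freq, at least one round.
        this.capacity = Math.max(1, toleranceTimeSecs / monitorFreqSecs);
    }

    /** Records the bad nodes seen in one scheduling round. */
    void record(Set<String> badNodes) {
        if (rounds.size() == capacity) {
            rounds.removeFirst(); // drop the oldest round
        }
        rounds.addLast(new HashSet<>(badNodes));
    }

    /** Counts in how many rounds of the window each node was bad. */
    Map<String, Integer> hitCounts() {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> round : rounds) {
            for (String node : round) {
                counts.merge(node, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a tolerance time of 300 seconds and an assumed monitor frequency of 10 
seconds, the window would hold the last 30 scheduling rounds.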
The blacklist map in the blacklist scheduler is keyed by node info. Each value 
is initialized to _resume.time / monitor.freq_ and decreased by 1 on each 
scheduling round, and the entry is removed when it reaches 0. Nodes that appear 
more than tolerance.count times in the sliding window are added to this 
blacklist map.
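
The countdown can be sketched as below. Again this is only an illustration 
under assumptions: BlacklistCountdown and onSchedule are hypothetical names, 
the per-node hit counts are passed in as a plain map, and the threshold is 
taken as strictly "more than" tolerance.count, following the wording above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical blacklist map with a per-node countdown in scheduling rounds. */
class BlacklistCountdown {
    private final int toleranceCount;
    private final int resumeRounds;
    // node -> scheduling rounds left before the node is resumed
    private final Map<String, Integer> remaining = new HashMap<>();

    BlacklistCountdown(int toleranceCount, int resumeTimeSecs, int monitorFreqSecs) {
        this.toleranceCount = toleranceCount;
        // Initial value = resume.time / monitor.freq, at least one round.
        this.resumeRounds = Math.max(1, resumeTimeSecs / monitorFreqSecs);
    }

    /** One scheduling round: age existing entries, then admit new nodes. */
    void onSchedule(Map<String, Integer> hitCounts) {
        Iterator<Map.Entry<String, Integer>> it = remaining.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            int left = e.getValue() - 1;
            if (left <= 0) {
                it.remove(); // countdown hit 0: resume the node
            } else {
                e.setValue(left);
            }
        }
        for (Map.Entry<String, Integer> e : hitCounts.entrySet()) {
            if (e.getValue() > toleranceCount) { // assumption: strict threshold
                remaining.putIfAbsent(e.getKey(), resumeRounds);
            }
        }
    }

    Set<String> blacklisted() {
        return Collections.unmodifiableSet(remaining.keySet());
    }
}
```

Using putIfAbsent means a node that keeps misbehaving while blacklisted does 
not have its countdown restarted; whether it should is a design choice this 
sketch leaves open.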


> Blacklist Scheduler
> -------------------
>
>                 Key: STORM-2083
>                 URL: https://issues.apache.org/jira/browse/STORM-2083
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Howard Lee
>              Labels: blacklist, scheduling
>             Fix For: 1.0.1, 1.0.2, 1.1.0, 1.0.3
>
>
> My company went through a production fault in which a critical switch 
> caused an unstable network for a set of machines, with a packet loss rate of 
> 30%-50%. In such a fault, the supervisors and workers on the machines are not 
> definitely dead, which would be easy to handle. Instead they are still alive 
> but very unstable, and occasionally lose their heartbeat to Nimbus. In such 
> circumstances Nimbus still assigns jobs to these machines but soon finds 
> them invalid again, resulting in very slow convergence to a stable state.
> To deal with such unstable cases, we intend to implement a blacklist 
> scheduler, which will add the unstable nodes (supervisors, slots) to the 
> blacklist temporarily and resume them later. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)