[
https://issues.apache.org/jira/browse/BOOKKEEPER-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130687#comment-15130687
]
Sijie Guo commented on BOOKKEEPER-889:
--------------------------------------
That's cool. [~sboobna] I assigned this to you.
> BookKeeper client should try not to use bookies with errors/timeouts when
> forming a new ensemble
> ------------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-889
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-889
> Project: Bookkeeper
> Issue Type: Improvement
> Components: bookkeeper-client
> Affects Versions: 4.3.2
> Reporter: Siddharth Sunil Boobna
> Assignee: Siddharth Sunil Boobna
> Fix For: 4.4.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Due to various issues (slow disks, network issues, bugs, etc), the bookkeeper
> can be slow or unresponsive for extended period of times. During this time,
> r/w operations will fail/timeout and ledgers will create a new segment and
> form a new ensemble replacing this bookie. For new ledgers, it might still
> pick up this bookie or we can replace this bookie with another faulty bookie
> if we have multiple faulty bookies.
> The BK client should keep stats about these failure rates for all the bookies
> and it should "quarantine" failing bookies for a certain amount of time. Once
> a bookie is quarantined, it will not be picked up in forming a new ensemble,
> unless no other "healthy" bookies are available.
> Solution:
> Keep a counter of errors in the bookie client pool and periodically check for
> number of errors in a given time span and mark these bookies as "quarantined"
> in the BookieWatcher.
> In the BookieWatcher, try to create an ensemble list excluding the
> quarantined bookies and if that fails, fall back to an empty exclusion list.
> We will also remove the bookies from the quarantined list after a
> configurable period of time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)