[
https://issues.apache.org/jira/browse/BOOKKEEPER-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129907#comment-15129907
]
Sijie Guo commented on BOOKKEEPER-889:
--------------------------------------
[~sboobna] that sounds like a pretty good solution. +1 on that. Are you working
on implementing it?
> BookKeeper client should try not to use bookies with errors/timeouts when
> forming a new ensemble
> ------------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-889
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-889
> Project: Bookkeeper
> Issue Type: Improvement
> Components: bookkeeper-client
> Affects Versions: 4.3.2
> Reporter: Siddharth Sunil Boobna
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Due to various issues (slow disks, network problems, bugs, etc.), a bookie
> can be slow or unresponsive for extended periods of time. During this time,
> r/w operations on it will fail or time out, and each affected ledger will
> create a new segment and form a new ensemble replacing this bookie. New
> ledgers might still pick up this bookie, and a replacement might itself be
> another faulty bookie if we have multiple faulty bookies.
> The BK client should keep stats about these failure rates for all the bookies
> and it should "quarantine" failing bookies for a certain amount of time. Once
> a bookie is quarantined, it will not be picked up in forming a new ensemble,
> unless no other "healthy" bookies are available.
> Solution:
> Keep a counter of errors per bookie in the bookie client pool, periodically
> check the number of errors in a given time span, and mark bookies with too
> many errors as "quarantined" in the BookieWatcher.
> In the BookieWatcher, try to create an ensemble list excluding the
> quarantined bookies and if that fails, fall back to an empty exclusion list.
> We will also remove the bookies from the quarantined list after a
> configurable period of time.
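
For illustration only, the quarantine logic described in the quoted issue could
be sketched roughly as below. This is not BookKeeper code: the class
BookieQuarantine, its methods, the error threshold, and the injected clock are
all hypothetical names chosen for the example, and bookies are plain strings.
Taking the first N candidates stands in for whatever placement policy the real
client would apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the proposal; none of these names are BookKeeper APIs. */
class BookieQuarantine {
    private final int errorThreshold;      // assumed: errors per check interval that trigger quarantine
    private final long quarantineMillis;   // the "configurable period of time" from the issue
    private final LongSupplier clock;      // injected so the sketch can be exercised deterministically
    private final Map<String, Integer> errorCounts = new HashMap<>();
    private final Map<String, Long> quarantinedUntil = new HashMap<>();

    BookieQuarantine(int errorThreshold, long quarantineMillis, LongSupplier clock) {
        this.errorThreshold = errorThreshold;
        this.quarantineMillis = quarantineMillis;
        this.clock = clock;
    }

    /** Called by the client whenever a request to a bookie fails or times out. */
    synchronized void recordError(String bookie) {
        errorCounts.merge(bookie, 1, Integer::sum);
    }

    /** Periodic check: quarantine bookies at or over the threshold, then reset the counters. */
    synchronized void checkErrors() {
        long now = clock.getAsLong();
        for (Map.Entry<String, Integer> e : errorCounts.entrySet()) {
            if (e.getValue() >= errorThreshold) {
                quarantinedUntil.put(e.getKey(), now + quarantineMillis);
            }
        }
        errorCounts.clear();
    }

    /** Returns the currently quarantined bookies, dropping entries whose period has expired. */
    synchronized Set<String> quarantined() {
        long now = clock.getAsLong();
        quarantinedUntil.values().removeIf(until -> until <= now);
        return new HashSet<>(quarantinedUntil.keySet());
    }

    /** Tries to exclude quarantined bookies; falls back to an empty exclusion list if that fails. */
    synchronized List<String> newEnsemble(List<String> available, int ensembleSize) {
        Set<String> excluded = quarantined();
        List<String> healthy = new ArrayList<>();
        for (String bookie : available) {
            if (!excluded.contains(bookie)) {
                healthy.add(bookie);
            }
        }
        List<String> pool = healthy.size() >= ensembleSize ? healthy : available;
        if (pool.size() < ensembleSize) {
            throw new IllegalStateException("not enough bookies for ensemble of " + ensembleSize);
        }
        return new ArrayList<>(pool.subList(0, ensembleSize));
    }

    public static void main(String[] args) {
        BookieQuarantine q = new BookieQuarantine(2, 60_000L, System::currentTimeMillis);
        q.recordError("bookie-1");
        q.recordError("bookie-1");
        q.checkErrors();
        // bookie-1 is skipped while two healthy bookies are available.
        System.out.println(q.newEnsemble(Arrays.asList("bookie-1", "bookie-2", "bookie-3"), 2));
    }
}
```

In this sketch a quarantined bookie is only used again when too few healthy
bookies remain to fill the ensemble, or once its quarantine period has expired.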
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)