[ 
https://issues.apache.org/jira/browse/KAFKA-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-8526.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> Broker may select a failed dir for new replica even in the presence of other 
> live dirs
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8526
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8526
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1
>            Reporter: Anna Povzner
>            Assignee: Igor Soarez
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Suppose a broker is configured with multiple log dirs. One of the log dirs 
> fails, but there is no load on that dir, so the broker does not know about 
> the failure yet, _i.e._, the failed dir is still in LogManager#_liveLogDirs. 
> Suppose a new topic gets created, and the controller chooses the broker with 
> the failed log dir to host one of the replicas. The broker gets a LeaderAndIsr 
> request with the isNew flag set. LogManager#getOrCreateLog() selects a log dir 
> for the new replica from _liveLogDirs, then one of two things can happen:
> 1) getAbsolutePath can fail, in which case getOrCreateLog will throw an 
> IOException
> 2) Creating the directory for the new replica log may fail (_e.g._, if the 
> directory becomes read-only, so getAbsolutePath worked). 
> In both cases, the selected dir will be marked offline (which is correct). 
> However, LeaderAndIsr will return an error and the replica will be marked 
> offline, even though the broker may have other live dirs. 
> *Proposed solution*: The broker should retry selecting a dir for the new 
> replica if the initially selected dir threw an IOException when trying to 
> create a directory for the new replica. We should be able to do that in the 
> LogManager#getOrCreateLog() method, but keep in mind that 
> logDirFailureChannel.maybeAddOfflineLogDir does not synchronously remove the 
> dir from _liveLogDirs. So, it makes sense to select the initial dir by calling 
> LogManager#nextLogDir (current implementation), but if we fail to create the 
> log on that dir, one approach is to select the next dir from _liveLogDirs in 
> round-robin fashion (until we get back to the initial log dir – the case 
> where all dirs failed).
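
For illustration only, the round-robin retry proposed above might look roughly like the sketch below. It is not the actual Kafka patch: LogDirRetrySketch, DirCreator and createWithRetry are hypothetical names, the list of live dirs stands in for _liveLogDirs, and startIndex stands in for whatever dir LogManager#nextLogDir picked first. The loop visits each dir at most once instead of relying on failed dirs having been removed, since that removal is not synchronous.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

class LogDirRetrySketch {

    /** Hypothetical stand-in for the step that creates the replica's log directory. */
    interface DirCreator {
        File create(File parentDir, String replicaName) throws IOException;
    }

    /**
     * Tries the initially selected dir first, then the remaining dirs in
     * round-robin order. Stops once every dir has been tried exactly once,
     * which corresponds to the "all dirs failed" case in the description.
     */
    static File createWithRetry(List<File> liveDirs, int startIndex,
                                String replicaName, DirCreator creator) throws IOException {
        if (liveDirs.isEmpty()) {
            throw new IOException("No live log dirs available");
        }
        IOException lastFailure = null;
        for (int i = 0; i < liveDirs.size(); i++) {
            File candidate = liveDirs.get((startIndex + i) % liveDirs.size());
            try {
                return creator.create(candidate, replicaName);
            } catch (IOException e) {
                // Assumption: a real broker would also report the dir as offline
                // here; this sketch only remembers the failure and moves on.
                lastFailure = e;
            }
        }
        throw new IOException("All log dirs failed for " + replicaName, lastFailure);
    }
}
```

With three dirs and a start index of 1, a failure on the second dir makes the sketch try the third, then the first, before giving up.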



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
