Anna Povzner created KAFKA-8526:
-----------------------------------
Summary: Broker may select a failed dir for new replica even in
the presence of other live dirs
Key: KAFKA-8526
URL: https://issues.apache.org/jira/browse/KAFKA-8526
Project: Kafka
Issue Type: Bug
Affects Versions: 2.2.1, 2.1.1, 2.0.1, 1.1.1, 2.3.0
Reporter: Anna Povzner
Suppose a broker is configured with multiple log dirs. One of the log dirs
fails, but there is no load on that dir, so the broker does not know about the
failure yet, _i.e._, the failed dir is still in LogManager#_liveLogDirs.
Suppose a new topic gets created, and the controller chooses the broker with
failed log dir to host one of the replicas. The broker gets LeaderAndIsr
request with isNew flag set. LogManager#getOrCreateLog() selects a log dir for
the new replica from _liveLogDirs, then one of two things can happen:
1) getAbsolutePath can fail, in which case getOrCreateLog will throw an
IOException
2) Creating the directory for the new replica log may fail (_e.g._, if the
directory becomes read-only, so getAbsolutePath worked).
In both cases, the selected dir will be marked offline (which is correct).
However, the LeaderAndIsr request will return an error and the replica will be
marked offline, even though the broker may have other live dirs.
*Proposed solution*: The broker should retry selecting a dir for the new
replica if the initially selected dir threw an IOException when trying to
create a directory for the new replica. We should be able to do that in
LogManager#getOrCreateLog() method, but keep in mind that
logDirFailureChannel.maybeAddOfflineLogDir does not synchronously remove the
dir from _liveLogDirs. So, it makes sense to select initial dir by calling
LogManager#nextLogDir (current implementation), but if we fail to create log on
that dir, one approach is to select next dir from _liveLogDirs in round-robin
fashion (until we get to initial log dir – the case where all dirs failed).
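
A minimal sketch of the round-robin retry described above, written in Java for
illustration only (LogManager itself is Scala). The names LogDirRetrySketch,
DirCreator, FailureReporter and createWithRetry are hypothetical and are not
Kafka APIs. The sketch assumes the caller passes in the list of live dirs and
the index of the dir chosen by nextLogDir; it works on a snapshot of that list
because a failed dir is not removed from _liveLogDirs synchronously.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LogDirRetrySketch {

    /** Hypothetical stand-in for the step that creates the replica's log directory. */
    interface DirCreator {
        Path create(Path logDir, String replicaDirName) throws IOException;
    }

    /** Hypothetical stand-in for reporting a failed dir (e.g. via the log dir failure channel). */
    interface FailureReporter {
        void report(Path logDir, IOException cause);
    }

    /** Default creator: make the replica directory under the chosen log dir. */
    static final DirCreator FILESYSTEM_CREATOR =
        (logDir, name) -> Files.createDirectories(logDir.resolve(name));

    /**
     * Tries the initially selected dir first, then the remaining dirs in
     * round-robin order. Stops once every dir has been tried exactly once,
     * which is the "we got back to the initial dir" case.
     */
    static Path createWithRetry(List<Path> liveLogDirs,
                                int initialIndex,
                                String replicaDirName,
                                DirCreator creator,
                                FailureReporter reporter) throws IOException {
        // Snapshot: failed dirs may be removed from the live list asynchronously.
        List<Path> candidates = List.copyOf(liveLogDirs);
        if (candidates.isEmpty()) {
            throw new IOException("No live log dirs available for " + replicaDirName);
        }
        int n = candidates.size();
        IOException lastFailure = null;
        for (int attempt = 0; attempt < n; attempt++) {
            Path dir = candidates.get(Math.floorMod(initialIndex + attempt, n));
            try {
                return creator.create(dir, replicaDirName);
            } catch (IOException e) {
                // Report the failed dir, then move on to the next candidate.
                reporter.report(dir, e);
                lastFailure = e;
            }
        }
        throw new IOException(
            "All " + n + " log dirs failed for " + replicaDirName, lastFailure);
    }
}
```

With this shape, the LeaderAndIsr handling would only see an error when every
dir has failed, and each failing dir is still reported so it can be marked
offline.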
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)