Hi Igor,

I'm just reading through your KIP and noticed that the new protocol you
created doesn't say anything about ACLs for the new AssignReplicasToDirs
API. Would it make sense to authorize these requests the way other
inter-broker protocol calls are usually authorized, that is, by
requiring ClusterAction on the Cluster resource?
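For context, inter-broker APIs typically gate the request on a ClusterAction ACL for the Cluster resource before doing any work. A minimal sketch of that kind of check (the names and toy ACL store here are illustrative, not Kafka's actual classes; only the error-code strings are real Kafka error names):

```python
# Hypothetical sketch of inter-broker authorization: the request is only
# processed if the caller holds CLUSTER_ACTION on the CLUSTER resource.

CLUSTER_ACTION = "ClusterAction"
CLUSTER = "Cluster"

# Toy ACL store: principal -> set of (operation, resource) grants.
acls = {
    "User:broker-1": {(CLUSTER_ACTION, CLUSTER)},
    "User:client": {("Write", "Topic:foo")},
}

def authorize(principal, operation, resource):
    return (operation, resource) in acls.get(principal, set())

def handle_assign_replicas_to_dirs(principal, request):
    if not authorize(principal, CLUSTER_ACTION, CLUSTER):
        return "CLUSTER_AUTHORIZATION_FAILED"
    # ...apply the directory assignments...
    return "NONE"

print(handle_assign_replicas_to_dirs("User:broker-1", {}))  # NONE
print(handle_assign_replicas_to_dirs("User:client", {}))    # CLUSTER_AUTHORIZATION_FAILED
```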

Thanks,
Viktor

On Tue, Nov 28, 2023 at 4:18 PM Igor Soarez <i...@soarez.me> wrote:

> Hi everyone,
>
> There have been a number of further changes.
>
> I have updated the KIP to reflect them, but for reference,
> I'd also like to update this thread with a summary.
>
> 1. The reserved Uuids and their names for directories have changed.
> The first 100 Uuids are reserved for future use.
>
> 2. During the ZooKeeper to KRaft migration, if a broker still
> configured in ZK mode has any log directory offline, it will shut
> down and refuse to start up. The expectation is that this
> escalation from a log directory's unavailability to the entire
> broker's unavailability will be temporary, limited to the migration
> period, and that the simplification will help develop, test, and
> support this feature.
>
> 3. The representation of replica directories in metadata records is
> no longer tightly coupled with the respective broker IDs. Instead of
> replacing the int[] replicas field in PartitionRecord and
> PartitionChangeRecord, we are introducing a new Uuid[] field
> named directories, which must be kept the same size and in the same
> order as the existing replicas field. See
> https://github.com/apache/kafka/pull/14290 for further details.
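To illustrate the parallel-field layout described above (a sketch, not the actual record classes): the n-th entry of directories gives the log directory Uuid for the n-th broker in replicas.

```python
# Sketch: a PartitionRecord-like structure where `directories` parallels
# `replicas` rather than replacing it.
import uuid

DIR_A, DIR_B, DIR_C = (str(uuid.uuid4()) for _ in range(3))

partition = {
    "replicas": [1, 2, 3],                  # broker IDs, unchanged int[] field
    "directories": [DIR_A, DIR_B, DIR_C],   # new Uuid[] field, same size and order
}

def directory_of(partition, broker_id):
    """Find the log directory hosting this partition on the given broker."""
    idx = partition["replicas"].index(broker_id)
    return partition["directories"][idx]

# The invariant the KIP calls out: both fields stay the same length.
assert len(partition["replicas"]) == len(partition["directories"])
print(directory_of(partition, 2) == DIR_B)  # True
```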
>
> 4. Assignments that relate to failed log directories are no
> longer prioritized. Previously the KIP proposed prioritizing
> assignments that related to failed log directories, aiming to
> synchronize the necessary replica-to-directory mapping on the
> controller before handling the directory failure. We have since
> decided to remove any prioritization of these assignments, as
> delaying the reporting of directory failures for any reason is
> considered detrimental.
>
> 5. Uuids for log directories that failed after startup are always
> included in every broker heartbeat request. Previously the KIP
> proposed sending Uuids for failed directories in the broker
> heartbeat until a successful reply was received. However, because of
> the overload-mode handling of broker heartbeats, a heartbeat request
> may receive a successful response without being fully processed, so
> it is preferable to always send the cumulative list of directory IDs
> that have failed since startup. In the future, this
> list can be trimmed to remove directory IDs that are seen to be
> removed from the broker registration, as the broker catches up with
> metadata.
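The cumulative behaviour, including the future trimming optimization, can be sketched as follows (illustrative names, not the actual broker code):

```python
# Sketch: the broker accumulates failed directory Uuids and includes the
# whole set in every heartbeat, so a heartbeat that gets a successful
# response without being fully processed loses nothing.

class HeartbeatState:
    def __init__(self):
        self.failed_dirs = set()   # cumulative since startup

    def on_dir_failure(self, dir_id):
        self.failed_dirs.add(dir_id)

    def next_heartbeat_request(self):
        # Always send the full cumulative list, not just new failures.
        return {"offline_log_dirs": sorted(self.failed_dirs)}

    def trim(self, dirs_still_pending):
        # Future optimization: once a failed directory is seen to be removed
        # from the broker registration, it no longer needs reporting.
        self.failed_dirs &= set(dirs_still_pending)

hb = HeartbeatState()
hb.on_dir_failure("dir-1")
hb.on_dir_failure("dir-2")
print(hb.next_heartbeat_request())  # {'offline_log_dirs': ['dir-1', 'dir-2']}

# After catching up with metadata, dir-1 is already gone from the
# registration, so only dir-2 still needs to be reported.
hb.trim(["dir-2"])
print(hb.next_heartbeat_request())  # {'offline_log_dirs': ['dir-2']}
```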
>
> 6. The proposal to shut down the broker after
> log.dir.failure.timeout.ms of not being able to communicate that
> some log directory is offline is now more of an open question. It's
> unclear whether this will actually be necessary.
>
> Please share if you have any thoughts.
>
> Best,
>
> --
> Igor
>
>
> On Tue, Oct 10, 2023, at 5:28 AM, Igor Soarez wrote:
> > Hi Colin,
> >
> > Thanks for the renaming suggestions. UNASSIGNED is better than
> > UNKNOWN, MIGRATING is also better than SELECTED, and I don't
> > expect it to be used outside of the migration phase.
> > LOST can also work instead of OFFLINE, but I think there will
> > be other uses for this value outside of the migration, like
> > in the intra-broker replica movement edge cases described in the KIP.
> > I've updated the KIP and also filed a tiny PR with your suggestion,
> > except I'm keeping the description of LOST more broad than just
> > scoped to the migration.
> >
> >   https://github.com/apache/kafka/pull/14517
> >
> >
> > The KIP already proposes that the broker does not unfence
> > until it has confirmed all the assignments have been communicated
> > to the controller. And you're right about the interaction
> > with ReplicaManager, we definitely don't want RPCs coming
> > out of there. My intention is to introduce a new manager, as you
> > suggest, with its own event loop, that batches and prioritizes
> > assignment and dir failure events, called DirectoryEventManager.
> > There's already an open PR, perhaps you could have a look?
> >
> >   KAFKA-15357: Aggregate and propagate assignments and logdir failures
> >   https://github.com/apache/kafka/pull/14369
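As a rough sketch of that shape (purely illustrative; the actual implementation is in the PR above): a manager with its own event loop that batches queued assignment and directory-failure events before sending them out, so no RPCs originate from ReplicaManager.

```python
# Sketch: a DirectoryEventManager-like component that collects events on a
# queue and sends them to the controller as one batch per loop iteration.
from queue import Queue, Empty

class DirectoryEventManagerSketch:
    def __init__(self, send_fn):
        self.queue = Queue()
        self.send_fn = send_fn  # stand-in for the controller-channel send

    def submit(self, event):
        # Called from other threads (e.g. on assignment or dir failure).
        self.queue.put(event)

    def drain_and_send(self):
        # One event-loop iteration: drain everything queued, send as a batch.
        batch = []
        while True:
            try:
                batch.append(self.queue.get_nowait())
            except Empty:
                break
        if batch:
            self.send_fn(batch)
        return batch

batches = []
mgr = DirectoryEventManagerSketch(batches.append)
mgr.submit(("assignment", "topic-0", "dirA"))
mgr.submit(("dir_failure", "dirB"))
mgr.drain_and_send()
print(len(batches))  # 1
```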
> >
> >
> > > With regard to the failure detection "gap" during hybrid mode: the
> > > kraft controller sends a full LeaderAndIsrRequest to the brokers
> > > that are in hybrid mode, right? And there is a per-partition
> > > response as well. Right now, we don't pay attention to the error
> > > codes sent back in the response. But we could. Any replica with an
> > > error could be transitioned from MIGRATING -> LOST, right? That
> > > would close the failure detection gap.
> >
> > Almost. The missing bit is that the controller would also need to
> > watch the /log_dir_event_notification znode, and on any change
> > read the broker ID in the value and send a full LeaderAndIsrRequest
> > to the respective broker. Without this watch, it might be a very long
> > time between a dir failing and the controller sending
> > LeaderAndIsrRequests covering every partition hosted in the failed dir.
> >
> > Watching the znode is a read-only action, and it doesn't seem like
> > a big thing to add this watch plus the error handling.
> > Maybe extending the ZK Controller compatibility functionality in this
> > way isn't such a bad option after all?
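The flow being described could be sketched like this (a toy stand-in for the ZooKeeper client, purely illustrative): the controller watches the notification znode, reads the broker ID from a new event, and sends that broker a full LeaderAndIsrRequest.

```python
# Sketch: controller watches /log_dir_event_notification; on a new child
# it reads the broker ID and sends a full LeaderAndIsrRequest to that
# broker. Per-partition errors in the response would then move replicas
# MIGRATING -> LOST.

class ToyZooKeeper:
    """Minimal stand-in for a ZK client with a children watch."""
    def __init__(self):
        self.children = []
        self.watcher = None

    def watch_children(self, callback):
        self.watcher = callback

    def create_notification(self, broker_id):
        self.children.append({"broker": broker_id})
        if self.watcher:
            self.watcher(self.children)

sent = []

def on_log_dir_event(children):
    broker_id = children[-1]["broker"]
    # Controller reacts by sending a full LeaderAndIsrRequest covering
    # every partition hosted on the affected broker.
    sent.append(("LeaderAndIsrRequest", broker_id))

zk = ToyZooKeeper()
zk.watch_children(on_log_dir_event)
zk.create_notification(101)
print(sent)  # [('LeaderAndIsrRequest', 101)]
```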
> >
> >
> > Best,
> >
> > --
> > Igor
> >
>
