Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

Igor Soarez Mon, 04 Dec 2023 09:24:57 -0800

Hi Viktor,

Thanks for pointing this out.


I forgot to make this clear in the KIP. I'll update it.

ClusterAction on Cluster resource is exactly right,
see `ControllerApis.handleAssignReplicasToDirs`. [1]

--
Igor

[1]: 
https://github.com/apache/kafka/pull/14863/files#diff-91060c918c99d25342f625c146f14425716eda9d8fcfe1126b2c45feff388362R1073

On Mon, Dec 4, 2023, at 12:28 PM, Viktor Somogyi-Vass wrote:
> Hi Igor,
> 
> I'm just reading through your KIP and noticed that the new protocol you
> created doesn't say anything about ACLs of the new AssignReplicasToDirs
> API. Would it make sense to authorize these requests as other inter-broker
> protocol calls are usually authorized, that is ClusterAction on Cluster
> resource?
> 
> Thanks,
> Viktor
> 
> On Tue, Nov 28, 2023 at 4:18 PM Igor Soarez <[email protected]> wrote:
> 
> > Hi everyone,
> >
> > There have been a number of further changes.
> >
> > I have updated the KIP to reflect them, but for reference,
> > I'd also like to update this thread with a summary.
> >
> > 1. The reserved Uuids and their names for directories have changed.
> > The first 100 Uuids are reserved for future use.
> >
> > 2. During the ZooKeeper to KRaft migration, if a broker still
> > configured in ZK mode has any log directory offline, it will
> > shutdown and refuse to startup. The expectation is that this
> > escalation from a log directory's unavailability to the entire
> > broker's unavailability will be temporary, limited to the migration
> > period. And that the simplification will help develop, test and
> > support this feature.
> >
> > 3. The representation of replica directories in metadata records is
> > no longer tightly coupled with the respective broker IDs. Instead of
> > replacing the int[] replicas field in PartitionRecord and
> > PartitionChangeRecord, we are instead introducing a new field Uuid[]
> > named directories, that should be kept in the same size and order as
> > the existing replicas field. See
> > https://github.com/apache/kafka/pull/14290 for further details.
> >
> > 4. Assignments that are respective to failed log directories are no
> > longer prioritized. Previously the KIP proposed prioritizing
> > assignments that related to failed log directories, aiming to
> > synchronize the necessary replica to directory mapping on the
> > controller before handling the directory failure. Recently, we have
> > decided to remove any prioritization of these assignments, as
> > delaying the reporting of directory failures is considered
> > detrimental for any reason
> >
> > 5. Uuids for log directories that failed after startup are always
> > included in every broker heartbeat request. Previously the KIP
> > proposed sending Uuids for failed directories in the broker
> > heartbeat until a successful reply is received. However, due to the
> > overload mode handling of broker heartbeats, because broker
> > heartbeat requests may receive a successful response without being
> > fully processed, it is preferable to always send the cumulative list
> > of directory IDs that have failed since startup. In the future, this
> > list can be trimmed to remove directory IDs that are seen to be
> > removed from the broker registration, as the broker catches up with
> > metadata.
> >
> > 6. The proposal to shutdown the broker log.dir.failure.timeout.ms
> > after not being able to communicate that some log directory is
> > offline is now more of an open question. It's unclear if this will
> > actually be necessary.
> >
> > Please share if you have any thoughts.
> >
> > Best,
> >
> > --
> > Igor
> >
> >
> > On Tue, Oct 10, 2023, at 5:28 AM, Igor Soarez wrote:
> > > Hi Colin,
> > >
> > > Thanks for the renaming suggestions. UNASSIGNED is better then
> > > UNKNOWN, MIGRATING is also better than SELECTED and I don't
> > > expect it to be used outside of the migration phase.
> > > LOST can also work instead of OFFLINE, but I think there will
> > > be other uses for this value outside of the migration, like
> > > in the intra-broker replica movement edge cases described in the KIP.
> > > I've updated the KIP and also filed a tiny PR with your suggestion,
> > > except I'm keeping the description of LOST more broad than just
> > > scoped to the migration.
> > >
> > >   https://github.com/apache/kafka/pull/14517
> > >
> > >
> > > The KIP already proposes that the broker does not want to unfence
> > > until it has confirmed all the assignments are communicated
> > > with the controller. And you're right about the interaction
> > > with ReplicaManager, we definitely don't want RPCs coming
> > > out of there. My intention is to introduce a new manager, as you
> > > suggest, with its own event loop, that batches and prioritizes
> > > assignment and dir failure events, called DirectoryEventManager.
> > > There's already an open PR, perhaps you could have a look?
> > >
> > >   KAFKA-15357: Aggregate and propagate assignments and logdir failures
> > >   https://github.com/apache/kafka/pull/14369
> > >
> > >
> > > > With regard to the failure detection "gap" during hybrid mode: the
> > > > kraft controller sends a full LeaderAndIsrRequest to the brokers
> > > > that are in hybrid mode, right? And there is a per-partition
> > > > response as well. Right now, we don't pay attention to the error
> > > > codes sent back in the response. But we could. Any replica with an
> > > > error could be transitioned from MIGRATING -> LOST, right? That
> > > > would close the failure detection gap.
> > >
> > > Almost. The missing bit is that the controller would also need to
> > > watch the /log_dir_event_notification znode, and on any change
> > > read the broker ID in the value and send a full LeaderAndIsrRequest
> > > to the respective broker. Without this watch, it might be a very long
> > > time between a dir failing and the controller sending
> > > LeaderAndIsrRequests covering every partition hosted in the failed dir.
> > >
> > > Watching the znode is a read-only action, and it doesn't seem like
> > > a big thing to add this watch plus the error handling.
> > > Maybe extending the ZK Controller compatibility functionality in this
> > > way isn't such a bad option after all?
> > >
> > >
> > > Best,
> > >
> > > --
> > > Igor
> > >
> >
>

Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to