Yes, this should be CLUSTER_ACTION on CLUSTER, to be consistent with our other 
internal RPCs.
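
For reference, a rough sketch of how the handler could gate the
request, mirroring our other controller-side internal RPCs (the
names below are illustrative, not the actual implementation):

    // Hypothetical handler guard: require CLUSTER_ACTION on the
    // CLUSTER resource, as with other internal RPCs such as
    // BrokerHeartbeat.
    void handleAssignReplicasToDirs(RequestContext context,
                                    AssignReplicasToDirsRequest request) {
        if (!authorizer.authorize(context, CLUSTER_ACTION, CLUSTER)) {
            sendErrorResponse(request, Errors.CLUSTER_AUTHORIZATION_FAILED);
            return;
        }
        // ... apply the requested replica-to-directory assignments ...
    }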

best,
Colin


On Mon, Dec 4, 2023, at 04:28, Viktor Somogyi-Vass wrote:
> Hi Igor,
>
> I'm just reading through your KIP and noticed that the new protocol you
> created doesn't say anything about ACLs of the new AssignReplicasToDirs
> API. Would it make sense to authorize these requests the way other
> inter-broker protocol calls are usually authorized, that is,
> ClusterAction on the Cluster resource?
>
> Thanks,
> Viktor
>
> On Tue, Nov 28, 2023 at 4:18 PM Igor Soarez <i...@soarez.me> wrote:
>
>> Hi everyone,
>>
>> There have been a number of further changes.
>>
>> I have updated the KIP to reflect them, but for reference,
>> I'd also like to update this thread with a summary.
>>
>> 1. The reserved Uuids and their names for directories have changed.
>> The first 100 Uuids are reserved for future use.
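>>
>> For illustration, the reservation check could look like this
>> (a sketch; the named values follow the KIP, the helper itself is
>> only an assumption):
>>
>>   // The first 100 Uuids (most significant bits zero, least
>>   // significant bits 0 through 99) are reserved, including:
>>   static final Uuid UNASSIGNED = new Uuid(0L, 0L);
>>   static final Uuid LOST       = new Uuid(0L, 1L);
>>   static final Uuid MIGRATING  = new Uuid(0L, 2L);
>>
>>   static boolean isReserved(Uuid id) {
>>       return id.getMostSignificantBits() == 0
>>           && id.getLeastSignificantBits() < 100;
>>   }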
>>
>> 2. During the ZooKeeper to KRaft migration, if a broker still
>> configured in ZK mode has any log directory offline, it will shut
>> down and refuse to start up. The expectation is that this
>> escalation from a log directory's unavailability to the entire
>> broker's unavailability will be temporary, limited to the migration
>> period, and that the simplification will help develop, test and
>> support this feature.
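>>
>> As a sketch of the intended startup check (illustrative only,
>> not the actual code):
>>
>>   // A ZK-mode broker refuses to run with a degraded set of log
>>   // directories while the migration is in progress.
>>   if (migrationEnabled && !offlineLogDirs.isEmpty()) {
>>       log.error("Log directories {} are offline, shutting down during "
>>           + "the ZooKeeper to KRaft migration.", offlineLogDirs);
>>       Exit.exit(1);
>>   }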
>>
>> 3. The representation of replica directories in metadata records is
>> no longer tightly coupled with the respective broker IDs. Instead of
>> replacing the int[] replicas field in PartitionRecord and
>> PartitionChangeRecord, we are introducing a new Uuid[] field named
>> directories, which is kept the same size and in the same order as
>> the existing replicas field. See
>> https://github.com/apache/kafka/pull/14290 for further details.
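>>
>> With the two parallel arrays, the directory of a given replica is
>> looked up by position, for example (a sketch, assuming the field
>> shape described above):
>>
>>   // replicas:    [1001, 1002, 1003]
>>   // directories: [dirA, dirB, dirC]   <- same size and order
>>   static Uuid directoryOf(int brokerId, int[] replicas,
>>                           Uuid[] directories) {
>>       for (int i = 0; i < replicas.length; i++) {
>>           if (replicas[i] == brokerId) return directories[i];
>>       }
>>       return UNASSIGNED;  // reserved value, see item 1
>>   }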
>>
>> 4. Assignments that relate to failed log directories are no
>> longer prioritized. Previously the KIP proposed prioritizing
>> assignments that related to failed log directories, aiming to
>> synchronize the necessary replica to directory mapping on the
>> controller before handling the directory failure. Recently, we have
>> decided to remove any prioritization of these assignments, as
>> delaying the reporting of directory failures for any reason is
>> considered detrimental.
>>
>> 5. Uuids for log directories that failed after startup are always
>> included in every broker heartbeat request. Previously the KIP
>> proposed sending Uuids for failed directories in the broker
>> heartbeat only until a successful reply was received. However,
>> because of the overload mode handling of broker heartbeats, a
>> heartbeat request may receive a successful response without being
>> fully processed, so it is preferable to always send the cumulative
>> list of directory IDs that have failed since startup. In the
>> future, this list can be trimmed to remove directory IDs that are
>> seen to be removed from the broker registration, as the broker
>> catches up with metadata.
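>>
>> A sketch of the intended broker-side behaviour (the names are
>> illustrative, not the actual implementation):
>>
>>   // Every directory that has failed since startup is remembered,
>>   // and the full set is repeated in each heartbeat, since a
>>   // heartbeat may be answered successfully without being fully
>>   // processed.
>>   private final Set<Uuid> failedSinceStartup =
>>       ConcurrentHashMap.newKeySet();
>>
>>   void onDirectoryFailure(Uuid dir) {
>>       failedSinceStartup.add(dir);
>>   }
>>
>>   List<Uuid> offlineLogDirsForHeartbeat() {
>>       return new ArrayList<>(failedSinceStartup);
>>   }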
>>
>> 6. The proposal to shut down the broker after
>> log.dir.failure.timeout.ms of being unable to communicate that some
>> log directory is offline is now more of an open question. It's
>> unclear if this will actually be necessary.
>>
>> Please share if you have any thoughts.
>>
>> Best,
>>
>> --
>> Igor
>>
>>
>> On Tue, Oct 10, 2023, at 5:28 AM, Igor Soarez wrote:
>> > Hi Colin,
>> >
>> > Thanks for the renaming suggestions. UNASSIGNED is better than
>> > UNKNOWN, MIGRATING is also better than SELECTED, and I don't
>> > expect it to be used outside of the migration phase.
>> > LOST can also work instead of OFFLINE, but I think there will
>> > be other uses for this value outside of the migration, like
>> > in the intra-broker replica movement edge cases described in the KIP.
>> > I've updated the KIP and also filed a tiny PR with your suggestion,
>> > except I'm keeping the description of LOST broader than just
>> > scoped to the migration.
>> >
>> >   https://github.com/apache/kafka/pull/14517
>> >
>> >
>> > The KIP already proposes that the broker should not unfence
>> > until it has confirmed that all the assignments have been
>> > communicated to the controller. And you're right about the
>> > interaction with ReplicaManager: we definitely don't want RPCs
>> > coming out of there. My intention is to introduce a new manager,
>> > as you suggest, called DirectoryEventManager, with its own event
>> > loop that batches and prioritizes assignment and directory
>> > failure events. There's already an open PR; perhaps you could
>> > have a look?
>> >
>> >   KAFKA-15357: Aggregate and propagate assignments and logdir failures
>> >   https://github.com/apache/kafka/pull/14369
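>> >
>> > As a rough illustration of the shape of that manager (the PR is
>> > authoritative; the names and types here are indicative only):
>> >
>> >   // A single event loop batches assignment and directory
>> >   // failure events and sends them to the controller from one
>> >   // place, never directly from ReplicaManager.
>> >   class DirectoryEventManager implements Runnable {
>> >       private final BlockingQueue<DirectoryEvent> queue =
>> >           new LinkedBlockingQueue<>();
>> >       private volatile boolean running = true;
>> >
>> >       void enqueue(DirectoryEvent event) { queue.add(event); }
>> >
>> >       @Override public void run() {
>> >           List<DirectoryEvent> batch = new ArrayList<>();
>> >           while (running) {
>> >               try {
>> >                   batch.add(queue.take());
>> >                   queue.drainTo(batch);     // batch whatever is queued
>> >                   sendToController(batch);  // hypothetical RPC call
>> >                   batch.clear();
>> >               } catch (InterruptedException e) {
>> >                   Thread.currentThread().interrupt();
>> >                   return;
>> >               }
>> >           }
>> >       }
>> >   }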
>> >
>> >
>> > > With regard to the failure detection "gap" during hybrid mode: the
>> > > kraft controller sends a full LeaderAndIsrRequest to the brokers
>> > > that are in hybrid mode, right? And there is a per-partition
>> > > response as well. Right now, we don't pay attention to the error
>> > > codes sent back in the response. But we could. Any replica with an
>> > > error could be transitioned from MIGRATING -> LOST, right? That
>> > > would close the failure detection gap.
>> >
>> > Almost. The missing bit is that the controller would also need to
>> > watch the /log_dir_event_notification znode, and on any change
>> > read the broker ID in the value and send a full LeaderAndIsrRequest
>> > to the respective broker. Without this watch, it might be a very long
>> > time between a dir failing and the controller sending
>> > LeaderAndIsrRequests covering every partition hosted in the failed dir.
>> >
>> > Watching the znode is a read-only action, and it doesn't seem like
>> > a big thing to add this watch plus the error handling.
>> > Maybe extending the ZK Controller compatibility functionality in this
>> > way isn't such a bad option after all?
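>> >
>> > For example (illustrative only; the real handler types live in
>> > the ZK controller code):
>> >
>> >   // On any change to the notification znode, read the affected
>> >   // broker IDs and send each a full LeaderAndIsrRequest, so the
>> >   // per-partition error codes can drive MIGRATING -> LOST.
>> >   zkClient.registerZNodeChildChangeHandler(new ZNodeChildChangeHandler() {
>> >       public String path() {
>> >           return "/log_dir_event_notification";
>> >       }
>> >       public void handleChildChange() {
>> >           for (int brokerId : readNotifiedBrokerIds()) {  // hypothetical
>> >               sendFullLeaderAndIsrRequest(brokerId);      // hypothetical
>> >           }
>> >       }
>> >   });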
>> >
>> >
>> > Best,
>> >
>> > --
>> > Igor
>> >
>>
