Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Jun Rao Wed, 16 Nov 2022 17:18:23 -0800

Hi, Igor,

Thanks for the KIP. A few comments below.


10. In the example section, there were two /mnt/d1/meta.properties. One of
them should be d2.

11. "When replicas are moved between directories, using the existing
AlterReplicaLogDirs RPC, when the future replica has caught up with the
current replica, the broker will send the AssignReplicasToDirectories RPC
to the controller changing the association to the future log directory.
When the cluster metadata update is seen by the broker, the broker replaces
the current replica with the future replica."
Currently, when the broker switches the future replica to the current
replica, it briefly blocks new appends to the replica. Blocking while
waiting for AssignReplicasToDirectories request to complete on the
controller can cause the blocking to be longer. An alternative design is to
have the broker forward the AlterReplicaLogDirs request to the controller,
which then writes new partitionChanageRecord with the new log dir. On
receiving this record, the broker initiates the moving of the replicas to
the new dir and completes the switch from future to current replica as it
does now.

12. For BrokerRegistrationRequest and BrokerHeartbeatRequest, I guess we
don't need to include the UUID for metadata.log.dir?

13. When enabling JBOD for an existing KRaft cluster, adding/removing a log
dir to an existing JBOD KRaft cluster, or upgrading a JBOD ZK cluster to
KRaft, could we generate the log dir UUID in the meta.properties file
automatically instead of requiring running the storage format tool manually?

14. "If there are any missing log directories this means those have been
removed from the broker’s configuration, so the controller will reassign
all replicas currently assigned to the missing log directories to the
remaining online log directories in the same broker." Should the controller
just set the log directory to Uuid.ZERO and let the broker assign it?

15. Broker registration: I guess the controller needs to handle offline
dirs now?

16. Upgrade: Since we are changing the metadata records, we need to make
sure all brokers/controllers are upgraded before writing the new records.
This can be achieved by introducing a new version of the metadata.version
feature flag.

Thanks,

Jun

On Wed, Nov 2, 2022 at 10:32 AM Igor Soarez <soa...@apple.com.invalid>
wrote:

> Hi Tom,
>
> Thank you for having a look.
>
> 0. Thanks for pointing this out. I think it's worth having a new
> sub-command
> in `kafka-storage.sh` — `update-directories` — more suitable for situations
> where `metadata.properties` already exists. I've updated the section with
> further detail.
>
> When upgrading from a non-JBOD KRaft cluster all the configured log
> directories will already be "formatted". We could require that
> `update-directories` is run as part of the migration, but I think it seems
> easier to have the broker simply update the two properties `directory.id`
> and `directory.ids` in `metadata.properties` as necessary on startup.
> I don't have a particularly strong opinion either way, it just seems
> to create less friction for operators during the upgrade at very little
> cost.
> This is described under "Public Interfaces" > `meta.properties`.
> What do you think?
>
> 1. That makes sense. I've tried to make the example clearer by adding
> further detail as you suggest. Please let me know if you see further room
> for
> improvement.
>
> 2. Indeed. I've updated the section to reflect this.
>
> 3. The motivation was to avoid prolonging the time window between a
> partition being assigned to a broker in the cluster metadata and the broker
> actually being able to take leadership for it. But revisiting this question
> now, it feels it's a better choice to keep things simple and leave
> this possible optimisation for future work.
>
> I've updated the KIP to reflect that the controller does not assign
> partitions to log directories, instead using Uuid.ZERO and letting the
> broker communicate back the chosen log directory via the
> ASSIGN_REPLICAS_TO_DIRECTORIES RPC.
>
> 4. This perhaps another unnecessary optimisation. The intent was to
> allow use-cases that don't use JBOD to bypass this initialisation cost.
> We can keep it simple and remove this unnecessary optimisation.
> I've updated the KIP accordingly.
>
> 5. This is a good idea. I've added a new "Future work" section with
> the pieces we know are missing.
>
> Regarding using the BrokerHeartbeat RPC: I think if we avoid sending the
> full list of log directories in every heartbeat request as you describe
> makes this option more attractive. I've gotten rid of the
> LOG_DIRECTORIES_OFFLINE RPC and added details as to how the BrokerHeartbeat
> RPC request will include a new `LogDirsOfflined` field.
>
> I still don't think it makes sense to add a `LogDirsOnlined` field,
> as it is not possible for a log directory to come back online without
> a broker restart — which would result in a new `BrokerRegistration`
> request.
>
> Thanks,
>
> --
> Igor
>
>
> > On 1 Nov 2022, at 17:59, Tom Bentley <tbent...@redhat.com> wrote:
> >
> > Hi Igor,
> >
> > Thanks for the KIP, I've finally managed to take an initial look.
> >
> > 0. You mention the command line tools (which one?) in the public
> interfaces
> > section, but don't spell out how it changes -- what options are added.
> > Reading the proposed changes it suggests that there are no changes to the
> > supported options and that it is done automatically during initial
> > formatting based on the broker config. But what about the case where
> we're
> > upgrading an existing non-JBOD KRaft cluster where the meta.properties
> > already exists? Do we just run `./bin/kafka-storage.sh format -c
> > /tmp/server.properties` again? How would an operator remove an existing
> log
> > dir?
> >
> > 1. In the example for the storage format command I think it's worth
> > explaining it in a little more detail. i.e. that the `directory.ids` has
> > three items: two for the configured log.dirs and one for the configured
> > metadata.log.dir.
> >
> > 2. In 'Broker lifecycle management' you presumably want to check that the
> > directory ids for each log dir are actually unique.
> >
> > 3. I don't understand the motivation for having the controller decide the
> > log dir for new replicas. I think there are two cases we want to support:
> > a) Where the user specifies the log dir (likely based on some external
> > information). This is out of scope for this KIP.
> > b) If the user didn't specify, isn't the broker in a better position to
> > decide (for example, based on free storage), and the communicate back to
> > the controller the log dir that was chosen using the
> > ASSIGN_REPLICAS_TO_DIRECTORIES RPC?
> >
> > 4. Broker registration. I don't understand the intent behind the
> > optimization for the single log dir case (last bullet). "Brokers whose
> > registration indicate that multiple log directories are configured remain
> > FENCED until all log directory assignments for that broker are learnt by
> > the active controller and persisted into metadata." is something you've
> > committed to anyway.
> >
> > 5. AFAICS there's no way for the user to determine via the Kafka protocol
> > which directory id corresponds to which log dir path. I.e. you've not
> > changed any of the Admin APIs. Perhaps adding a Future Work section to
> > spell out the pieces we know are missing would be a good idea?
> >
> > I would second Jason's idea for piggybacking on- and off-line state
> changes
> > on the BrokerHeartbeat RPC. I suspect the complexity of making this
> > incrementally isn't so great, given that both broker and controller need
> to
> > keep track of the on- and off-line directories anyway. i.e. We could add
> > LogDirsOfflined and LogDirsOnlined fields to both request and response
> and
> > have the broker keep including a log dir in requests until acknowledged
> in
> > the response, but otherwise they'd be empty.
> >
> > Thanks again,
> >
> > Tom
> >
> > On Tue, 25 Oct 2022 at 19:59, Igor Soarez <soa...@apple.com.invalid>
> wrote:
> >
> >> Hello,
> >>
> >> There’s now a proposal to address ZK to KRaft migration — KIP-866 — but
> >> currently it doesn't address JBOD so I've decided to update this
> proposal
> >> to address that migration scenario.
> >>
> >> So given that:
> >>
> >> - When migrating from a ZK cluster running JBOD to KRaft, brokers
> >> registering in KRaft mode will need to be able to register all
> configured
> >> log directories.
> >> - As part of the migration, the mapping of partition to log directory
> will
> >> have to be learnt by the active controller and persisted into the
> cluster
> >> metadata.
> >> - It isn’t safe to allow for leadership from replicas without this
> >> mapping, as if the hosting log directory fails there will be no failover
> >> mechanism.
> >>
> >> I have updated the proposal to reflect that:
> >>
> >> - Multiple log directories may be indicated in the first broker
> >> registration referencing log directory UUIDs. We no longer require a
> single
> >> log directory to start with.
> >> - The controller must never assign leadership to a replica in a broker
> >> registered with multiple log directories, unless the partition to log
> >> directory assignment is already in the cluster metadata.
> >> - The broker should not be unfenced until all of its partition to log
> >> directory mapping is persisted into cluster metadata
> >>
> >> I've also added details as to how the ZK to KRaft migration can work in
> a
> >> cluster that is already operating with JBOD.
> >>
> >> Please have a look and share your thoughts.
> >>
> >> --
> >> Igor
> >>
> >>
> >>
>
>

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to