Hello Kafka Community,

I’d like to consult the community on best practices for handling and
preventing what’s sometimes called a "half-dead" Kafka broker scenario in a
self-hosted OSS Kafka environment.

Specifically, I’m referring to situations where a broker appears healthy
from the cluster's perspective (i.e., it is still in the ISR for its
partitions) but can no longer properly serve client traffic, causing
disruption to producers or consumers.

I understand that some managed services like AWS MSK implement additional
mechanisms (e.g., their "healing" state) to detect and handle such brokers,
but I’d like to know how self-hosted OSS Kafka operators typically manage
this risk.

Some key questions:

   - Are there recommended monitoring patterns to detect a "half-dead" broker
     more proactively?
   - Are there any community-recommended configurations, scripts, or tools to
     automatically remove or restart such brokers?
   - Any lessons learned or operational best practices from other self-hosted
     users?

I would greatly appreciate any guidance. Thank you in advance!

Best regards,

Saliha
