Hi all,

Just following up on this thread.
I would appreciate any insights or guidance the community can share on this topic whenever you have a chance.

Thank you very much for your time and support.

Best regards,
Saliha

On Mon, 23 Jun 2025 at 12:19 PM, Mohamed Saliha A <a.mohamedsalih...@gmail.com> wrote:

> Hello Kafka Community,
>
> I’d like to consult the community on best practices for handling and
> preventing what’s sometimes called a "half-dead" Kafka broker scenario in a
> self-hosted OSS Kafka environment.
>
> Specifically, I’m referring to situations where a broker appears healthy
> from a cluster perspective (i.e., still part of the ISR) but is no longer
> able to properly serve traffic, causing disruption to producers or
> consumers.
>
> I understand that some managed services like AWS MSK implement additional
> mechanisms (e.g., their "healing" state) to detect and handle such brokers,
> but I’d like to know how self-hosted OSS Kafka operators typically manage
> this risk.
>
> Some key questions:
>
> - Are there recommended monitoring patterns to detect a "half-dead"
>   broker more proactively?
> - Are there any community-recommended configurations, scripts, or tools
>   to automatically remove or restart such brokers?
> - Any lessons learned or operational best practices from other
>   self-hosted users?
>
> I would greatly appreciate any guidance. Thank you in advance!
>
> Best regards,
>
> Saliha