Hello Kafka Community,

I’d like to ask the community about best practices for handling and preventing what’s sometimes called a "half-dead" Kafka broker scenario in a self-hosted OSS Kafka environment.
Specifically, I’m referring to situations where a broker appears healthy from the cluster’s perspective (i.e., it is still part of the ISR) but can no longer serve traffic properly, which disrupts producers or consumers. I understand that some managed services like AWS MSK implement additional mechanisms (e.g., their "healing" state) to detect and handle such brokers, but I’d like to know how operators of self-hosted OSS Kafka typically manage this risk.

Some key questions:

- Are there recommended monitoring patterns to detect a "half-dead" broker more proactively?
- Are there any community-recommended configurations, scripts, or tools to automatically remove or restart such brokers?
- Any lessons learned or operational best practices from other self-hosted users?

I would greatly appreciate any guidance. Thank you in advance!

Best regards,
Saliha