Lukasz Mierzwa created KAFKA-6436:
-------------------------------------

             Summary: Provide a metric indicating broker cluster membership state
                 Key: KAFKA-6436
                 URL: https://issues.apache.org/jira/browse/KAFKA-6436
             Project: Kafka
          Issue Type: Wish
          Components: metrics
            Reporter: Lukasz Mierzwa
            Priority: Minor

When deploying Kafka config changes, each instance needs to be restarted (since there's no graceful reload), and that requires coordination to keep all partitions online. Part of my automation waits after restarting each instance until the restarted broker is back in sync on all partitions. To do that I query for:

{noformat}
kafka.server:name=BrokerState,type=KafkaServer to be 3 (broker is up & running)
kafka.server:clientId=Replica,name=MaxLag,type=ReplicaFetcherManager = 0 (there's no lag)
{noformat}

I've noticed that there's a race for the MaxLag metric: when the replica fetcher threads are starting, this metric is initialized with a value of 0, and then (I assume) once all threads connect to the leaders it's populated with the "correct" MaxLag value computed from all those threads. This means there's a window where I can query for those metrics and get the expected BrokerState=3 and MaxLag=0, which I would interpret as "done restarting this instance", but a few seconds later MaxLag might jump to a huge value.

Right now my workaround is to require multiple consecutive queries to return the expected metric values, which seems to protect me from hitting that window.

It would be nice if there was a metric like "ClusterState", initialized as 0, that would be set to 1 only once all replica fetcher threads have started, finished reconnecting to the leaders, and the proper MaxLag is set (or there are no replicas on the given broker). Alternatively, MaxLag could just be initialized with -1 and set to 0 later if that's the actual max lag computed after getting replication offsets from the leaders (if that would work).

If there was a "ClusterState" metric, it could also be used to signal that a broker has lost connectivity with the rest of the cluster. I don't think there is such a metric right now (is there?).
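
For reference, the workaround described above looks roughly like the sketch below. This is only an illustration, not the actual automation: `read_metrics` is a hypothetical callable that returns the current BrokerState and MaxLag values (for example, backed by whatever JMX bridge is in use), and the number of required samples, the polling interval and the timeout are arbitrary assumptions.

```python
import time
from typing import Callable, Dict

# Expected values from the report: BrokerState 3 = up & running, MaxLag 0 = no lag.
EXPECTED = {"BrokerState": 3, "MaxLag": 0}


def wait_until_stable(
    read_metrics: Callable[[], Dict[str, int]],  # hypothetical metric reader
    required_consecutive: int = 5,               # assumed, tune to taste
    interval_s: float = 2.0,                     # assumed polling interval
    timeout_s: float = 600.0,                    # assumed overall timeout
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once read_metrics() has matched EXPECTED on
    `required_consecutive` polls in a row, False if the timeout expires first.
    """
    streak = 0
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        metrics = read_metrics()
        if all(metrics.get(name) == value for name, value in EXPECTED.items()):
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Any bad sample (e.g. MaxLag jumping after the fetchers
            # connect) restarts the count from zero.
            streak = 0
        sleep(interval_s)
    return False
```

A single matching sample is never trusted on its own, so a MaxLag=0 reading taken inside the startup window is discarded as soon as the real lag shows up. This only narrows the window; it does not close it, which is why a dedicated metric would be preferable.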