Lukasz Mierzwa created KAFKA-6436:
-------------------------------------

             Summary: Provide a metric indicating broker cluster membership state
                 Key: KAFKA-6436
                 URL: https://issues.apache.org/jira/browse/KAFKA-6436
             Project: Kafka
          Issue Type: Wish
          Components: metrics
            Reporter: Lukasz Mierzwa
            Priority: Minor


When deploying Kafka config changes, each instance needs to be restarted (since 
there's no graceful reload), and that requires coordination to keep all 
partitions online. Part of my automation waits after restarting each instance 
until the restarted broker is back in sync on all partitions. To do that, I 
query for:

{noformat}
kafka.server:name=BrokerState,type=KafkaServer to be 3 (broker is up & running)
kafka.server:clientId=Replica,name=MaxLag,type=ReplicaFetcherManager = 0 (there's no lag)
{noformat}

I've noticed that there's a race for the MaxLag metric: when the replica 
fetcher threads are starting, this metric is initialized to 0, and then (I 
assume) once all threads have connected to the leaders it is populated with 
the "correct" MaxLag value computed from all those threads. This means there's 
a window where I can query those metrics and get the expected BrokerState=3 
and MaxLag=0, which I would interpret as "done restarting this instance", but 
a few seconds later MaxLag might jump to a huge value.
Right now my workaround is to require multiple queries to return expected 
metric values, which seems to protect me from hitting that window.
It would be nice if there were a metric like "ClusterState", initialized to 0, 
that would be set to 1 only once all replica fetcher threads have started, 
finished reconnecting to the leaders, and the proper MaxLag has been set (or 
there are no replicas on the given broker).
Alternatively MaxLag could be just initialized with -1 and set to 0 later if 
that's the actual max lag computed after getting replication offsets from 
leaders (if that would work).
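
For illustration, the multiple-query workaround mentioned above can be sketched 
roughly as follows. This is only a minimal sketch, not my actual automation: 
read_metrics is a hypothetical callable that returns the current (BrokerState, 
MaxLag) pair from whatever JMX exporter is in use, and the function name and 
the default count, interval and timeout are assumptions picked for the example.

```python
import time
from typing import Callable, Tuple

RUNNING_AS_BROKER = 3  # BrokerState value meaning "broker is up & running"


def wait_until_in_sync(
    read_metrics: Callable[[], Tuple[int, float]],  # hypothetical metric reader
    required_consecutive: int = 5,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once BrokerState == 3 and MaxLag == 0 have been observed
    `required_consecutive` times in a row; return False on timeout."""
    deadline = clock() + timeout_s
    streak = 0
    while True:
        broker_state, max_lag = read_metrics()
        if broker_state == RUNNING_AS_BROKER and max_lag == 0:
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Any bad sample (e.g. MaxLag jumping after the initial 0)
            # resets the streak, so a transient MaxLag=0 is not trusted.
            streak = 0
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Requiring a streak rather than a single good sample only narrows the window; 
it does not close it, which is why a dedicated metric would still be better.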

If there were a "ClusterState" metric, it could also be used to signal that a 
broker has lost connectivity with the rest of the cluster. I don't think there 
is such a metric right now (is there?).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
