Good evening. I have read through the monitoring section and tried to map each item to the corresponding JMX attribute. I would appreciate it if you could answer the few questions below.
Thanks so much in advance,
Vadim

What is the JMX metric "kafka.controller":type="KafkaController",name="ActiveControllerCount" for?

The rate of data in and out of the cluster and the number of messages written.
Which JMX attributes should I monitor? Since I should alert on this, which changes are acceptable and which are not?

The log flush rate and the time taken to flush the log: "kafka.log":type="LogFlushStats",name="LogFlushRateAndTimeMs".
Which attribute should I be watching, and how much deviation is acceptable before I should alert?

The number of partitions that have replicas that are down or have fallen behind and are under-replicated.
Is "kafka.cluster":type="Partition",name="buypets-0-UnderReplicated" the JMX metric that will show replicas that are down?

Unclean leader elections. This shouldn't happen: "kafka.controller":type="ControllerStats",name="UncleanLeaderElectionsPerSec".
I assume this should always be 0, and if it is not 0 we have a problem.

Number of partitions each node is the leader for.
Which JMX attribute(s) monitor this?

Leader elections: we track each time this happens and how long it took: "kafka.controller":type="ControllerStats",name="LeaderElectionRateAndTimeMs".

Any changes to the ISR.
Which JMX attribute should I monitor for this? Should I alert on this? Which changes are reasonable and which are not?

The number of produce requests waiting on replication to report back.
Which JMX attribute should I monitor for this? Should I alert on this? Which changes are reasonable and which are not?

The number of fetch requests waiting on data to arrive.
Which JMX attribute should I monitor for this? Should I alert on this? Which changes are reasonable and which are not?