[ https://issues.apache.org/jira/browse/AMBARI-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hari Sekhon updated AMBARI-24719:
---------------------------------
    Summary: Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions  (was: Kafka Rolling Restart causes outage(s))

> Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions
> ------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-24719
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24719
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-server
>    Affects Versions: 2.6.2
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> Ambari causes Kafka topic partition outages during rolling restarts because
> it only does a simplistic 2-minute wait between brokers and doesn't check the
> state of partition replicas before taking another broker down.
> On busy Kafka clusters with lots of topics / partitions / data it might take a
> while before in-sync replicas recover.
> Ambari should therefore check for any under-replicated partitions and wait as
> long as it takes for them to recover before proceeding to the next broker.
> There is, however, an issue in doing so: if there is a topic partition
> with a replica that no longer exists (e.g. ambari_kafka_service_check), it
> will never recover, so there needs to be some thoughtful handling around that.
> This might be solved by AMBARI-24203, but I'm not sure whether it is tied in
> properly to the rolling restarts, what the timeout policy or time interval is
> for it, or whether it takes the above paragraph into account.
> This could also have been easily offset if Ambari had proper extensible
> checking as raised in AMBARI-24381.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
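
For illustration only, the check requested in the description could be sketched roughly as below. This is not Ambari code and not the approach taken in AMBARI-24203. The names `wait_for_replicas`, `get_under_replicated`, `ignored_topics`, `poll_seconds` and `max_wait_seconds` are all hypothetical, and the probe that lists under-replicated partitions is assumed to be supplied by the caller. The sketch polls until no under-replicated partitions remain, skips topics known never to recover, and gives up after a cap so a dead replica cannot stall the restart forever.

```python
import time
from typing import AbstractSet, Callable, Iterable, Set, Tuple

Partition = Tuple[str, int]  # (topic name, partition id)


def wait_for_replicas(
    get_under_replicated: Callable[[], Iterable[Partition]],
    ignored_topics: AbstractSet[str] = frozenset(),
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Set[Partition]:
    """Poll until no under-replicated partitions remain or the cap is reached.

    get_under_replicated is a caller-supplied (hypothetical) probe.
    Returns the partitions still under-replicated; an empty set means
    it is safe to restart the next broker.
    """
    deadline = clock() + max_wait_seconds
    while True:
        # Drop topics that are known never to recover, such as a service
        # check topic whose replica points at a broker that no longer exists.
        pending = {
            (topic, partition)
            for topic, partition in get_under_replicated()
            if topic not in ignored_topics
        }
        if not pending or clock() >= deadline:
            return pending
        sleep(poll_seconds)
```

A rolling restart would call this after each broker comes back and proceed only when the returned set is empty; a non-empty result after the cap would be surfaced to the operator rather than silently ignored. One possible probe, again as an assumption, would parse the output of `kafka-topics.sh --describe --under-replicated-partitions`.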