[ 
https://issues.apache.org/jira/browse/AMBARI-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated AMBARI-24719:
---------------------------------
    Summary: Kafka Rolling Restart causes outage(s) due to not checking for 
under replicated partitions  (was: Kafka Rolling Restart causes outage(s))

> Kafka Rolling Restart causes outage(s) due to not checking for under 
> replicated partitions
> ------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-24719
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24719
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-server
>    Affects Versions: 2.6.2
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> Ambari causes Kafka topic partition outages during rolling restarts because 
> it only does a simplistic 2-minute wait between brokers and doesn't check the 
> state of partition replicas before taking another broker down.
> On busy Kafka clusters with lots of topics / partitions / data it might take a 
> while before in-sync replicas recover.
> Ambari should therefore check for any under-replicated partitions and wait as 
> long as it takes for them to recover before proceeding to the next broker. 
> There is however an issue in doing so: if there is a topic partition 
> with a replica that no longer exists (e.g. ambari_kafka_service_check) then it 
> will never recover, so there needs to be some thoughtful handling around that.
> This might be solved by AMBARI-24203 but I'm not sure it is tied in properly 
> to the rolling restarts or what the timeout policy or time interval is for 
> it, or whether it takes the above paragraph into account.
> This could also have been easily offset if Ambari had proper extensible 
> checking as raised in AMBARI-24381.
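
The check requested above could look roughly like the sketch below. It is illustrative only and is not Ambari's code: PartitionState, classify, wait_until_safe and the known_brokers set are hypothetical names, and how the partition state is fetched from the cluster is left to the caller. It assumes that a partition whose out-of-sync replicas all sit on broker ids that are no longer part of the cluster can never recover, so it is reported rather than waited on. The optional timeout is also an assumption; the default waits as long as it takes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one topic partition."""
    topic: str
    partition: int
    replicas: Tuple[int, ...]  # broker ids assigned to the partition
    isr: Tuple[int, ...]       # broker ids currently in sync


def classify(
    partitions: Iterable[PartitionState], known_brokers: Set[int]
) -> Tuple[List[PartitionState], List[PartitionState]]:
    """Split under-replicated partitions into (recoverable, orphaned).

    recoverable: at least one out-of-sync replica is on a known broker.
    orphaned: every out-of-sync replica is on a broker that no longer exists.
    """
    recoverable, orphaned = [], []
    for p in partitions:
        missing = set(p.replicas) - set(p.isr)
        if not missing:
            continue  # fully in sync
        if missing & known_brokers:
            recoverable.append(p)
        else:
            orphaned.append(p)
    return recoverable, orphaned


def wait_until_safe(
    fetch_state: Callable[[], Iterable[PartitionState]],
    known_brokers: Set[int],
    timeout_s: Optional[float] = None,  # None: wait as long as it takes
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, List[PartitionState]]:
    """Poll until no recoverable under-replicated partitions remain.

    Returns (safe_to_restart_next_broker, orphaned_partitions).
    """
    deadline = None if timeout_s is None else clock() + timeout_s
    while True:
        recoverable, orphaned = classify(fetch_state(), known_brokers)
        if not recoverable:
            return True, orphaned
        if deadline is not None and clock() >= deadline:
            return False, orphaned
        sleep(poll_s)
```

A rolling restart would call wait_until_safe after each broker comes back and only move on to the next broker when the first element of the result is True. The orphaned list could be logged or surfaced as a warning so that a topic such as ambari_kafka_service_check does not stall the restart indefinitely.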



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
