Vadim,

Controlled shutdown takes two parameters: the number of retries and the shutdown timeout. On each retry, controlled shutdown attempts to move leaders off the broker that is being shut down. If it runs out of retries, it proceeds to shut down the broker even if the broker still hosts a few leaders.

At LinkedIn, the script that bounces Kafka brokers waits for the under-replicated partition count to drop to 0 before invoking controlled shutdown on the next broker. The aim is to avoid the data loss that occurs if you shut down a broker that still has some leaders. If the under-replicated count never drops to 0, that indicates a bug in the Kafka code, and the script does not bounce any more brokers in the cluster.

We measure the time it takes to move "n" leaders off a broker and configure the shutdown timeout accordingly. We also set the retries to a small number (2 or 3). If controlled shutdown exhausts its retries, the broker shuts itself down anyway.

In general, you want to avoid hard killing (kill -9) a broker, because the broker will then run a long log recovery process on startup. That significantly delays the time the broker takes to rejoin the cluster.
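To make the bounce logic above concrete, here is a rough Python sketch. It is not the actual LinkedIn script. The function name `rolling_bounce`, its parameters, and the two callables are all hypothetical: `under_replicated_count` stands in for however you read the cluster's under-replicated partition count, and `bounce_broker` for whatever performs the controlled shutdown and restart of a single broker.

```python
import time
from typing import Callable, Iterable, List


def rolling_bounce(
    brokers: Iterable[str],
    under_replicated_count: Callable[[], int],  # hypothetical: reads the cluster-wide count
    bounce_broker: Callable[[str], None],       # hypothetical: controlled shutdown + restart
    poll_interval_s: float = 5.0,
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Bounce brokers one at a time; return the brokers that were bounced."""
    bounced: List[str] = []
    for broker in brokers:
        waited = 0.0
        # Do not touch the next broker until every partition is fully replicated.
        while under_replicated_count() > 0:
            if waited >= max_wait_s:
                # The count never reached 0: stop here and leave the
                # remaining brokers alone so a person can investigate.
                return bounced
            sleep(poll_interval_s)
            waited += poll_interval_s
        bounce_broker(broker)
        bounced.append(broker)
    return bounced
```

The `max_wait_s` cutoff is an assumption added for the sketch. The point is only that the script halts instead of bouncing further brokers when the count stays above 0.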
Thanks,
Neha

On Sun, Aug 18, 2013 at 3:33 PM, Vadim Keylis <vkeylis2...@gmail.com> wrote:

> Good afternoon. We are running kafka on centos linux. I enabled controlled
> shutdown in the property file. We are starting/stopping kafka using init
> script. The init script will issue term signal first followed 3 seconds
> later by kill signal. Is that right process to shutdown kafka? Which
> startup/shutdown/restart script you guys use? What shutdown process
> linkedin uses? What side effects could be after kafka service is killed
> uncleanly using kill -9 signal?
>
> Thanks,
> Vadim