Hi,

Our production environment runs a Kafka 0.9.0.1 cluster of 12 m3.large nodes. Each broker hosts ~450 partitions and is the leader for 30-40% of them. The average load is ~3K messages/s, with ~10 MB/s of bytes in and ~60 MB/s of bytes out.
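For a rough sense of scale, the figures above can be turned into per-broker numbers. This is only a back-of-the-envelope sketch, not a measurement: it assumes leaders are spread evenly across brokers, that a failed broker's leaderships are split evenly among the remaining 11, and that 1 MB is 10^6 bytes. All names in it are made up for illustration.

```python
# Back-of-the-envelope numbers derived from the cluster figures quoted above.
# Assumptions (not measured): leaders are evenly spread, a failed broker's
# leaderships are shared equally by the survivors, and 1 MB = 10**6 bytes.

BROKERS = 12
PARTITIONS_PER_BROKER = 450          # approximate
LEADER_SHARE_RANGE = (0.30, 0.40)    # fraction of hosted partitions a broker leads
MSGS_PER_SEC = 3_000                 # approximate
BYTES_IN_PER_SEC = 10 * 10**6
BYTES_OUT_PER_SEC = 60 * 10**6


def leaders_per_broker(share):
    """Partitions a single broker leads, for a given leader share."""
    return PARTITIONS_PER_BROKER * share


def extra_leaders_per_survivor(share, brokers=BROKERS):
    """Leaderships each surviving broker would pick up if one broker is lost."""
    return leaders_per_broker(share) / (brokers - 1)


# Average message size and the ratio of outgoing to incoming bytes.
avg_message_bytes = BYTES_IN_PER_SEC / MSGS_PER_SEC
out_to_in_ratio = BYTES_OUT_PER_SEC / BYTES_IN_PER_SEC

for share in LEADER_SHARE_RANGE:
    print(
        f"leader share {share:.0%}: "
        f"{leaders_per_broker(share):.0f} leaders per broker, "
        f"{extra_leaders_per_survivor(share):.1f} extra per surviving broker"
    )
print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"bytes out / bytes in: {out_to_in_ratio:.1f}")
```

Under these assumptions, losing one broker moves only a modest number of leaderships to each remaining node, which is why a single node jumping to around 100% CPU looks out of proportion to us.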
We observed strange behaviour when taking one instance down by terminating it on AWS. After the Kafka instance went down, leadership of the partitions it had been leading was transferred to other nodes. CPU usage increased on all nodes, and one of them started consuming around 100% CPU. Restarting that node does not help, because the high CPU usage is then picked up by another node. This behaviour lasts for around 30 minutes. In two months, we have experienced this issue several times a day.

Do you know anything about this problem?

--
With great enthusiasm,
Andrey