We are seeing some strange behavior from our brokers after we had to change their log retention policy yesterday. A short-lived spike in producer traffic pushed the brokers very close to their maximum disk space. Our normal retention policy keeps 6-7 days of data, but since our consumers were caught up, we switched the retention policy from hour-based to size-based and cut the size down to a safe number (half of the maximum disk space; normal usage is around 30%). After restarting the brokers, we started seeing multiple producer-side failures: the FailedSends metric shows almost 10% failures, and FailedProduceRequestsPerSec on the broker side is non-zero. The traces from one of the brokers look like this:
[KafkaApi-8] Produce request with correlation id 2050686 from client xxx on partition [TOPIC_NAME,18] failed due to Partition [TOPIC_NAME,18] doesn't exist on 8 (kafka.server.KafkaApis)
[KafkaApi-8] Produce request with correlation id 2102325 from client xxx on partition [TOPIC_NAME,28] failed due to Partition [TOPIC_NAME,28] doesn't exist on 8 (kafka.server.KafkaApis)

We checked and made sure those partitions are present on that broker. Any help is appreciated. Also, is there a recommended way to purge log data from the brokers quickly?

Thanks,
Sadhan
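P.S. In case it matters, the retention change we made was along these lines in server.properties (the byte value below is a placeholder, not our exact number):

```properties
# Old: time-based retention, roughly 6-7 days
#log.retention.hours=168

# New: size-based retention. Note that log.retention.bytes is applied
# per partition, not per broker, so the effective total is
# (partitions on the broker) x (this value).
log.retention.bytes=1073741824
```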
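P.S. On the purge question: one thing we considered (assuming our Kafka version supports per-topic config overrides) is temporarily shrinking retention on the affected topic so the cleaner deletes old segments, then removing the override. Topic name and ZooKeeper string below are placeholders. Is this considered safe?

```shell
# Temporarily lower retention so old segments get deleted quickly
bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic TOPIC_NAME \
  --config retention.ms=60000

# ...wait for the log cleaner to run, then remove the override so the
# broker-level retention settings apply again
bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic TOPIC_NAME \
  --deleteConfig retention.ms
```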