I have a Kafka cluster with 40 partitions handling about 2M requests per second, each roughly 1.5k bytes. I consume with 4 machines, 10 partitions per machine; there are also other consumers running on several other machines. The tricky thing is that some of the partitions always end up lagging, but if I restart the consumer with a changed fetch buffer size (bigger or smaller, or sometimes with no change at all), it catches up.
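To put rough numbers on the load, here is a back-of-the-envelope calculation. It assumes "1.5k bytes" means 1,500 bytes, that each request is one message, and that traffic is spread evenly over partitions and machines, which may not hold in practice. The function name is just for illustration.

```python
# Back-of-the-envelope throughput estimate from the figures in the question.
MESSAGES_PER_SEC = 2_000_000  # "2M requests every second"
BYTES_PER_MESSAGE = 1_500     # assumption: "1.5k bytes" = 1,500 bytes
PARTITIONS = 40
CONSUMER_MACHINES = 4


def even_share_mb_per_sec(messages_per_sec, bytes_per_message, shares):
    """MB/s (10^6 bytes) handled by each share if load is split evenly."""
    return messages_per_sec * bytes_per_message / shares / 1e6


total = even_share_mb_per_sec(MESSAGES_PER_SEC, BYTES_PER_MESSAGE, 1)
per_partition = even_share_mb_per_sec(MESSAGES_PER_SEC, BYTES_PER_MESSAGE, PARTITIONS)
per_machine = even_share_mb_per_sec(MESSAGES_PER_SEC, BYTES_PER_MESSAGE, CONSUMER_MACHINES)

print(f"total: {total} MB/s")
print(f"per partition: {per_partition} MB/s")
print(f"per consumer machine: {per_machine} MB/s")
```

If those figures are right, each consumer machine has to sustain a very high read rate, so even a modest imbalance could be enough to make some partitions fall behind.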
Here is my guess: the Kafka cluster cannot dynamically balance bandwidth across its brokers, so in some cases the load becomes unbalanced and falls on only some of the brokers, and consumption from those brokers becomes even slower than the producing rate. Reconnecting then gives it a chance to rebalance. I just want to confirm whether this is what is happening.
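One way I could test this guess is to add up the lag per partition by the broker that leads each partition, and see whether the lag concentrates on a few brokers. Below is a minimal sketch of that grouping. The offsets, the partition-to-leader mapping, and the broker names are all made-up sample values, and `lag_by_broker` is a hypothetical helper; the real numbers would have to come from the cluster's own tooling.

```python
from collections import defaultdict


def lag_by_broker(end_offsets, committed_offsets, leaders):
    """Sum consumer lag (end offset minus committed offset) per leader broker."""
    totals = defaultdict(int)
    for partition, end in end_offsets.items():
        # Treat a missing committed offset as 0 and never report negative lag.
        lag = max(end - committed_offsets.get(partition, 0), 0)
        totals[leaders[partition]] += lag
    return dict(totals)


# Hypothetical sample values, for illustration only.
end_offsets = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
committed_offsets = {0: 990, 1: 400, 2: 995, 3: 350}
leaders = {0: "broker-1", 1: "broker-2", 2: "broker-1", 3: "broker-2"}

result = lag_by_broker(end_offsets, committed_offsets, leaders)
print(result)
```

If the lag turned out to be spread evenly across brokers, that would count against the broker-imbalance explanation.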