Hi.
When a broker crashes and restarts after a long outage (say 30 minutes), it has to do a significant amount of work to re-sync all of its replicas with the current leaders. If the broker is the preferred replica for some partitions, it may become the leader for those partitions via a preferred replica leader election while it is still catching up on others. This can lead to high incoming throughput on the broker during the catch-up phase and cause backpressure with certain storage volumes (which have a fixed maximum throughput). The backpressure can slow down recovery and manifests as client application errors when producing to or consuming from the recovering broker.

I am looking for ways to mitigate this problem. The two solutions I am aware of are:

1. Scale out the cluster to more brokers, so that the replication traffic per broker during recovery is smaller.
2. Keep automatic preferred replica leader elections disabled, and manually run a single preferred replica leader election after the broker has come back up and fully caught up.

Are there other solutions?
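For anyone wanting to try option 2, here is a minimal sketch of the relevant broker setting, assuming a stock Apache Kafka distribution (the exact property name and tooling may differ depending on your version or vendor):

```
# server.properties (broker configuration)
# Disable the controller's automatic preferred replica leader election,
# so a recovering broker is not made leader while it is still catching up.
auto.leader.rebalance.enable=false
```

Once the broker has fully caught up, a single preferred election can then be triggered manually, e.g. with `bin/kafka-leader-election.sh --bootstrap-server <broker>:9092 --election-type PREFERRED --all-topic-partitions` on Kafka 2.4+ (older versions ship `kafka-preferred-replica-election.sh` instead).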