Hello, I am not an active Kafka user, but we used it at my previous company.
I posted your query to ChatGPT, and its response is given below. Hope this helps.

Below are the main areas Kafka experts usually tune for this scenario.

---

## 1️⃣ Increase Network and I/O Thread Capacity

When thousands of clients reconnect simultaneously, the broker may not have enough threads to process requests.

**Key settings**

```properties
# Each setting takes a single integer; the suggested ranges are 8-16 and 16-32.
num.network.threads=16
num.io.threads=32
```

**Explanation**

* `num.network.threads` → threads that read requests from client connections and write responses back
* `num.io.threads` → threads that process requests, which may include disk I/O

For a broker handling **200K connections**, the defaults (3 and 8) are usually too low.

---

## 2️⃣ Increase Request Queue Limits

During a connection spike, the broker's request queue may fill up.

```properties
socket.request.max.bytes=104857600
# Single integer; the suggested range is 500-1000 (the default is 500).
queued.max.requests=1000
```

`queued.max.requests` caps how many requests can wait for the I/O threads before the network threads stop reading new ones, so raising it gives the broker more room to absorb a burst. `socket.request.max.bytes` is shown at its default of 104857600 (100 MB); it limits the size of a single request, not the queue depth.

---

## 3️⃣ Tune Connection Limits

Kafka can struggle with sudden reconnection storms.

```properties
max.connections=300000
max.connections.per.ip=10000
```

Adjust according to client distribution. Both settings are caps, not capacity: they are effectively unlimited by default, so setting them protects the broker from overload but does not help it accept more connections.

---

## 4️⃣ TLS / SSL Handshake Overhead

Since you're using **SSL**, connection spikes can cause heavy **CPU load due to TLS handshakes**.

Things to consider:

### Enable TLS session reuse

If your clients support it, enable session caching to reduce handshake cost.

### Increase network threads

```properties
num.network.threads=16
```

Kafka has no separate SSL thread pool; the TLS handshake is handled by the network threads.

---

## 5️⃣ Metadata Request Storm

When brokers restart or leadership changes, **clients request metadata simultaneously**.

Mitigation:

### Keep metadata propagation fast

Ensure controllers are healthy.

### Adjust client settings

Clients should avoid aggressive metadata refresh:

```properties
metadata.max.age.ms=300000
reconnect.backoff.ms=100
reconnect.backoff.max.ms=1000
```

This spreads reconnect attempts. Note that these values are close to the Java client defaults (300000, 50, and 1000); a larger `reconnect.backoff.max.ms` spreads a reconnect storm over a longer window.

---

## 6️⃣ File Descriptor Limits

200K connections require high OS limits, because each connection uses one file descriptor on the broker.
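Since the limit only matters relative to what the broker actually has open, it can help to compare the two for the running process. The following is a rough, Linux-only sketch that reads `/proc`; the function names `fd_usage_pct` and `fd_report` are made up for illustration and are not part of Kafka or any standard tool.

```shell
#!/usr/bin/env bash
# Sketch (Linux only): report how many file descriptors a process has open
# relative to its soft limit. Function names are illustrative assumptions.

# Integer percentage of the limit in use, e.g. fd_usage_pct 200000 500000
fd_usage_pct() {
  local open="$1" limit="$2"
  echo $(( open * 100 / limit ))
}

# Print open descriptors, soft limit, and usage for a PID (default: this shell).
fd_report() {
  local pid="${1:-$$}" open limit
  # Each entry under /proc/<pid>/fd is one open descriptor (sockets, log segments, ...).
  open=$(ls "/proc/${pid}/fd" | wc -l)
  # In /proc/<pid>/limits, the "Max open files" row holds the soft limit in column 4.
  limit=$(awk '/^Max open files/ {print $4}' "/proc/${pid}/limits")
  case "$limit" in
    ''|*[!0-9]*) echo "pid=${pid} open=${open} soft_limit=${limit} used_pct=n/a" ;;
    *) echo "pid=${pid} open=${open} soft_limit=${limit} used_pct=$(fd_usage_pct "$open" "$limit")" ;;
  esac
}
```

To use it, source the file and call `fd_report <broker-pid>` as the broker's user or as root, since `/proc/<pid>/fd` is not readable for other users' processes. Unlike a limit read from an interactive shell, this shows the limit the broker process was actually started with.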
Check the current limit:

```bash
ulimit -n
```

Recommended:

```
500000+
```

Also tune Linux:

```
net.core.somaxconn
net.ipv4.tcp_max_syn_backlog
```

---

## 7️⃣ Increase Listener Backlog

Kafka setting:

```properties
socket.listen.backlog.size=65535
```

This helps prevent dropped connections during bursts. The setting is available in Kafka 3.2 and later (the default is 50), and the kernel caps the effective backlog at `net.core.somaxconn`, so raise both together.

---

## 8️⃣ JVM and GC Tuning

Large connection counts increase heap pressure.

Typical settings:

```
-Xms16G
-Xmx16G
-XX:+UseG1GC
```

Avoid GC pauses during reconnect storms.

---

## 9️⃣ Consider Connection Pooling

200K connections is very large.

Kafka best practice:

* Use **fewer clients**
* Let each application process share a small number of **producer/consumer** instances (the Kafka producer is thread-safe and can be shared across threads)

This avoids thousands of separate connections.

---

## 🔟 Advanced Solution: Connection Rate Limiting

Some large deployments implement **connection rate limiting** at the load balancer or client side to avoid reconnection storms. Kafka 2.7 and later can also throttle new connections on the broker itself with `max.connection.creation.rate`.

---

# Typical Config for Large Kafka Brokers

Example tuning used in high-scale clusters:

```properties
num.network.threads=16
num.io.threads=32

queued.max.requests=1000

socket.request.max.bytes=104857600
socket.listen.backlog.size=65535

max.connections=300000
max.connections.per.ip=20000
```

---

💡 **Most likely root cause in your case**

Given your description, the bottleneck is usually one of:

* **TLS handshake CPU overload**
* **Network thread exhaustion**
* **Metadata request storm**

---

On Tue, 10 Mar, 2026, 1:28 am Nanda Naga via users, <[email protected]> wrote:

> Hi Kafka experts,
>
> I have a large-scale distributed system that takes around 200K connections
> to single broker on various topic partitions. I sometimes see my SSL port
> becomes unresponsive to client with high metadata latency when I sudden
> spike of around 30K connections (when a broker restarts or it become leader
> when any other high traffic broker goes for a repair). Is there a reliable
> way to tune the config to suit the load?
>
> Regards,
> Nanda
