Hello,

I am not an active Kafka user, but we used it at my previous company.

I posted your query to ChatGPT.

The response from ChatGPT is given below. Hope this helps.

Below are the main areas Kafka experts usually tune for this scenario.

---

## 1️⃣ Increase Network and I/O Thread Capacity

When thousands of clients reconnect simultaneously, the broker may not have
enough threads to process requests.

**Key settings**

```properties
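# Suggested ranges: 8 to 16 network threads, 16 to 32 I/O threads.
# Each property takes a single integer, so pick one value per setting.
num.network.threads=16
num.io.threads=32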
```

**Explanation**

* `num.network.threads` → threads that read requests from client sockets and write responses back
* `num.io.threads` → threads that process those requests, which may include disk I/O

For a broker handling **200K connections**, the defaults (3 and 8) are
usually too low.

---

## 2️⃣ Increase Socket Queue Limits

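During a connection spike, the broker's internal request queue can fill up. Once it is full, the network threads block and stop reading new requests until the I/O threads catch up.

```properties
# Maximum size of a single request in bytes; 104857600 (100 MB) is already the default.
socket.request.max.bytes=104857600
# Suggested range: 500 (the default) to 1000. The property takes a single integer.
queued.max.requests=1000
```

A larger queue lets the broker absorb a burst of requests before the network threads stall, at the cost of more memory held by queued requests.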

---

## 3️⃣ Tune Connection Limits

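Kafka can struggle with sudden reconnection storms. Both settings below are upper bounds, and both default to 2147483647 (effectively unlimited), so setting them caps connections rather than adding capacity.

```properties
max.connections=300000
max.connections.per.ip=10000
```

Choose the per-IP value according to how your clients are distributed across source addresses, for example when many clients sit behind one NAT gateway.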
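To see how clients are distributed across source addresses, a minimal Python sketch like the one below can count connections per remote IP and derive a candidate per-IP cap. The `margin` factor, the function names, and the sample addresses are assumptions for illustration; the real list of remote addresses would come from your own connection inventory.

```python
from collections import Counter

def busiest_sources(remote_ips, top=3):
    """Return the `top` source IPs with the most connections."""
    return Counter(remote_ips).most_common(top)

def suggest_per_ip_limit(remote_ips, margin=1.5):
    """Peak connections from any single IP, scaled by an assumed safety margin."""
    peak = max(Counter(remote_ips).values())
    return int(peak * margin)

# Hypothetical sample: one entry per established connection.
sample = ["10.0.0.1"] * 4 + ["10.0.0.2"] * 2 + ["10.0.0.3"]
print(busiest_sources(sample))
print(suggest_per_ip_limit(sample))
```

A cap set below the observed peak would reject legitimate clients, so it may be safer to start from measured numbers than from a round figure.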

---

## 4️⃣ TLS / SSL Handshake Overhead

Since you're using **SSL**, connection spikes can cause heavy **CPU load
due to TLS handshakes**.

Things to consider:

### Enable TLS session reuse

If your clients support it, enable session caching to reduce handshake cost.

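### Give TLS handshakes more network threads

```properties
num.network.threads=16
```

Kafka has no separate SSL thread pool. TLS handshakes run on the network threads, so a burst of handshakes competes with regular request reads and response writes.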
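As a rough back-of-the-envelope check, the Python sketch below estimates how long a burst of handshakes keeps the network threads busy. The 30K figure comes from the question; the 2 ms of CPU per handshake is purely an assumed number (measure your own), and the model assumes each thread has a core to itself and does nothing else.

```python
def handshake_drain_seconds(new_connections, cpu_ms_per_handshake, network_threads):
    """Lower bound on the time to finish a burst of TLS handshakes."""
    if network_threads <= 0:
        raise ValueError("network_threads must be positive")
    total_cpu_ms = new_connections * cpu_ms_per_handshake
    return total_cpu_ms / network_threads / 1000.0

# 30K reconnects, assumed 2 ms CPU per handshake, for several thread counts.
for threads in (3, 8, 16):
    seconds = handshake_drain_seconds(30_000, 2.0, threads)
    print(f"{threads} network threads -> about {seconds} s")
```

Under these assumptions the default of 3 threads needs about 20 seconds for the burst, while 16 threads need under 4 seconds, which is in line with the unresponsive SSL port described in the question.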

---

## 5️⃣ Metadata Request Storm

When brokers restart or leadership changes, **clients request metadata
simultaneously**.

Mitigation:

### Keep metadata propagation fast

Brokers learn about leadership changes from the controller, so make sure the controller is healthy and not overloaded.

### Adjust client settings

Clients should avoid aggressive metadata refresh:

```properties
metadata.max.age.ms=300000
reconnect.backoff.ms=100
reconnect.backoff.max.ms=1000
```

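Note that `metadata.max.age.ms=300000` and `reconnect.backoff.max.ms=1000` are already the client defaults, and the default `reconnect.backoff.ms` is 50. The backoff grows exponentially with each consecutive failed attempt up to the maximum, with random jitter added, so raising `reconnect.backoff.max.ms` (for example to several seconds) is what actually spreads reconnect attempts over a wider window.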
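To illustrate how the backoff ceiling affects the spread, here is a simplified Python model of exponential backoff with jitter. It is a sketch of the general technique, not the Kafka client's actual code; the doubling factor, the 20% jitter, the failure count, and the function names are assumptions for the illustration.

```python
import random

def reconnect_backoff_ms(failures, base_ms=100, max_ms=1000, jitter=0.2, rng=random):
    """Exponential growth per consecutive failure, capped, then randomised by +/- jitter."""
    capped = min(base_ms * (2 ** failures), max_ms)
    return capped * rng.uniform(1 - jitter, 1 + jitter)

def simulate(clients, failures, **kwargs):
    """Sorted retry delays for `clients` clients that all failed `failures` times."""
    rng = random.Random(42)  # fixed seed so runs are repeatable
    return sorted(reconnect_backoff_ms(failures, rng=rng, **kwargs) for _ in range(clients))

narrow = simulate(clients=30_000, failures=5, max_ms=1000)
wide = simulate(clients=30_000, failures=5, max_ms=10_000)
print(f"max 1000 ms:  retries land between {narrow[0]:.0f} and {narrow[-1]:.0f} ms")
print(f"max 10000 ms: retries land between {wide[0]:.0f} and {wide[-1]:.0f} ms")
```

With the lower ceiling every client hits the cap and all 30K retries fall into a window of roughly 400 ms; with the higher ceiling the same clients are spread over a window more than three times as wide.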

---

## 6️⃣ File Descriptor Limits

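Each client connection consumes one file descriptor, and the broker also holds log segment and index files open, so 200K connections require a high limit for the broker process.

Check the limit in the environment that starts the broker: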

```bash
ulimit -n
```

Recommended:

```
500000+
```

Also raise these Linux sysctls, which bound the accept queue and the SYN queue:

```
net.core.somaxconn
net.ipv4.tcp_max_syn_backlog
```
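Since `ulimit -n` reports the limit of the current shell rather than of the running broker, it can help to read the broker's own limits from `/proc/<pid>/limits`. The Python sketch below parses that file's text; the helper names and the `headroom` factor of 2.5 (chosen only so that 200K connections map to the 500000 figure above) are assumptions.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from /proc/<pid>/limits text; None means unlimited."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            to_int = lambda v: None if v == "unlimited" else int(v)
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")

def has_headroom(soft_limit, expected_connections, headroom=2.5):
    """True if the soft limit covers the connections times an assumed headroom factor."""
    return soft_limit is None or soft_limit >= expected_connections * headroom

# Sample text in the format Linux uses. On a broker host you would instead
# read the real file, e.g. open(f"/proc/{pid}/limits").read().
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
"""
soft, hard = parse_max_open_files(SAMPLE)
print(soft, hard, has_headroom(soft, 200_000))
```

The headroom matters because sockets are not the only consumers of descriptors on a broker.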

---

## 7️⃣ Increase Listener Backlog

Kafka setting:

```properties
socket.listen.backlog.size=65535
```

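A larger backlog lets more pending connections wait in the accept queue instead of being dropped during a burst. The kernel silently caps the value at `net.core.somaxconn`, so raise that sysctl as well. This setting requires Kafka 3.2 or later.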
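On Linux the backlog passed to `listen()` is silently capped at `net.core.somaxconn`, so it is worth checking the effective value. A minimal Python sketch, with hypothetical helper names:

```python
def effective_backlog(requested, somaxconn):
    """Backlog the kernel will actually use for a listening socket."""
    return min(requested, somaxconn)

def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the current kernel limit (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())

# Example with an assumed somaxconn of 4096; call read_somaxconn() on the
# broker host to use the real value.
print(effective_backlog(65535, 4096))
```

If the printed value is lower than the configured backlog size, the Kafka setting alone has no effect until the sysctl is raised.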

---

## 8️⃣ JVM and GC Tuning

Large connection counts increase heap pressure.

Typical settings:

```
-Xms16G
-Xmx16G
-XX:+UseG1GC
```

Avoid GC pauses during reconnect storms.

---

## 9️⃣ Consider Connection Pooling

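200K connections to a single broker is very large.

Kafka best practice:

* Use **fewer client instances**
* Share one producer instance across the threads of an application (`KafkaProducer` is thread-safe)

Each client instance opens its own connections to the brokers, so consolidating clients directly reduces the connection count.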

---

## 🔟 Advanced Solution: Connection Rate Limiting

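Some large deployments implement **connection rate limiting** at the load
balancer or client side to avoid reconnection storms. Kafka 2.7 and later also
offer a broker-side limit, `max.connection.creation.rate`, which can be set
per listener.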
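One common way to implement such a limit on the client side is a token bucket. The Python sketch below is a generic illustration, not part of any Kafka API; the class name, the rate, and the capacity are assumptions, and a fake clock is used so the demo does not depend on wall time.

```python
import time

class TokenBucket:
    """Allow about `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Demo: 50 connection attempts at the same instant, then 50 more half a second later.
fake_now = [0.0]
bucket = TokenBucket(rate=100, capacity=10, clock=lambda: fake_now[0])
accepted_at_once = sum(bucket.try_acquire() for _ in range(50))
fake_now[0] += 0.5
accepted_later = sum(bucket.try_acquire() for _ in range(50))
print(accepted_at_once, accepted_later)
```

A client would call `try_acquire()` before each connection attempt and sleep briefly when it returns `False`, which turns a burst of reconnects into a steady trickle.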

---

# Typical Config for Large Kafka Brokers

Example starting point that combines the settings above (tune for your hardware and workload):

```properties
num.network.threads=16
num.io.threads=32
queued.max.requests=1000
socket.request.max.bytes=104857600
socket.listen.backlog.size=65535
max.connections=300000
max.connections.per.ip=10000
```

---

💡 **Most likely root cause in your case**

Given your description, the bottleneck is most likely one or more of:

* **TLS handshake CPU overload**
* **Network thread exhaustion**
* **Metadata request storm**

---


On Tue, 10 Mar, 2026, 1:28 am Nanda Naga via users, <[email protected]>
wrote:

> Hi Kafka experts,
>
> I have a large-scale distributed system that takes around 200K connections
> to single broker on various topic partitions. I sometimes see my SSL port
> becomes unresponsive to client with high metadata latency when I sudden
> spike of around 30K connections (when a broker restarts or it become leader
> when any other high traffic broker goes for a repair). Is there a reliable
> way to tune the config to suit the load?
>
> Regards,
> Nanda
>
>
