Hello,
We are experiencing stability issues in our Kafka architecture during
chunked file transfers, especially during load spikes or after broker
restarts. Here are the details:
------------------------------
📦 *Architecture*:
- KafkaJS v2.2.4 used as both *producer and consumer*.
- Kafka cluster with *3 brokers*, connected to a *shared NAS*.
- Files are *split into 500 KB chunks* and sent to a Kafka topic.
- A dedicated consumer reassembles files on the NAS.
- Each file is assigned to the *topic/broker of the first chunk* to preserve order.
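For context, the chunking step looks roughly like this (a simplified sketch, not our exact production code; `CHUNK_SIZE` and the message shape are illustrative):

```js
// Simplified sketch of our chunking step.
// Splits a file buffer into fixed-size chunks; each chunk later becomes
// one Kafka message keyed by the file ID so ordering is preserved.
const CHUNK_SIZE = 500 * 1024; // 500 KB

function chunkBuffer(buf, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0; offset < buf.length; offset += chunkSize) {
    chunks.push(buf.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

// Each chunk is then wrapped as a message, e.g.:
// { key: fileId, value: chunk, headers: { seq: String(i), total: String(n) } }
```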
------------------------------
❌ *Observed issues*:
1. *Request Produce(version: 7) timed out errors during load spikes*:
   - For about *15 minutes*, KafkaJS producers fail with the error *Request Produce(key: 0, version: 7) timed out*.
   - This generally occurs *during traffic spikes*, typically when *multiple files are being sent simultaneously*.
2. *Network behavior – TCP Window Full*:
   - During this period, *TCP Window Full* messages appear on the network.
   - No *CPU or RAM spikes* are observed while the system is blocked.
   - However, a significant *increase in resource usage (CPU/RAM)* occurs *when the system recovers*, suggesting a *sudden backlog clearance*.
3. *Recovery after the blockage*:
   - Connections reset.
   - A rollback appears to occur, after which messages are processed quickly and without errors.
   - The issue may recur at the next volume spike.
4. *Amplified behavior and prolonged instability after a rolling restart*:
   - The problem is more frequent after a *rolling restart of the brokers* (spaced 20 to 30 minutes apart).
   - Instability can persist for *several days or even weeks* before subsiding.
   - This suggests a *desynchronization or prolonged delay in partition reassignment, metadata updates, or coordination*.
------------------------------
⚙️ *KafkaJS configuration*:

```js
{
  requestTimeout: 30000, // 30s
  retry: {
    initialRetryTime: 1000, // 1s
    retries: 1500 // max value; in practice we rarely exceed ~30 retries (~15 minutes)
  }
}
```
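For reference, the KafkaJS documentation also lists `factor`, `multiplier`, and `maxRetryTime` in the retry object; a hypothetical progressive-backoff variant we are considering might look like this (the values are illustrative assumptions, not something we have validated):

```js
// Hypothetical tuned retry config (illustrative values, not validated).
const tunedConfig = {
  requestTimeout: 30000,
  retry: {
    initialRetryTime: 300, // first retry after ~300 ms
    factor: 0.2,           // randomization factor (jitter)
    multiplier: 2,         // exponential growth between retries
    maxRetryTime: 30000,   // cap on a single retry delay
    retries: 10            // fail fast instead of retrying for ~15 minutes
  }
};
```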
------------------------------
❓ *Questions*:
1. Is the *Request Produce(version: 7) timed out* error generally related to *broker congestion*, *network issues*, or *partition imbalance*?
2. Do the TCP Window Full messages indicate *network or broker buffer saturation*? Which Kafka logs or metrics would you recommend monitoring?
3. Could assigning each file strictly to a single broker/topic lead to *local saturation*?
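To illustrate what we mean: with key-based partitioning, every chunk of one file hashes to the same partition (which preserves order), so a few large files can concentrate all of their traffic on one partition's leader broker. A toy sketch of the idea (the hash below is illustrative only, not Kafka's actual murmur2 partitioner):

```js
// Toy illustration: key-based partition assignment sends all chunks of
// one file to a single partition. The hash is illustrative, NOT the
// real murmur2 hash used by Kafka's default partitioner.
function toyPartition(key, numPartitions) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % numPartitions;
}

// All chunks keyed "file-A" land on the same partition, every time:
const p = toyPartition("file-A", 6);
```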
4. Could the KafkaJS retry configuration (1500 max retries, though in practice rarely more than ~30) *exacerbate congestion*? Would a *progressive backoff strategy* be preferable?
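To make the question concrete, here is the kind of retry schedule we have in mind (our own approximation of the documented `initialRetryTime`/`multiplier`/`maxRetryTime` semantics, with jitter ignored):

```js
// Sketch: successive retry delays under exponential backoff.
// Approximates initialRetryTime/multiplier/maxRetryTime semantics;
// the jitter ("factor") applied by KafkaJS is deliberately ignored.
function backoffDelays(initial, multiplier, max, retries) {
  const delays = [];
  let d = initial;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(d, max));
    d *= multiplier;
  }
  return delays;
}

// backoffDelays(300, 2, 30000, 9)
// → 300, 600, 1200, 2400, 4800, 9600, 19200, 30000, 30000
```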
5. What are the *best KafkaJS practices* for chunked file flows: compression, batching, flush intervals, etc.?
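For reference, this is roughly the shape we have in mind, using KafkaJS's documented `producer.send` and `CompressionTypes.GZIP` (the batching helper and the batch size of 20 are our own illustrative assumptions):

```js
// Sketch: group chunk messages so each producer.send() carries several
// chunks in one request (batch size of 20 is an assumption).
function batchMessages(messages, batchSize) {
  const batches = [];
  for (let i = 0; i < messages.length; i += batchSize) {
    batches.push(messages.slice(i, i + batchSize));
  }
  return batches;
}

// With KafkaJS this would then be sent roughly as:
//   const { CompressionTypes } = require('kafkajs');
//   for (const batch of batchMessages(allChunks, 20)) {
//     await producer.send({
//       topic: 'file-chunks',
//       compression: CompressionTypes.GZIP,
//       messages: batch,
//     });
//   }
```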
6. Can a *rolling restart* of brokers cause *temporary metadata desynchronization* or *excessive client wait times*? And why might this instability last so long?
Thank you in advance for your help. Any diagnostic insights or optimization
recommendations would be greatly appreciated.
Best regards,
Mathias VANHEMS