Hello,

We are experiencing stability issues in our Kafka architecture during chunked file transfers, especially during load spikes or after broker restarts. Here are the details:

------------------------------
📦 *Architecture*:
- KafkaJS v2.2.4 used as both *producer and consumer*.
- Kafka cluster with *3 brokers*, connected to a *shared NAS*.
- Files are *split into 500 KB chunks* and sent to a Kafka topic.
- A dedicated consumer reassembles the files on the NAS.
- Each file is assigned to the *topic/broker of its first chunk* to preserve ordering.

------------------------------
❌ *Observed issues*:

1. *Request Produce(version: 7) timed out errors during load spikes*:
   - For about *15 minutes*, KafkaJS producers fail with the error: *Request Produce(key: 0, version: 7) timed out*.
   - This generally occurs *during traffic spikes*, typically when *multiple files are being sent simultaneously*.

2. *Network behavior – TCP Window Full*:
   - During this period, *TCP Window Full* messages appear on the network.
   - No *CPU or RAM spikes* are observed during the blockage.
   - However, a significant *increase in resource usage (CPU/RAM)* occurs *when the system recovers*, suggesting a *sudden backlog clearance*.

3. *Recovery after the blockage*:
   - Connections reset.
   - A rollback seems to occur, then messages are processed quickly without errors.
   - The issue may reoccur at the next volume spike.

4. *Amplified behavior and prolonged instability after a rolling restart*:
   - The problem is more frequent after a *rolling restart of the brokers* (spaced 20 to 30 minutes apart).
   - Instability can persist for *several days or even weeks* before subsiding.
   - This suggests a *desynchronization, or a prolonged delay in partition reassignment, metadata updates, or coordination*.

------------------------------
⚙️ *KafkaJS configuration*:

{
  requestTimeout: 30000,      // 30 s
  retry: {
    initialRetryTime: 1000,   // 1 s
    retries: 1500             // max value, but in practice we rarely exceed ~30 retries (~15 minutes)
  }
}

------------------------------
❓ *Questions*:

1. Is the Request Produce(version: 7) timed out error generally related to *broker congestion*, *network issues*, or *partition imbalance*?
2. Do the *TCP Window Full* messages indicate *network or broker buffer saturation*? Are there Kafka logs or metrics you would recommend monitoring?
3. Could assigning each file strictly to a single broker/topic lead to *local saturation*?
4. Could our KafkaJS retry configuration (1500 max retries, though rarely exceeding 30 in practice) *exacerbate congestion*? Would a *progressive backoff strategy* be preferable?
5. What are the *best KafkaJS practices* for chunked file flows: compression, batching, flush intervals, etc.?
6. Can a *rolling restart* of the brokers cause *temporary metadata desynchronization* or *excessive client wait times*? And why might this instability last so long?

------------------------------
Thank you in advance for your help. Any diagnostic insights or optimization recommendations would be greatly appreciated.

Best regards,
Mathias VANHEMS
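P.S. To make question 4 concrete, here is a minimal sketch of how a progressive (exponential) backoff schedule would grow under KafkaJS-style retry options. `initialRetryTime`, `multiplier`, `maxRetryTime`, and `retries` are real KafkaJS retry settings; the randomization `factor` is omitted for clarity, and the specific values below are assumptions for illustration, not our current configuration:

```javascript
// Sketch: exponential backoff growth per retry attempt, KafkaJS-style.
// Each delay doubles (multiplier: 2) from initialRetryTime, capped at
// maxRetryTime, so retries spread out instead of hammering a congested broker.
function backoffSchedule({ initialRetryTime, multiplier, maxRetryTime, retries }) {
  const delays = [];
  let delay = initialRetryTime;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(delay, maxRetryTime));
    delay *= multiplier;
  }
  return delays; // delays in milliseconds
}

console.log(backoffSchedule({
  initialRetryTime: 300, // KafkaJS default is 300 ms
  multiplier: 2,
  maxRetryTime: 30000,   // cap at 30 s
  retries: 8,
}));
// → [ 300, 600, 1200, 2400, 4800, 9600, 19200, 30000 ]
```

With our current fixed `initialRetryTime: 1000` and no growth cap in play, every retry hits the broker at roughly the same cadence during a spike, which is what makes us suspect the retries themselves contribute to the congestion.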