Hello,

The bulk of this thread is to discuss KAFKA-9893, "Configurable TCP connection timeout and improve the initial metadata fetch" <https://issues.apache.org/jira/browse/KAFKA-9893>. IMHO it should be considered a BUG FIX and potentially be backported, but others may disagree. Let me tell you about my environment:
We run Kafka 2.2.x, 2.5.x and even 2.6.x. Our clients include Spark Streaming, Spring Boot/Data Kafka, and folks just using the Kafka producer directly. Clusters are 3-12 nodes, with 10 topics and 48 partitions. We actually have a chaos monkey in our UAT environment that shuts down brokers and entire datacenters/racks of brokers across our clusters each day, but simply shutting down brokers does not produce the problem.

What we observed: there is a huge distinction between the Kafka broker being down while the host is up, and Kafka being down because the *host is down.* In the first case the connection attempt is refused right away; in the second it just hangs until a TCP timeout fires. Before KAFKA-9893 the second case is handled poorly. If you look at how the metadata connection works, it randomly picks hosts from the list, and sometimes requires 2 random hosts for round-trip operations.

Here is what we did. We had all our apps running: Spark Streaming, Spring Boot, etc. We then physically shut off a server (pick host 1 in the metadata broker list and power down the physical hardware). Spark Streaming just could not make progress; it kept getting timeouts and tasks kept failing. (This may be due to our number of topics and partitions, 7 topics and 48 partitions; not sure.)

The fix is pretty simple. Even the latest Spark Streaming still only pulls in kafka-clients 2.6.0. We simply updated the kafka-clients artifact (Maven) to 2.7.2 and set the timeout to something like 3 seconds, and the process keeps running while the node is down. The good news is that kafka-clients generally seems backwards compatible, and even things compiled against kafka-clients 2.0 do not seem to have a problem with a newer kafka-clients swapped in at runtime.

The other mitigation we are working on is introducing GSLB-based round-robin load balancers (using DNS). We assume this will work well, but honestly it somewhat defeats the purpose of metadata.broker.list.

Recap: IMHO, based on what I have seen, I would advise everyone to update their clients to 2.7.2 and set the timeout defined in the JIRA (but your mileage may vary).
And since you're probably patching log4j now anyway, you might as well update your Kafka deps at the same time.

Please discuss if others have seen this issue, or if this is only something that affects me.
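
For anyone who wants to try the timeout setting, here is a minimal sketch of the client properties involved. As far as I can tell, the settings that KAFKA-9893 added (available from kafka-clients 2.7.0) are socket.connection.setup.timeout.ms and socket.connection.setup.timeout.max.ms. The class name, method name, broker addresses and the 3s/10s values below are placeholders for illustration only, so please check the property names against the docs for your client version.

```java
import java.util.Properties;

// Hypothetical helper (not part of kafka-clients): builds client properties
// with a short TCP connection setup timeout so a powered-off host is given up
// on quickly instead of blocking the client.
final class ConnectTimeoutProps {

    static Properties build(String bootstrapServers, long setupTimeoutMs, long setupTimeoutMaxMs) {
        if (setupTimeoutMs <= 0 || setupTimeoutMaxMs < setupTimeoutMs) {
            throw new IllegalArgumentException(
                "need 0 < setupTimeoutMs <= setupTimeoutMaxMs");
        }
        Properties props = new Properties();
        // Broker list the client bootstraps from (placeholder hosts in main below).
        props.setProperty("bootstrap.servers", bootstrapServers);
        // How long the client waits for a TCP connection to be established.
        props.setProperty("socket.connection.setup.timeout.ms", Long.toString(setupTimeoutMs));
        // Upper bound for that wait when repeated connection attempts back off.
        props.setProperty("socket.connection.setup.timeout.max.ms", Long.toString(setupTimeoutMaxMs));
        return props;
    }

    public static void main(String[] args) {
        // Example values only: 3 second setup timeout, capped at 10 seconds.
        Properties props = build("broker1:9092,broker2:9092,broker3:9092", 3_000L, 10_000L);
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties would then be passed to the producer or consumer constructor together with the usual serializer settings. For Spark Streaming the same keys should be usable in the Kafka params map, assuming the kafka-clients jar on the classpath is 2.7.0 or newer.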