Discuss: outage impact of not updating kafka clients to a version that has KAFKA-9893

Edward Capriolo Sun, 23 Jan 2022 04:52:16 -0800

Hello,
This thread is mainly to discuss
[KAFKA-9893] Configurable TCP connection timeout and improve the initial
metadata fetch - ASF JIRA (apache.org)
<https://issues.apache.org/jira/browse/KAFKA-9893>. IMHO it should be
considered a BUG FIX and potentially be backported, but others may
disagree. Let me tell you about my environment:


We run Kafka 2.2.x, 2.5.x and even 2.6.x. We have clients using
spark-streaming, spring boot data kafka, and folks just using the Kafka
producer directly. Clusters are 3-12 nodes, with 10 topics and 48 partitions.

We actually have a chaos monkey in our UAT environment that shuts down
brokers and entire datacenters/racks of brokers, and it hits our clusters
each day, but simply shutting down brokers does not produce the problem.

We observed: there is a huge difference between the Kafka broker being
down while the host is up, and Kafka being down because the *host is down.*

Before KAFKA-9893 the second case is handled poorly. If you look at how
the metadata connection works, it randomly picks hosts from the list, and
sometimes requires 2 random hosts for round trip operations.
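
The difference between the two cases can be seen outside Kafka with a plain
TCP connect. When the host is up but nothing listens on the port, the OS
answers right away and the connect fails fast. When the host itself is down,
nothing answers and the connect blocks until a timeout expires. The sketch
below is only an illustration under those assumptions: the class name
ConnectProbe, the default host/port and the 3 second timeout are made up for
this example and are not part of kafka-clients.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper, not a Kafka API: classifies one TCP connect attempt.
final class ConnectProbe {
    enum Result { CONNECTED, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Result probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Blocks until connected, rejected, or timeoutMs has elapsed.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Result.CONNECTED;
        } catch (SocketTimeoutException e) {
            // Nobody answered in time: typical when the host is down.
            return Result.TIMED_OUT;
        } catch (ConnectException e) {
            // Typically "connection refused": host up, no process on the port.
            return Result.REFUSED;
        } catch (IOException e) {
            // Unknown host, no route to host, and similar failures.
            return Result.OTHER_ERROR;
        }
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass a real broker host and port.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        long start = System.nanoTime();
        Result result = probe(host, port, 3000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(host + ":" + port + " -> " + result
                + " after " + elapsedMs + " ms");
    }
}
```

Pointed at a broker whose process is stopped, this should usually report
REFUSED almost immediately. Pointed at a powered-off host, it should sit for
the full timeout, which is the wait a client pays each time it picks that
host.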

Here is what we did. We had all our apps running: spark streaming, spring
boot, etc. We physically shut off a server (pick host 1 in the metadata
broker list and shut down the physical hardware). Spark streaming just
could not make progress: it got frequent timeouts and tasks kept failing.
(This may be due to our number of topics and partitions, 7 topics and 48
partitions; not sure.)

The fix is pretty simple. Even the latest spark streaming still only
depends on kafka-clients-2.6.0. We simply updated the kafka-clients
artifact (maven) to 2.7.2 and set the timeout to something like 3 seconds,
and the process keeps running while a node is down. The good news is that
kafka-clients generally seems backwards compatible, and even things
compiled against kafka-clients 2.0 do not seem to have a problem with a
newer kafka-clients swapped in at runtime.
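
A minimal sketch of the client-side setting, assuming the timeout in
question is the socket.connection.setup.timeout.ms property that KAFKA-9893
added in kafka-clients 2.7.0, together with its companion
socket.connection.setup.timeout.max.ms. The class name, the broker
hostnames and the 10 second maximum are illustrative assumptions; only the
3 second value comes from the description above.

```java
import java.util.Properties;

// Hypothetical helper: layers the connection setup timeouts on top of an
// existing client configuration without touching the original object.
final class ClientTimeoutConfig {
    static Properties withConnectionSetupTimeout(
            Properties base, long timeoutMs, long maxTimeoutMs) {
        // Sanity check chosen for this sketch, not taken from the jira.
        if (timeoutMs <= 0 || maxTimeoutMs < timeoutMs) {
            throw new IllegalArgumentException(
                    "need 0 < timeoutMs <= maxTimeoutMs");
        }
        Properties props = new Properties();
        props.putAll(base);
        // How long to wait for the TCP connection to a broker to be set up.
        props.setProperty("socket.connection.setup.timeout.ms",
                Long.toString(timeoutMs));
        // Upper bound the wait can grow to after repeated failures.
        props.setProperty("socket.connection.setup.timeout.max.ms",
                Long.toString(maxTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        // Placeholder hostnames, replace with the real broker list.
        base.setProperty("bootstrap.servers",
                "broker1.example:9092,broker2.example:9092");
        Properties props = withConnectionSetupTimeout(base, 3000L, 10000L);
        System.out.println("socket.connection.setup.timeout.ms="
                + props.getProperty("socket.connection.setup.timeout.ms"));
    }
}
```

The resulting Properties would be passed to the KafkaProducer or
KafkaConsumer constructor as usual. Clients older than 2.7.0 do not know
these keys, so the setting only has an effect once the kafka-clients jar
itself has been updated.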

The other mitigation we are doing is introducing GSLB-based round-robin
load balancers (using DNS). We assume this will work well, but honestly it
somewhat defeats the purpose of metadata.broker.list.

Recap: IMHO, based on what I have seen, I would advise everyone to update
their clients to 2.7.2 and set the timeout defined in the jira (but your
mileage may vary). And since you're probably patching log4j now anyway,
you might as well update the kafka deps at the same time.

Please discuss if others have seen this issue, or if this is only something
that affects me.
