Dmitry Mischenko created KAFKA-10075:
----------------------------------------
Summary: Kafka client stucks after Kafka-cluster unavailability
Key: KAFKA-10075
URL: https://issues.apache.org/jira/browse/KAFKA-10075
Project: Kafka
Issue Type: Bug
Components: clients
Affects Versions: 2.4.0
Environment: Kafka v2.3.1 deployed by https://strimzi.io/ to
Kubernetes cluster
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
Reporter: Dmitry Mischenko
We have hit an issue with kafka-client several times.
What happened:
We have Kafka v2.3.1 deployed by [https://strimzi.io/] to a Kubernetes cluster
(Amazon EKS).
# Kafka brokers were unavailable (due to cluster upgrade) and couldn't be
resolved by internal hostnames
```2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-0.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -1 rack: null)
at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:289)
at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:538)
at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:335)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:104)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:444)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
2020-05-28 17:19:50 WARN NetworkClient:962 - [Producer clientId=change_transformer-postgres_101.public.user_storage-9a89f512-43df-4179-a80f-db74f31ac724-StreamThread-1-producer] Error connecting to node data-kafka-dev-kafka-1.data-kafka-dev-kafka-brokers.data-kafka-dev.svc.cluster.local:9092 (id: -2 rack: null)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:955)
at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)
at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)```
# After the cluster was repaired, kafka-admin-client could not restore its
connection. For a long time it did nothing but throw a timeout exception every
120s.
``` 2020-05-28 17:21:14 INFO StreamThread:219 - stream-thread
[consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1]
State transition from CREATED to STARTING
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration
'admin.retry.backoff.ms' was supplied but isn't a known config.
2020-05-28 17:21:14 INFO AppInfoParser:118 - Kafka commitId: 77a89fcf8d7fa018
2020-05-28 17:21:14 INFO AppInfoParser:117 - Kafka version: 2.4.0
2020-05-28 17:21:14 INFO KafkaConsumer:1032 - [Consumer
clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1-consumer,
groupId=consumer_group-101.public.user_storage] Subscribed to pattern:
'postgres_101.public.user_storage'
2020-05-28 17:21:14 INFO KafkaStreams:276 - stream-client
[consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7]
State transition from CREATED to REBALANCING
2020-05-28 17:21:14 INFO StreamThread:664 - stream-thread
[consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-StreamThread-1]
Starting
2020-05-28 17:21:14 INFO AppInfoParser:119 - Kafka startTimeMs: 1590686474110
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration
'schema.registry.url' was supplied but isn't a known config.
2020-05-28 17:21:14 WARN ConsumerConfig:355 - The configuration
'admin.retries' was supplied but isn't a known config.
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node
assignment.
2020-05-28 17:23:11 INFO AdminMetadataManager:238 - [AdminClient
clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin]
Metadata update failed
2020-05-28 17:25:11 INFO AdminMetadataManager:238 - [AdminClient
clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin]
Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send
the call.
2020-05-28 17:27:11 INFO AdminMetadataManager:238 - [AdminClient
clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin]
Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node
assignment.
2020-05-28 17:29:11 INFO AdminMetadataManager:238 - [AdminClient
clientId=consumer_group-101.public.user_storage-714cfbe7-f34a-466a-97e1-bb145f0e34b7-admin]
Metadata update failed
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node
assignment.```
# After an app restart everything works fine.
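The first log excerpt passes through `InetAddress.getAllByName` and `ClientUtils.resolve`, which suggests that the broker hostnames could not be resolved from inside the JVM at that moment. A minimal sketch for checking this from within the application follows. It assumes a comma-separated `host:port` list without IPv6 literals; `BootstrapDnsCheck`, `hostOf` and `unresolvable` are hypothetical names and not part of kafka-clients.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which bootstrap hosts the JVM cannot currently resolve. */
final class BootstrapDnsCheck {

    /** Extracts the host part of a "host:port" entry (assumes no IPv6 literals). */
    static String hostOf(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        return colon < 0 ? hostPort : hostPort.substring(0, colon);
    }

    /** Returns the entries of a comma-separated list whose host fails to resolve. */
    static List<String> unresolvable(String bootstrapServers) {
        List<String> failed = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank entries such as a trailing comma
            }
            try {
                // Same JDK call that appears in the stack trace above.
                InetAddress.getAllByName(hostOf(trimmed));
            } catch (UnknownHostException e) {
                failed.add(trimmed);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // The default value is a placeholder; pass the real bootstrap list as the first argument.
        String servers = args.length > 0 ? args[0] : "localhost:9092";
        System.out.println("Unresolvable bootstrap servers: " + unresolvable(servers));
    }
}
```

Because the lookup goes through the JVM's own `InetAddress` layer, including its address cache, the result reflects what this JVM resolves rather than what an external tool such as `nslookup` reports.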
The problem is twofold: we cannot catch this exception to detect the failure
and restart the app automatically, and the client does not self-heal in this
situation.
Why could this happen and
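Since the timeout is only logged and never reaches application code, one possible workaround is an application-level watchdog that triggers a restart after several failed health probes in a row. The sketch below is not a fix for the underlying client behaviour. `ConsecutiveFailureWatchdog` is a hypothetical class, and the probe is an assumption: it could, for example, test `streams.state() == KafkaStreams.State.RUNNING`, on the unverified assumption that a stuck application never reaches `RUNNING`.

```java
import java.util.concurrent.Callable;

/** Hypothetical watchdog: trips once a probe has failed N times in a row. */
final class ConsecutiveFailureWatchdog {
    private final Callable<Boolean> probe;
    private final int maxFailures;
    private int consecutiveFailures = 0;

    ConsecutiveFailureWatchdog(Callable<Boolean> probe, int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.probe = probe;
        this.maxFailures = maxFailures;
    }

    /** Runs one probe; returns true once maxFailures probes in a row have failed. */
    boolean checkOnce() {
        boolean healthy;
        try {
            healthy = Boolean.TRUE.equals(probe.call());
        } catch (Exception e) {
            healthy = false; // a thrown exception (for example a timeout) counts as a failure
        }
        // Any healthy probe resets the counter, so only an unbroken run of failures trips it.
        consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= maxFailures;
    }

    public static void main(String[] args) {
        // Stand-in probe that always fails. A real probe is an assumption, e.g.
        //   () -> streams.state() == KafkaStreams.State.RUNNING
        ConsecutiveFailureWatchdog watchdog = new ConsecutiveFailureWatchdog(() -> false, 3);
        for (int i = 1; i <= 3; i++) {
            System.out.println("check " + i + " tripped=" + watchdog.checkOnce());
        }
    }
}
```

In a Kubernetes deployment, `checkOnce()` could be scheduled periodically and the process could exit once it returns true, leaving the restart to the pod's restart policy. The probe should observe the state of the stuck client itself: a freshly created client may connect without trouble, just as the restarted app does.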