[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885113#comment-16885113 ]
Sam Weston (cablespaghetti) edited comment on KAFKA-7931 at 7/15/19 11:40 AM:
-------------------------------------------------------------

Good news! I've got to the bottom of it! The fix is to use a DNS name as the advertised listener instead of the Pod IP address (in my case, the DNS name provided by the Kubernetes headless service). Now I can restart containers as quickly as I like and my Java apps don't get upset.

e.g.

KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://pulseplatform-dev-kafka-0.pulseplatform-dev-kafka-headless.pulseplatform-dev:9092

where the headless service is called pulseplatform-dev-kafka-headless, my namespace is pulseplatform-dev and the pod is called pulseplatform-dev-kafka-0.

If you're using the incubator Helm chart, let me know and I'll provide more details of my values file.

> Java Client: if all ephemeral brokers fail, client can never reconnect to
> brokers
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-7931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7931
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0
>            Reporter: Brian
>            Priority: Critical
>
> Steps to reproduce:
>  * Set up a Kafka cluster in GKE, with the bootstrap server address configured to
> point to a load balancer that exposes all GKE nodes
>  * Run a producer that emits values into a partition with 3 replicas
>  * Kill every broker in the cluster
>  * Wait for the brokers to restart
> Observed result:
> The Java client cannot find any of the nodes even though they have all
> recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092)
> could not be established. Broker may not be available.".
> Note, this is *not* a duplicate of
> https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client
> version that contains the fix for
> https://issues.apache.org/jira/browse/KAFKA-7890.
> Versions:
> Kafka: kafka version 2.1.0, using confluentinc/cp-kafka/5.1.0 docker image
> Client: trunk from a few days ago (git sha
> 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
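
The advertised listener in the comment follows the pattern <pod>.<headless service>.<namespace>:<port>. As a rough illustration only, here is a minimal Python sketch that assembles that value from its three parts. The function name and its parameters are hypothetical and are not part of Kafka or of any Helm chart; the default port and protocol are simply the ones used in the example from the comment.

```python
def advertised_listener(pod, headless_service, namespace,
                        port=9092, protocol="PLAINTEXT"):
    """Build an advertised listener from a per-pod DNS name.

    Hypothetical helper: it only joins the pod name, headless service
    name and namespace in the order used in the comment's example.
    """
    return f"{protocol}://{pod}.{headless_service}.{namespace}:{port}"


# Values taken from the comment's example.
listener = advertised_listener(
    pod="pulseplatform-dev-kafka-0",
    headless_service="pulseplatform-dev-kafka-headless",
    namespace="pulseplatform-dev",
)
print(f"KAFKA_ADVERTISED_LISTENERS={listener}")
```

Because the name is derived from the pod name rather than the Pod IP, it stays the same when a broker pod is restarted, which is the property the fix relies on.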