[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921709#comment-16921709 ]

ASF GitHub Bot commented on KAFKA-7931:
---------------------------------------

aravindvs commented on pull request #7288: KAFKA-7931: [Proposal] Fix metadata fetch for ephemeral brokers behind a Virtual IP
URL: https://github.com/apache/kafka/pull/7288

If we have ephemeral brokers sitting behind a Virtual IP (VIP) and all of the brokers go down, the client can never reconnect, as described in https://issues.apache.org/jira/browse/KAFKA-7931. This happens because the client uses the bootstrap nodes only once: as soon as the first metadata response arrives, it builds a new metadata cache and cluster view and forgets the bootstrap list. If all the brokers then go down before the metadata is refreshed, the client is stuck until it is restarted.

This patch simply stores the bootstrap broker list. Instead of giving up when no 'leastLoadedNode' can be found, the client falls back to one of the bootstrap nodes to fetch metadata. We can also restrict the fallback to the case where the bootstrap node is not already part of the set of nodes in the cluster.

Testing:
* Manual testing - set up ephemeral brokers behind a VIP, then recreate all of them (so that they change their IPs).
* NetworkClient unit test - test the metadata fetch with a bootstrap node that is the same as a node in the cluster, and with one that is different.

Note: this doesn't change any existing system behavior; the new code path is hit only when no `leastLoadedNode` can be found.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-7931
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7931
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.1.0
>            Reporter: Brian
>            Priority: Critical
>
> Steps to reproduce:
> * Set up a Kafka cluster in GKE, with the bootstrap server address configured to point to a load balancer that exposes all GKE nodes
> * Run a producer that emits values into a partition with 3 replicas
> * Kill every broker in the cluster
> * Wait for the brokers to restart
>
> Observed result:
> The Java client cannot find any of the nodes, even though they have all recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092) could not be established. Broker may not be available.".
>
> Note: this is *not* a duplicate of https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client version that contains the fix for https://issues.apache.org/jira/browse/KAFKA-7890.
>
> Versions:
> Kafka: version 2.1.0, using the confluentinc/cp-kafka/5.1.0 docker image
> Client: trunk from a few days ago (git sha 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919630#comment-16919630 ]

Aravind Velamur Srinivasan commented on KAFKA-7931:
---------------------------------------------------

By the way, I manually tested this multiple times. It is very easy to reproduce as Brian described above. With the patch, the clients can discover the new brokers without needing a restart, and I also see this log line (from the patch above) being hit:
{noformat}
+log.info("Found bootstrap node for metadata {}", found);
{noformat}
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919614#comment-16919614 ]

Aravind Velamur Srinivasan commented on KAFKA-7931:
---------------------------------------------------

Finally tracked this one down. It happens because the "bootstrap.servers" config is thrown away after the initial startup. As a result, once all the ephemeral brokers fail, the metadata can never be refreshed, because every broker in the cluster view has changed its IP. The client stays stuck forever unless it is restarted! This affects any deployment that uses a VIP (Virtual IP - say, a GCP load-balancer IP) to talk to ephemeral brokers.

To solve this, I think we can simply cache the bootstrap servers in the metadata cache, and when we are unable to find a 'leastLoadedNode' for sending the metadata request, use one of the bootstrap.servers addresses to fetch the metadata. The gist of the patch is something like this:

{noformat}
$ git diff clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
diff --git a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
index cf823f5..8c51c14 100644
--- a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
+++ b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
@@ -668,8 +668,15 @@ public class NetworkClient implements KafkaClient {
         if (found != null)
             log.trace("Found least loaded node {}", found);
-        else
+        else {
             log.trace("Least loaded node selection failed to find an available node");
+            // instead of giving up, get one of the bootstrap nodes
+            List<Node> bootStrapNodes = this.metadataUpdater.fetchBootStrapNodes();
+            if (bootStrapNodes != null && !bootStrapNodes.isEmpty()) {
+                found = bootStrapNodes.get(0);
+                log.info("Found bootstrap node for metadata {}", found);
+            }
+        }
         return found;
     }
@@ -951,6 +958,10 @@ public class NetworkClient implements KafkaClient {
         return metadata.fetch().nodes();
     }
+    @Override
+    public List<Node> fetchBootStrapNodes() {
+        return metadata.getBootStrapNodes();
+    }
     @Override
     public boolean isUpdateDue(long now) {
         return !this.hasFetchInProgress() && this.metadata.timeToNextUpdate(now) == 0;
{noformat}

Of course we can do more checks here:
1. Check whether the bootstrap servers differ from the broker list - that implies we are behind an ephemeral IP (we could either add a config option, or derive this from the metadata response by checking whether the list of brokers differs from the bootstrap list).
2. Check whether the bootstrap server is connected and reachable. It should be, since we initiate a connection to it.
3. Pick a random index rather than the first available bootstrap node, similar to the leastLoadedNode logic, in case there is more than one VIP in the bootstrap.servers list.

I can open a formal PR after adding tests and doing some cleanup of the patch.
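Checks (1) and (3) above could be sketched roughly as follows. This is a stand-alone illustrative sketch under stated assumptions, not Kafka's actual NetworkClient code; the class name `BootstrapFallback`, the method `chooseNode`, and the string-based node representation are all hypothetical:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

// Illustrative sketch of the proposed fallback: when least-loaded-node
// selection fails, pick a *random* bootstrap node (check 3), but only if that
// bootstrap address is not already part of the cluster view (check 1).
// All names here are hypothetical, not Kafka's actual API.
public class BootstrapFallback {
    private final List<String> bootstrapNodes;

    public BootstrapFallback(List<String> bootstrapNodes) {
        this.bootstrapNodes = bootstrapNodes;
    }

    /**
     * @param leastLoaded  the node chosen by the normal least-loaded logic, or null if none was found
     * @param clusterNodes the brokers currently known from metadata
     * @return the node to use for the metadata request, or null if there is no candidate
     */
    public String chooseNode(String leastLoaded, Set<String> clusterNodes) {
        if (leastLoaded != null)
            return leastLoaded;  // normal path: bootstrap list is never consulted
        if (bootstrapNodes == null || bootstrapNodes.isEmpty())
            return null;
        // Check 1: only fall back to addresses that are *not* already cluster
        // members; if they are, we are not behind an ephemeral VIP and the
        // fallback would just retry a broker we already failed to reach.
        List<String> candidates = bootstrapNodes.stream()
                .filter(n -> !clusterNodes.contains(n))
                .collect(Collectors.toList());
        if (candidates.isEmpty())
            return null;
        // Check 3: randomize the choice, mirroring leastLoadedNode's random
        // starting offset, so multiple VIP entries share the load.
        return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
    }
}
```

For example, with a single VIP entry such as "kafka-vip:9092" that is absent from the cluster view, `chooseNode(null, clusterNodes)` would return the VIP; whenever least-loaded selection succeeds, the bootstrap list is never consulted, matching the "only hit when no leastLoadedNode is found" property of the patch.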
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885489#comment-16885489 ]

Brian commented on KAFKA-7931:
------------------------------

Would love to see a patch. How do you know you've solved the issue? Are you going to try to get this merged back into https://github.com/helm/charts/tree/master/incubator/kafka ?
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885113#comment-16885113 ]

Sam Weston commented on KAFKA-7931:
-----------------------------------

Good news! I've got to the bottom of it! The fix is to use a DNS name as the advertised listener instead of the Pod IP address (in my case, the Kubernetes headless service). Now I can restart containers as quickly as I like and my Java apps don't get upset.

For example:
{noformat}
KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://pulseplatform-dev-kafka-0.pulseplatform-dev-kafka-headless.pulseplatform-dev:9092
{noformat}
where the headless service is called pulseplatform-dev-kafka-headless, the namespace is pulseplatform-dev, and the pod is called pulseplatform-dev-kafka-0.
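The advertised-listener value above follows the standard DNS pattern for a pod behind a Kubernetes headless service, `<pod>.<service>.<namespace>:<port>`. A generic sketch of assembling it, with hypothetical names (kafka-0, kafka-headless, dev - substitute your own deployment's names):

```shell
# Hypothetical names: the real pod/service/namespace names come from your
# own deployment. The headless-service DNS pattern is
# <pod>.<service>.<namespace>:<port>, which stays stable across pod restarts.
POD=kafka-0
SERVICE=kafka-headless
NAMESPACE=dev
KAFKA_ADVERTISED_LISTENERS="PLAINTEXT://${POD}.${SERVICE}.${NAMESPACE}:9092"
echo "$KAFKA_ADVERTISED_LISTENERS"
```

Because the DNS name (unlike the Pod IP) survives a container restart, clients that cached it in metadata can still resolve and reconnect to the new broker instance.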
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876368#comment-16876368 ]

Brian commented on KAFKA-7931:
------------------------------

[~cablespaghetti] I haven't resolved this, unfortunately.
[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers
[ https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876222#comment-16876222 ]

Sam Weston commented on KAFKA-7931:
-----------------------------------

Have you made any progress with this? I have the same problem if I lose more than one node every five minutes or so, and I haven't worked out how to monitor for it yet...