[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-09-03 - ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921709#comment-16921709
 ] 

ASF GitHub Bot commented on KAFKA-7931:
---

aravindvs commented on pull request #7288: KAFKA-7931 : [Proposal] Fix metadata 
fetch for ephemeral brokers behind a Virtual IP
URL: https://github.com/apache/kafka/pull/7288
 
 
   If we have ephemeral brokers sitting behind a Virtual IP and all of them go 
down, the client won't be able to reconnect, as described in 
https://issues.apache.org/jira/browse/KAFKA-7931. This is because we take the 
bootstrap nodes and completely forget about them once the first metadata response 
comes in (and then we create a new metadata cache and a new cluster). If all the 
brokers then go down before the metadata is updated, the client will be stuck 
unless it is restarted. 
   
   This patch stores the bootstrap broker list. Instead of giving up when a 
`leastLoadedNode` cannot be found, we fall back to one of the bootstrap nodes to 
fetch the metadata. We also make sure the bootstrap nodes are used only when they 
are not already part of the set of nodes in the cluster.
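   As a rough, self-contained illustration of that fallback (this is not the 
actual patch; the class below and its Node stand-in are hypothetical, and only 
the behavior described above is modeled):
   
{noformat}
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only; not the Kafka NetworkClient/Metadata classes.
final class BootstrapFallbackSketch {

    // Minimal stand-in for org.apache.kafka.common.Node.
    record Node(String host, int port) { }

    private final List<Node> bootstrapNodes;      // remembered from bootstrap.servers
    private List<Node> clusterNodes = List.of();  // refreshed from each metadata response

    BootstrapFallbackSketch(List<Node> bootstrapNodes) {
        this.bootstrapNodes = List.copyOf(bootstrapNodes);
    }

    void onMetadataResponse(List<Node> brokers) {
        // Keep the bootstrap list instead of discarding it once the first response
        // arrives; only the cluster view is replaced.
        this.clusterNodes = List.copyOf(brokers);
    }

    // Node to use for the next metadata request, given the result of leastLoadedNode().
    Node nodeForMetadataRequest(Node leastLoaded) {
        if (leastLoaded != null)
            return leastLoaded;                   // normal path
        // Every known broker is unreachable (e.g. all ephemeral brokers behind the VIP
        // were recreated with new IPs). Fall back to a bootstrap node, but only if the
        // bootstrap addresses are not already part of the cluster view.
        if (!bootstrapNodes.isEmpty() && Collections.disjoint(bootstrapNodes, clusterNodes))
            return bootstrapNodes.get(0);         // a random pick would also work here
        return null;                              // nothing to try; caller backs off and retries
    }
}
{noformat}
   
   With something like this, a producer whose bootstrap.servers points at the VIP 
falls back to the VIP once every broker from the last metadata response has 
disappeared.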
   
   Testing
   
   * Manual testing - set up ephemeral brokers behind a VIP, then recreate all of 
them (so that they change their IPs) and verify that the client reconnects.
   * NetworkClient unit test - test the metadata fetch with a bootstrap node that 
is the same as a node in the cluster, and with one that is different from the 
nodes in the cluster.
   
   Note: This doesn't change any existing system behavior, and this code path 
will be hit only if we are unable to find any `leastLoadedNode`.
 



> Java Client: if all ephemeral brokers fail, client can never reconnect to 
> brokers
> -
>
> Key: KAFKA-7931
> URL: https://issues.apache.org/jira/browse/KAFKA-7931
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Brian
>Priority: Critical
>
> Steps to reproduce:
>  * Set up a Kafka cluster in GKE, with the bootstrap server address configured to 
> point to a load balancer that exposes all GKE nodes
>  * Run a producer that emits values into a partition with 3 replicas
>  * Kill every broker in the cluster
>  * Wait for the brokers to restart
> Observed result:
> The java client cannot find any of the nodes even though they have all 
> recovered. I see messages like "Connection to node 30 (/10.6.0.101:9092) 
> could not be established. Broker may not be available.".
> Note, this is *not* a duplicate of 
> https://issues.apache.org/jira/browse/KAFKA-7890. I'm using the client 
> version that contains the fix for 
> https://issues.apache.org/jira/browse/KAFKA-7890.
> Versions:
> Kafka: version 2.1.0, using the confluentinc/cp-kafka/5.1.0 docker image
> Client: trunk from a few days ago (git sha 
> 9f7e6b291309286e3e3c1610e98d978773c9d504), to pull in the fix for KAFKA-7890
>  





[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-08-30 - Aravind Velamur Srinivasan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919630#comment-16919630
 ] 

Aravind Velamur Srinivasan commented on KAFKA-7931:
---

By the way, I manually tested this multiple times. It is very easy to reproduce 
as Brian described above, and with the patch the clients can discover the new 
brokers without needing a restart. I also see this log line (added by the patch 
in my previous comment, further down this thread) being hit:

{noformat}
log.info("Found bootstrap node for metadata {}", found);
{noformat}




[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-08-30 - Aravind Velamur Srinivasan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919614#comment-16919614
 ] 

Aravind Velamur Srinivasan commented on KAFKA-7931:
---

Finally tracked this one down. It happens because the "bootstrap.servers" config 
is thrown away after the initial startup. Because of this, when all the ephemeral 
brokers fail, the metadata can never be refreshed, since all the brokers in the 
'clusterView' have changed their IPs. The client stays in this state forever 
unless it is restarted! This will be the case for deployments that use a VIP 
(Virtual IP, e.g. a GCP LB IP) to talk to the ephemeral brokers. 

To solve this I think we can simply cache the 'bootstrap servers' in the metadata 
cache, and when we are unable to find the 'leastLoadedNode' for sending the 
metadata request, we can use one of the bootstrap.servers IPs to fetch the 
metadata. 

The gist of the patch is something like this:

{noformat}
$ git diff clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
diff --git a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
index cf823f5..8c51c14 100644
--- a/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
+++ b/kafka/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java
@@ -668,8 +668,15 @@ public class NetworkClient implements KafkaClient {
 
         if (found != null)
             log.trace("Found least loaded node {}", found);
-        else
+        else {
             log.trace("Least loaded node selection failed to find an available node");
+            // instead of giving up, get one of the bootstrap nodes
+            List<Node> bootStrapNodes = this.metadataUpdater.fetchBootStrapNodes();
+            if (bootStrapNodes != null && !bootStrapNodes.isEmpty()) {
+                found = bootStrapNodes.get(0);
+                log.info("Found bootstrap node for metadata {}", found);
+            }
+        }
 
         return found;
     }
@@ -951,6 +958,10 @@ public class NetworkClient implements KafkaClient {
             return metadata.fetch().nodes();
         }
 
+        @Override
+        public List<Node> fetchBootStrapNodes() {
+            return metadata.getBootStrapNodes();
+        }
         @Override
         public boolean isUpdateDue(long now) {
             return !this.hasFetchInProgress() && this.metadata.timeToNextUpdate(now) == 0;
{noformat}

Of course we can do more checks here:
1. Check whether the bootstrap servers and the broker list are different - this 
implies we are behind an ephemeral IP (we can either add a config option, or get 
this from the metadata response and check whether the list of brokers differs 
from the bootstrap list).
2. Check that the bootstrap server is connected and reachable as well. It should 
be, since we initiate a connection to it.
3. Pick a random index rather than always taking the first available bootstrap 
node, similar to the leastLoadedNode logic, if there is more than one VIP in the 
bootstrap.servers list (a rough sketch follows below).
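
As a rough, self-contained sketch of the Metadata-side caching that the 
getBootStrapNodes() call in the diff above assumes (the class and method names 
here are hypothetical, not the actual Kafka Metadata code), together with the 
random pick from point 3:

{noformat}
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch only; the real org.apache.kafka.clients.Metadata class looks different.
final class MetadataBootstrapSketch {

    // Minimal stand-in for org.apache.kafka.common.Node.
    record Node(String host, int port) { }

    // Captured once from bootstrap.servers and never thrown away afterwards.
    private final List<Node> bootStrapNodes;

    MetadataBootstrapSketch(List<Node> bootStrapNodes) {
        this.bootStrapNodes = List.copyOf(bootStrapNodes);
    }

    // What the getBootStrapNodes() call in the diff above could return.
    List<Node> getBootStrapNodes() {
        return bootStrapNodes;
    }

    // Point 3: pick a random bootstrap node instead of always the first one,
    // useful when bootstrap.servers lists more than one VIP.
    Node randomBootStrapNode() {
        if (bootStrapNodes.isEmpty())
            return null;
        return bootStrapNodes.get(ThreadLocalRandom.current().nextInt(bootStrapNodes.size()));
    }
}
{noformat}

The random pick spreads the initial metadata load across the VIPs instead of 
always hitting the first entry.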

I can open a formal PR as well after adding tests and doing some cleanup of the 
patch.



[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-07-15 - Brian (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885489#comment-16885489
 ] 

Brian commented on KAFKA-7931:
--

Would love to see a patch. How do you know you've solved the issue?

Are you going to try to get this merged back into 
[https://github.com/helm/charts/tree/master/incubator/kafka]?

 



[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-07-15 - Sam Weston (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885113#comment-16885113
 ] 

Sam Weston commented on KAFKA-7931:
---

Good news! I've got to the bottom of it!

The fix is to use a DNS name as the advertised listener instead of the Pod IP 
address (in my case the Kubernetes headless service). Now I can restart 
containers as quickly as I like and my Java apps don't get upset.

e.g. 
KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://pulseplatform-dev-kafka-0.pulseplatform-dev-kafka-headless.pulseplatform-dev:9092
where the headless service is called pulseplatform-dev-kafka-headless, the 
namespace is pulseplatform-dev and the pod is called pulseplatform-dev-kafka-0.



[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-07-01 - Brian (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876368#comment-16876368
 ] 

Brian commented on KAFKA-7931:
--

[~cablespaghetti] I haven't resolved this, unfortunately.



[jira] [Commented] (KAFKA-7931) Java Client: if all ephemeral brokers fail, client can never reconnect to brokers

2019-07-01 - Sam Weston (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876222#comment-16876222
 ] 

Sam Weston commented on KAFKA-7931:
---

Have you made any progress with this? I have the same problem if I lose more 
than 1 node every 5 minutes or so, and I haven't worked out how to monitor for 
it yet...

 
