Hi,

We are using Kafka 0.10.2 with 2 brokers and 2 application nodes running 6 consumers each (all in one group). Recently both nodes disconnected simultaneously and then retried connecting to the coordinator indefinitely. For now, restarting the nodes solves the problem, but it recurs a few hours later.

In the application log we see a lot of:

11.07.2017 06:47:08,905 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
11.07.2017 06:47:09,007 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
11.07.2017 06:47:09,008 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
11.07.2017 06:47:09,274 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
11.07.2017 06:47:10,820 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
11.07.2017 06:47:10,921 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
11.07.2017 06:47:10,922 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
There is nothing in the logs of the brokers, and we have no problem contacting the coordinator from either node. Could periodic network instability lead to these infinite retries? Could this problem be related to https://issues.apache.org/jira/browse/KAFKA-5464 ?

Here is the configuration of the Streams application (most options are the defaults):

application.id = ABC
application.server =
bootstrap.servers = [kafka-1:9092, kafka-2:9092]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 10485760
client.id =
commit.interval.ms = 30000
connections.max.idle.ms = 540000
key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
metadata.max.age.ms = 300000
num.standby.replicas = 0
num.stream.threads = 6
partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
poll.ms = 100
receive.buffer.bytes = 32768
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 40000
retry.backoff.ms = 100
rocksdb.config.setter = null
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
state.cleanup.delay.ms = 60000
state.dir = null
timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
windowstore.changelog.additional.retention.ms = 86400000
zookeeper.connect = zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181

Any thoughts?

Regards,
Pierre
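For context, here is a minimal sketch of how we could rebuild the non-default parts of this configuration in code, with the backoff settings raised as one possible mitigation for the tight mark-dead/rediscover loop. The raised values are assumptions for illustration, not settings we have validated; the class name is ours for the sketch only.

```java
import java.util.Properties;

// Sketch of the Streams configuration above as a Properties object.
// Property names match the dump; only the two backoff values are changed.
public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // Values taken directly from the configuration dump above.
        props.put("application.id", "ABC");
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
        props.put("num.stream.threads", "6");
        props.put("commit.interval.ms", "30000");
        props.put("request.timeout.ms", "40000");
        // Raised from the defaults of 50 ms and 100 ms (assumed values) so the
        // client does not hammer the coordinator in a sub-second retry loop.
        props.put("reconnect.backoff.ms", "1000");
        props.put("retry.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        System.out.println("retry.backoff.ms = " + props.getProperty("retry.backoff.ms"));
    }
}
```

With larger backoffs the "Marking the coordinator dead" / "Discovered coordinator" pairs should at least be spaced out, which would make it easier to see from the timestamps whether the underlying cause is a periodic network problem.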