It was on our test environment, and nothing was running when the incident occurred. In the server log we see many occurrences of:

[2017-07-11 11:52:15,330] WARN Attempting to send response via channel for which there is no open connection, connection id 0 (kafka.network.Processor)

But the timestamps don't match, so I don't know whether it's correlated.
On Tue, Jul 11, 2017 at 1:08 PM, John Yost <hokiege...@gmail.com> wrote:
> Hi Pierre,
>
> Do your brokers remain responsive? In other words, do you see any other
> symptoms, such as decreased write or read throughput, which may indicate
> long GC pauses or possibly heavy load on your ZooKeeper cluster, as
> evidenced by SocketTimeoutExceptions on the Kafka and/or ZooKeeper sides?
>
> --John
>
> On Tue, Jul 11, 2017 at 6:15 AM, Pierre Coquentin <pierre.coquen...@gmail.com> wrote:
>
> > Hi,
> >
> > We are using Kafka 0.10.2 with 2 brokers and 2 application nodes, each
> > running 6 consumers (all in one group). Recently both nodes disconnected
> > simultaneously and then retried connecting to the coordinator
> > indefinitely. For now, restarting the nodes solves the problem, but it
> > recurs a few hours later.
> >
> > In the application log we see a lot of:
> >
> > 11.07.2017 06:47:08,905 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
> > 11.07.2017 06:47:09,007 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
> > 11.07.2017 06:47:09,008 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
> > 11.07.2017 06:47:09,274 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
> > 11.07.2017 06:47:09,375 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
> > 11.07.2017 06:47:10,820 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631] Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for group ABC
> > 11.07.2017 06:47:10,921 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586] Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group ABC.
> > 11.07.2017 06:47:10,922 INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420] (Re-)joining group ABC
> >
> > There is nothing in the brokers' logs.
> > We have no problem contacting the coordinator from either node. Could a
> > periodic network instability lead to these infinite retries? Could this
> > problem be related to
> > https://issues.apache.org/jira/browse/KAFKA-5464 ?
> >
> > Here is the configuration of the Streams application (most options are
> > the defaults):
> >
> > application.id = ABC
> > application.server =
> > bootstrap.servers = [kafka-1:9092, kafka-2:9092]
> > buffered.records.per.partition = 1000
> > cache.max.bytes.buffering = 10485760
> > client.id =
> > commit.interval.ms = 30000
> > connections.max.idle.ms = 540000
> > key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
> > metadata.max.age.ms = 300000
> > num.standby.replicas = 0
> > num.stream.threads = 6
> > partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
> > poll.ms = 100
> > receive.buffer.bytes = 32768
> > reconnect.backoff.ms = 50
> > replication.factor = 1
> > request.timeout.ms = 40000
> > retry.backoff.ms = 100
> > rocksdb.config.setter = null
> > security.protocol = PLAINTEXT
> > send.buffer.bytes = 131072
> > state.cleanup.delay.ms = 60000
> > state.dir = null
> > timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
> > value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
> > windowstore.changelog.additional.retention.ms = 86400000
> > zookeeper.connect = zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
> >
> > Any thoughts?
> >
> > Regards,
> >
> > Pierre
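For anyone reproducing this, a configuration like the one quoted above is typically assembled in client code as a java.util.Properties map and passed to the Streams constructor. Below is a minimal sketch of that, not Pierre's actual code: it uses only plain Properties and string keys (so it compiles without the Kafka jars), mirrors a few of the values posted above, and the class name is a hypothetical placeholder. In a real 0.10.2 application you would hand these properties to new KafkaStreams(builder, new StreamsConfig(props)).

```java
import java.util.Properties;

// Hypothetical sketch: builds a Properties map mirroring part of the
// configuration quoted in the thread. Only stdlib types are used here;
// in a real application these keys would be consumed by StreamsConfig.
public class StreamsConfigSketch {

    public static Properties build() {
        Properties props = new Properties();
        // application.id doubles as the consumer group id ("ABC" in the thread)
        props.put("application.id", "ABC");
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
        // 6 stream threads per node, matching the 6 consumers per node reported
        props.put("num.stream.threads", "6");
        props.put("commit.interval.ms", "30000");
        props.put("request.timeout.ms", "40000");
        props.put("replication.factor", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println(props.getProperty("application.id"));
        System.out.println(props.getProperty("num.stream.threads"));
    }
}
```

Note that with replication.factor = 1, the internal changelog and repartition topics Streams creates are unreplicated, so a single broker outage makes them unavailable; that is unrelated to the coordinator loop but worth flagging for a 2-broker cluster.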