
We are using Kafka 0.10.2 with 2 brokers and 2 application nodes, each
running 6 consumers (all in one group). Recently both nodes disconnected
simultaneously and then retried endlessly to connect to the coordinator.
For now, restarting the nodes solves the problem, but it recurs a few
hours later.
In the application log we see many entries like these:
11.07.2017 06:47:08,905 INFO
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:09,007 INFO
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
11.07.2017 06:47:09,008 INFO
(Re-)joining group ABC
11.07.2017 06:47:09,274 INFO
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:09,375 INFO
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
11.07.2017 06:47:09,375 INFO
(Re-)joining group ABC
11.07.2017 06:47:10,820 INFO
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:10,921 INFO
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
11.07.2017 06:47:10,922 INFO
(Re-)joining group ABC
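
A note for reading these lines: the id 2147483646 is probably not a broker
id as such. As far as I know, the 0.10.x consumer gives the coordinator
connection its own node id, computed as Integer.MAX_VALUE minus the broker
id, so that it does not share a connection with ordinary fetch traffic.
The sketch below only does that arithmetic. The class and method names are
made up for illustration, and the derivation rule is an assumption about
client internals, not something shown in the log.

```java
// Hypothetical helper, not part of any Kafka API.
class CoordinatorIdSketch {

    // Assumption: coordinator connection id = Integer.MAX_VALUE - brokerId.
    static int brokerIdFromCoordinatorId(int coordinatorId) {
        return Integer.MAX_VALUE - coordinatorId;
    }

    public static void main(String[] args) {
        // The id taken from the log lines above.
        System.out.println(brokerIdFromCoordinatorId(2147483646)); // prints 1
    }
}
```

If that assumption holds, the coordinator in the log is the broker with
broker.id 1, reachable at kafka-2:9092.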

There is nothing in the broker logs.
We have no problem contacting the coordinator from either node. Could
periodic network instability be what leads to these infinite retries?
Could this problem be related to
https://issues.apache.org/jira/browse/KAFKA-5464 ?
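
For reference, the kind of reachability check meant above can be sketched
as a plain TCP connect with a timeout. This is only an illustration: the
class name, method name and timeout value are hypothetical, and a
successful connect only shows that the port accepts connections, not that
the group coordinator is answering requests correctly.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical probe, not part of the application or of Kafka.
class CoordinatorProbe {

    // Returns true if a TCP connection to host:port opens within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port taken from the log lines; 2000 ms is an arbitrary choice.
        System.out.println("kafka-2:9092 reachable: "
                + canConnect("kafka-2", 9092, 2000));
    }
}
```

Running this in a loop on both application nodes while the retries are
happening would show whether the TCP path itself drops at those moments.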

Here is the Streams configuration (most options are the defaults):
        application.id = ABC
        application.server =
        bootstrap.servers = [kafka-1:9092, kafka-2:9092]
        buffered.records.per.partition = 1000
        cache.max.bytes.buffering = 10485760
        client.id =
        commit.interval.ms = 30000
        connections.max.idle.ms = 540000
        key.serde = class
        metadata.max.age.ms = 300000
        num.standby.replicas = 0
        num.stream.threads = 6
        partition.grouper = class
        poll.ms = 100
        receive.buffer.bytes = 32768
        reconnect.backoff.ms = 50
        replication.factor = 1
        request.timeout.ms = 40000
        retry.backoff.ms = 100
        rocksdb.config.setter = null
        security.protocol = PLAINTEXT
        send.buffer.bytes = 131072
        state.cleanup.delay.ms = 60000
        state.dir = null
        timestamp.extractor = class
        value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
        windowstore.changelog.additional.retention.ms = 86400000
        zookeeper.connect =
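
For anyone trying to reproduce this, the dump above corresponds roughly to
the following properties. This is a minimal sketch, not our actual code:
the class name is made up, only a handful of the listed settings are set
explicitly, and the entries whose class names are cut off in the dump
(key.serde, partition.grouper, timestamp.extractor) are left out rather
than guessed.

```java
import java.util.Properties;

// Hypothetical holder class, for illustration only.
class StreamsConfigSketch {

    static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "ABC");
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092");
        props.put("num.stream.threads", "6");
        props.put("commit.interval.ms", "30000");
        props.put("request.timeout.ms", "40000");
        props.put("reconnect.backoff.ms", "50");
        props.put("retry.backoff.ms", "100");
        props.put("value.serde", "com.sigfox.kafka.serde.AvroStreamRecordSerde");
        // key.serde is truncated in the dump above, so it is not set here.
        return props;
    }

    public static void main(String[] args) {
        // In the application these properties would be handed to Kafka Streams;
        // here they are only printed so the sketch runs without Kafka on the classpath.
        build().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that reconnect.backoff.ms = 50 and retry.backoff.ms = 100 are small
values, which fits the spacing of the log entries above (several attempts
per second).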

Any thoughts?

