Hi,

We are using Kafka 0.10.2 with 2 brokers and 2 application nodes, each
running 6 consumers (all in one group). Recently, both nodes disconnected
simultaneously and then retried connecting to the coordinator indefinitely.
For now, restarting the nodes solves the problem, but it recurs a few hours
later.
In the application log we see a lot of:
11.07.2017 06:47:08,905 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:09,007 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
ABC.
11.07.2017 06:47:09,008 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
(Re-)joining group ABC
11.07.2017 06:47:09,274 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:09,375 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
ABC.
11.07.2017 06:47:09,375 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
(Re-)joining group ABC
11.07.2017 06:47:10,820 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:631]
Marking the coordinator kafka-2:9092 (id: 2147483646 rack: null) dead for
group ABC
11.07.2017 06:47:10,921 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:586]
Discovered coordinator kafka-2:9092 (id: 2147483646 rack: null) for group
ABC.
11.07.2017 06:47:10,922 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator:420]
(Re-)joining group ABC

There is nothing in the broker logs.
We have no problem contacting the coordinator from either node. Could
periodic network instability be leading to these infinite retries?
Could this problem be related to
https://issues.apache.org/jira/browse/KAFKA-5464 ?

Here is the configuration of the Streams application (most options are defaults):
        application.id = ABC
        application.server =
        bootstrap.servers = [kafka-1:9092, kafka-2:9092]
        buffered.records.per.partition = 1000
        cache.max.bytes.buffering = 10485760
        client.id =
        commit.interval.ms = 30000
        connections.max.idle.ms = 540000
        key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
        metadata.max.age.ms = 300000
        num.standby.replicas = 0
        num.stream.threads = 6
        partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
        poll.ms = 100
        receive.buffer.bytes = 32768
        reconnect.backoff.ms = 50
        replication.factor = 1
        request.timeout.ms = 40000
        retry.backoff.ms = 100
        rocksdb.config.setter = null
        security.protocol = PLAINTEXT
        send.buffer.bytes = 131072
        state.cleanup.delay.ms = 60000
        state.dir = null
        timestamp.extractor = class org.apache.kafka.streams.processor.FailOnInvalidTimestamp
        value.serde = class com.sigfox.kafka.serde.AvroStreamRecordSerde
        windowstore.changelog.additional.retention.ms = 86400000
        zookeeper.connect = zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
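
If the tight mark-dead/rediscover loop turns out to be retry-related, one
thing we are considering on our side is raising the client backoff settings
so the consumers do not hammer the coordinator between rediscoveries. A
minimal sketch, using plain string keys on a Properties object (the raised
values below are assumptions for experimentation, not tested
recommendations):

```java
import java.util.Properties;

public class BackoffTuning {

    // Builds a Properties overlay with more conservative backoff values
    // than our current ones (reconnect.backoff.ms = 50,
    // retry.backoff.ms = 100, as shown in the config dump above).
    public static Properties tunedProps() {
        Properties props = new Properties();
        // Wait longer before reconnecting to a broker after a disconnect.
        // 1000 ms is a guessed value, not a recommendation.
        props.put("reconnect.backoff.ms", "1000");
        // Wait longer between failed coordinator lookups / retried requests.
        // 500 ms is likewise a guess.
        props.put("retry.backoff.ms", "500");
        return props;
    }

    public static void main(String[] args) {
        Properties p = tunedProps();
        System.out.println("reconnect.backoff.ms = " + p.getProperty("reconnect.backoff.ms"));
        System.out.println("retry.backoff.ms = " + p.getProperty("retry.backoff.ms"));
    }
}
```

These keys would be merged into the StreamsConfig properties we already
build at startup; slowing the loop would not fix the root cause, but it
should make the logs and network traffic easier to analyze.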


Any thoughts?
Regards,

Pierre
