Hi,

We ran into a problem where clients crash after we restarted a node (broker 1001) using kill -15 and then started it again. Two of the brokers, including 1001, also can't sync with each other.
Is this a known issue, and if so, is it fixed in later versions?

Details:

We see logs similar to the following being spammed in 1001's log, for each topic for which it is the leader:

[2017-11-08 16:13:14,880] ERROR [ReplicaFetcherThread-0-1002], Error for partition [some-topic,58] to broker 1002:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

Looking at the topic metadata, we see that 1002 cannot seem to sync up with 1001, or the other way around:

Topic: x Partition: 22 Leader: 1002 Replicas: 1002,1001,1006 Isr: 1002,1006
Topic: x Partition: 30 Leader: 1001 Replicas: 1001,1002,1005 Isr: 1005,1001

Producer settings are default.

From the client side I see these logs (we use Samza):

2017-11-08 10:54:55.749 WARN o.a.k.c.producer.internals.Sender [kafka-producer-network-thread | samza_producer] - Got error produce response with correlation id 4788574 on topic-partition some-topic-8, retrying (2147483646 attempts left). Error: NOT_LEADER_FOR_PARTITION
...
2017-11-08 10:55:28.187 WARN o.a.k.c.producer.internals.Sender [kafka-producer-network-thread | samza_producer-job] - Got error produce response with correlation id 4787666 on topic-partition some-topic-8, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
...

(The 2 log lines below are from Samza's Kafka client code.)

2017-11-08 10:55:28.189 ERROR o.a.s.s.kafka.KafkaSystemProducer [kafka-producer-network-thread | samza_producer-job] - Closing the producer because of an exception in callback: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for some-topic-8 due to 44577 ms has passed since batch creation plus linger time
2017-11-08 10:55:30.135 ERROR o.a.s.s.kafka.KafkaSystemProducer [kafka-producer-network-thread | samza_producer-job] - Closing the producer because of an exception in callback: java.lang.IllegalStateException: Producer is closed forcefully.
    at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:513)
    at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:493)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:156)
    at java.lang.Thread.run(Thread.java:748)

We have 6 nodes running 0.10.1.1 with settings:

broker.id.generation.enable=true
delete.topic.enable=true
log.dirs=/data/kafka
num.partitions=60
default.replication.factor=3
min.insync.replicas=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/kafka1
zookeeper.connection.timeout.ms=6000
auto.create.topics.enable=true
broker.rack=us-east-1
num.io.threads=1

Brokers are running OpenJDK 1.8 with JVM settings copied from https://kafka.apache.org/0101/documentation.html#java.

The client is using org.apache.kafka:kafka-clients:jar:0.10.1.1. Producer and consumer settings are default. Topic configs are default.

The load is fairly low. There are 74 topics with 60 partitions each.

Thanks,
Xiaochuan Yu
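
For anyone who wants to list every partition in the state shown in the topic metadata above, here is a minimal sketch. It assumes the metadata lines come from something like `kafka-topics.sh --describe` and have the same "Topic / Partition / Leader / Replicas / Isr" layout as quoted in this post. The function name `out_of_sync` and the regular expression are illustrative only and are not part of Kafka.

```python
import re

# Assumed line layout, taken from the metadata quoted in the post:
#   Topic: x Partition: 22 Leader: 1002 Replicas: 1002,1001,1006 Isr: 1002,1006
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def out_of_sync(describe_output):
    """Return (topic, partition, leader, missing) for each partition
    whose ISR does not contain all of its assigned replicas."""
    results = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip summary lines and anything unrecognized
        replicas = [int(x) for x in m.group("replicas").split(",")]
        isr = {int(x) for x in m.group("isr").split(",") if x}
        missing = [r for r in replicas if r not in isr]
        if missing:
            results.append(
                (m.group("topic"), int(m.group("partition")),
                 int(m.group("leader")), missing)
            )
    return results


# The two partitions quoted in the post.
SAMPLE = """\
Topic: x Partition: 22 Leader: 1002 Replicas: 1002,1001,1006 Isr: 1002,1006
Topic: x Partition: 30 Leader: 1001 Replicas: 1001,1002,1005 Isr: 1005,1001
"""

for topic, partition, leader, missing in out_of_sync(SAMPLE):
    print(topic, partition, "leader", leader, "missing from ISR:", missing)
```

On the two sample lines this reports broker 1001 missing from the ISR of partition 22 and broker 1002 missing from the ISR of partition 30, which matches the pattern described above: each of the two brokers fails to catch up with partitions led by the other.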