There are indeed some known issues in the Controller that require care to avoid. Onur has recently contributed a PR that simplifies the concurrency model of the Controller:
https://github.com/apache/kafka/commit/bb663d04febcadd4f120e0ff5c5919ca8bf7e971

This is a good first step and will be part of 0.11.0.0. The next step will be
to fix the session expiration issues. It's a non-trivial amount of work, so the
current target is the feature release after 0.11.0.0.

Ismael

On Fri, Apr 28, 2017 at 8:30 PM, Michal Borowiecki
<michal.borowie...@openbet.com> wrote:

> Hi James,
>
> This "Cached zkVersion [x] not equal to that in zookeeper" issue bit us
> once in production and I found these tickets to be relevant:
> KAFKA-2729 <https://issues.apache.org/jira/browse/KAFKA-2729>
> KAFKA-3042 <https://issues.apache.org/jira/browse/KAFKA-3042>
> KAFKA-3083 <https://issues.apache.org/jira/browse/KAFKA-3083>
> Unfortunately, I don't believe there is a fix for it yet, or in the making.
>
> Thanks,
> Michał
>
>
> On 28/04/17 19:26, James Brown wrote:
>
> For what it's worth, shutting down the entire cluster and then restarting
> it did address this issue.
>
> I'd love anyone's thoughts on what the "correct" fix would be here.
>
> On Fri, Apr 28, 2017 at 10:58 AM, James Brown <jbr...@easypost.com> wrote:
>
>
> The following is also appearing in the logs a lot, if anyone has any ideas:
>
> INFO Partition [easypost.syslog,7] on broker 1: Cached zkVersion [647] not
> equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
>
> On Fri, Apr 28, 2017 at 10:43 AM, James Brown <jbr...@easypost.com> wrote:
>
>
> We're running 0.10.1.0 on a five-node cluster.
>
> I was in the process of migrating some topics from having 2 replicas to
> having three replicas when two of the five machines in this cluster crashed
> (brokers 2 and 3).
>
> After restarting them, all of the topics that were previously assigned to
> them are unavailable and showing "Leader: -1".
>
> Example kafka-topics output:
>
> % kafka-topics.sh --zookeeper $ZK_HP --describe --unavailable-partitions
> Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
> Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
> Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
> Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
> Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
> Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
> Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
> Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
> Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:
>
> Note that I wasn't even moving any of the __consumer_offsets partitions,
> so I'm not sure if the fact that a reassignment was in progress is a red
> herring or not.
>
> The logs are full of
>
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
> to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
> server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
> ERROR [ReplicaFetcherThread-0-3], Error for partition
> [epostg.request_log_v1,0] to broker
> 3:org.apache.kafka.common.errors.UnknownServerException:
> The server experienced an unexpected error when processing the request
> (kafka.server.ReplicaFetcherThread)
>
> What can I do to fix this? Should I manually reassign all partitions
> that were led by brokers 2 or 3 to only have whatever the third broker was
> in their replica-set as their replica set? Do I need to temporarily enable
> unclean elections?
>
> I've never seen a cluster fail this way...
>
> --
> James Brown
> Engineer
>
>
> --
> James Brown
> Engineer
>
>
>
> --
> Michal Borowiecki
> Senior Software Engineer L4
> T: +44 208 742 1600
>    +44 203 249 8448
> E: michal.borowie...@openbet.com
> W: www.openbet.com
> OpenBet Ltd
> Chiswick Park Building 9
> 566 Chiswick High Rd
> London
> W4 5XT
> UK
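
For anyone triaging a similar failure, the `kafka-topics.sh --describe --unavailable-partitions` output quoted in this thread can be summarized per broker with a small script. The following is only a sketch: the function names (`parse_describe`, `leaderless_by_broker`) and the regular expression are illustrative assumptions based on the line format shown in James's message, not part of any Kafka tooling, and the sample data is copied from three of the lines he posted.

```python
import re
from collections import Counter

# Assumed line format, taken from the kafka-topics output quoted in the thread:
#   Topic: <name> Partition: <n> Leader: <id> Replicas: a,b,c Isr: [x,y]
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(text):
    """Parse partition lines; lines that do not match the format are skipped."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        rows.append({
            "topic": m.group("topic"),
            "partition": int(m.group("partition")),
            "leader": int(m.group("leader")),
            "replicas": [int(b) for b in m.group("replicas").split(",") if b],
            "isr": [int(b) for b in m.group("isr").split(",") if b],
        })
    return rows


def leaderless_by_broker(rows):
    """Count, per broker id, how many leaderless partitions list it as a replica."""
    counts = Counter()
    for row in rows:
        if row["leader"] == -1:
            counts.update(row["replicas"])
    return counts


# Three lines copied from the output quoted in the thread.
SAMPLE = """\
% kafka-topics.sh --zookeeper $ZK_HP --describe --unavailable-partitions
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
"""

if __name__ == "__main__":
    for broker, n in sorted(leaderless_by_broker(parse_describe(SAMPLE)).items()):
        print(f"broker {broker}: replica of {n} leaderless partition(s)")
```

A broker that appears in the replica list of every leaderless partition is a natural place to start looking. This does not by itself say whether the reassignment was involved or whether unclean election is needed.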