More information about the issue:

When the issue happens, the controller is always on a 0.9 Kafka broker. In the server.log of the other brokers, we can see this kind of error:

[2016-03-23 22:36:02,814] ERROR [ReplicaFetcherThread-0-5], Error for partition [topic,208] to broker 5:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

After restarting that controller, everything works again.

On Tue, Mar 22, 2016 at 6:14 PM, Qi Xu <shkir...@gmail.com> wrote:

> Hi folks, Rajiv, Jun,
> I'd like to bring up this thread again from Rajiv Kurian 3 months ago.
> Basically we did the same thing as Rajiv did. I upgraded two machines (out
> of 10) from 0.8.2.1 to 0.9. So after the upgrade, there are 2 machines
> on 0.9 and 8 machines on 0.8.2.1. Initially it all works fine. But
> after about 2 hours, all old uploaders and consumers are broken because no
> leader is found for any partition of any topic. The producer just complains
> "unknown error for topic xxx when it tries to refresh the metadata". And on
> the server side there are errors complaining about no leader for a partition.
> I'm wondering whether there is any known issue with 0.9 and 0.8.2 brokers
> co-existing in the same cluster. Thanks a lot.
>
>
> Below is the original thread:
>
> We had to revert to 0.8.3 because three of our topics seem to have gotten
> corrupted during the upgrade. As soon as we did the upgrade, producers to
> the three topics I mentioned stopped being able to do writes. The clients
> complained (occasionally) about leader not found exceptions. We restarted
> our clients and brokers but that didn't seem to help. Actually even after
> reverting to 0.8.3 these three topics were broken. To fix it we had to stop
> all clients, delete the topics, create them again and then restart the
> clients.
>
> I realize this is not a lot of info. I couldn't wait to get more debug info
> because the cluster was actually being used. Has anyone run into something
> like this? Are there any known issues with old consumers/producers? The
> topics that got busted had clients writing to them using the old Java
> wrapper over the Scala producer.
>
> Here are the steps I took to upgrade.
>
> For each broker:
>
> 1. Stop the broker.
> 2. Restart with the *0.9* broker running with
> inter.broker.protocol.version=*0.8.2*.X
> 3. Wait for under replicated partitions to go down to 0.
> 4. Go to step 1.
>
> Once all the brokers were running the *0.9* code with
> inter.broker.protocol.version=*0.8.2*.X we restarted them one by one with
> inter.broker.protocol.version=0.9.0.0
>
> When reverting I did the following.
>
> For each broker:
>
> 1. Stop the broker.
> 2. Restart with the *0.9* broker running with
> inter.broker.protocol.version=*0.8.2*.X
> 3. Wait for under replicated partitions to go down to 0.
> 4. Go to step 1.
>
> Once all the brokers were running *0.9* code with
> inter.broker.protocol.version=*0.8.2*.X I restarted them one by one with
> the 0.8.2.3 broker code. This however, like I mentioned, did not fix the
> three broken topics.
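
For reference, the per-broker loop quoted above (stop, restart, wait for under-replicated partitions to reach 0, move to the next broker) can be sketched roughly as below. This is only an illustration of the procedure as described in the thread, not a fix for the leader problem. The names `rolling_restart`, `restart_broker` and `under_replicated_count` are hypothetical: you would supply your own callables, for example one that restarts a broker with the desired inter.broker.protocol.version and one that reads the UnderReplicatedPartitions metric from wherever you monitor it.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    brokers: Iterable[str],
    restart_broker: Callable[[str], None],      # hypothetical: stops and restarts one broker
    under_replicated_count: Callable[[], int],  # hypothetical: current under-replicated partition count
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart brokers one at a time, waiting for under-replicated
    partitions to drop to 0 before moving on to the next broker."""
    finished: List[str] = []
    for broker in brokers:
        # Steps 1 and 2: stop the broker and bring it back up.
        restart_broker(broker)

        # Step 3: wait until the cluster reports no under-replicated partitions.
        waited = 0.0
        while under_replicated_count() > 0:
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"under-replicated partitions did not reach 0 after restarting {broker}"
                )
            sleep(poll_seconds)
            waited += poll_seconds

        # Step 4: continue with the next broker.
        finished.append(broker)
    return finished
```

The timeout is an addition of this sketch, not part of the quoted steps; it stops the loop from touching further brokers if replication never catches up.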