Hey,

Changing the broker IDs made the existing metadata in ZK irrelevant: ZK uses the broker ID to identify brokers and "associate" them with the partition placement (i.e. where the data for each partition lives), so it still expects your data to be on the "old" brokers, while the brokers connecting now are, from its point of view, completely new (new IDs). I think this is the main problem here.
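If you want to double-check that this is what happened, you can look at the metadata directly in ZK with the zookeeper-shell.sh that ships with Kafka. A minimal sketch (the znode paths are the standard Kafka ones, but the host placeholder and the IDs in the comments are made up for illustration):

---
# brokers currently registered in ZK - these will be the *new* IDs
bin/zookeeper-shell.sh <zk-host>:2181 ls /brokers/ids

# replica assignment for one of the broken topics - this will still reference
# the *old* IDs, e.g. something like {"version":1,"partitions":{"0":[1],"1":[2],"2":[3]}}
bin/zookeeper-shell.sh <zk-host>:2181 get /brokers/topics/filedrop
---

If the IDs listed under /brokers/ids don't appear anywhere in the partition assignments, that's exactly the mismatch I'm describing.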
Also, changing the default replication factor will not affect existing topics, so that change doesn't matter here and it's not causing any more issues :-)

I think you may still be able to recover your data, given that:

a) the ZK downtime you mentioned didn't cause data loss / corruption (it doesn't seem so, from what you wrote)
b) you still have the "old" nodes with the data in the "old" directory

If that's the case, you should be able to recover the cluster to the "previous" state by simply reverting your changes, specifically the broker IDs and data dir settings (if the data is still in the old directory and the configuration in ZK still correctly refers to the old broker IDs, I can't see a reason why it wouldn't work). Once your cluster is up and the data is available, you're good to start over again.

If you want to start with changing the data directory, simply:

1. stop the Kafka process on one of the nodes
2. change the data dir in the config
3. move your data to the new directory
4. restart the Kafka process

Repeat for the other nodes.

To increase the RF you have to use the bin/kafka-reassign-partitions.sh tool. I'd suggest referring to the Confluent documentation ( https://docs.confluent.io/1.0.1/kafka/post-deployment.html#increasing-replication-factor ) for the details, as it's explained very well there. While it's a Confluent doc, this specific CLI is not Confluent Platform specific, so it will work for you as well. This may take a while depending on your data size. (I've also put a rough sketch of the reassignment JSON and commands at the very bottom of this mail, below your quoted message.)

Alternatively, if you started by increasing the replication factor, then to change your data dir you can simply stop Kafka, change the data dir in the config, delete the old data directory, and start the Kafka process again. Kafka will take care of getting the missing data from the other brokers and putting it in the new data dir. Keep in mind that while it's one step less than what I proposed above, it means transferring all the data over the network - if you have a lot of data, it might be a bad idea. Also, be absolutely sure that your data is correctly replicated - if it's not, deleting the data from that broker (obviously) means data loss.

And one piece of advice to keep in mind when dealing with Kafka in general: DO NOT CHANGE BROKER IDS for brokers with data, unless you know exactly what you're doing and have a good reason to do it - it will save you from many problems :-)

Kind regards,
Michał

On 1 February 2018 at 13:28, Traiano Welcome <trai...@gmail.com> wrote:

> Hi all,
> I reconfigured my kafka cluster, changing:
>
> - default replication factor from 1 to 3 and also
> - changing the location of the kafka data dir on disk
>
> So after restarting all nodes, the cluster seemed ok, but then I noticed
> all the topics are failing to come online.
> In the logs there are messages like this for each topic:
>
> state-change.log: [2018-02-01 12:41:42,176] ERROR Controller 826437096
> epoch 19 initiated state change for partition [filedrop,0] from
> OfflinePartition to OnlinePartition failed (state.change.logger)
>
> So none of the topics are usable; listing topics with kafkacat -L -b shows
> leaders not available:
>
> ---
> Metadata for all topics (from broker -1: lol-045:9092/bootstrap):
>  7 brokers:
>   broker 826437096 at lol-044:9092
>   broker 746155422 at lol-047:9092
>   broker 651737161 at lol-046:9092
>   broker 728512596 at lol-048:9092
>   broker 213763378 at lol-045:9092
>   broker 622553932 at lol-049:9092
>   broker 746727274 at lol-050:9092
>  14 topics:
>   topic "lol.stripped" with 3 partitions:
>     partition 2, leader -1, replicas: , isrs: , Broker: Leader not available
>     partition 1, leader -1, replicas: , isrs: , Broker: Leader not available
>     partition 0, leader -1, replicas: , isrs: , Broker: Leader not available
> ---
>
> However, newly created topics are correctly replicated and healthy:
>
> ---
>   topic "lol-kafka-health" with 3 partitions:
>     partition 2, leader 622553932, replicas: 622553932,213763378,651737161, isrs: 622553932,213763378,651737161
>     partition 1, leader 213763378, replicas: 622553932,213763378,826437096, isrs: 213763378,826437096,622553932
>     partition 0, leader 826437096, replicas: 213763378,746727274,826437096, isrs: 826437096,746727274,213763378
> ---
>
> So I think some kind of metadata corruption happened during the
> reconfigure.
>
> My question is:
>
> - Is there any way I can get these topic partitions online again?
>
> Given that:
> - the broker ids were changed during the reconfigure
> - the zookeeper cluster for kafka went down temporarily during the reconfig
>
> In addition, are there some procedures I can use to investigate how
> recoverable these topics are?
>
> Many thanks in advance!
> Traiano
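P.S. Here is the rough sketch of the RF increase I mentioned above. It's only an illustration, not a ready-made plan: the topic name and broker IDs are simply taken from your kafkacat output, and the partition list / target replicas need to be adjusted per topic (the Confluent page I linked walks through the full procedure, including generating the JSON for you).

---
# increase-rf.json - desired replica lists (3 replicas per partition)
{"version":1,
 "partitions":[
   {"topic":"lol.stripped","partition":0,"replicas":[826437096,213763378,622553932]},
   {"topic":"lol.stripped","partition":1,"replicas":[213763378,622553932,651737161]},
   {"topic":"lol.stripped","partition":2,"replicas":[622553932,651737161,746155422]}
 ]}

# start the reassignment (point --zookeeper at the ensemble your brokers use)
bin/kafka-reassign-partitions.sh --zookeeper <zk-host>:2181 \
  --reassignment-json-file increase-rf.json --execute

# re-run with --verify until every reassignment is reported as completed
bin/kafka-reassign-partitions.sh --zookeeper <zk-host>:2181 \
  --reassignment-json-file increase-rf.json --verify
---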