Hi all,

Thanks for the help. Eventually, restarting the broker a second time (days later) triggered a full repair, and the cluster is happy again. I don't know why the first restart didn't fix the issue.

Greetings
Bart

On Fri, Oct 18, 2019, at 16:09, M. Manna wrote:
> In addition to what Peter said, I would recommend that you stop the broker
> and delete all data logs (if your replication factor is set correctly).
> Upon restart, they’ll be recreated. This is of course the last thing to do,
> if you cannot determine the root cause.
>
> The measure works well for me with my k8s deployment where a pod (broker)
> is killed and recreated upon Liveness Probe failure.
>
> Thanks,
>
> On Fri, 18 Oct 2019 at 10:06, Peter Bukowinski <[email protected]> wrote:
>
> > Hi Bart,
> >
> > Before changing anything, I would verify whether or not the affected
> > broker is trying to catch up. Have you looked at the broker’s log? Do you
> > see any errors? Check your metrics or the partition directories themselves
> > to see if data is flowing into the broker.
> >
> > If you do want to reset the broker to have it start a fresh resync, stop
> > the kafka broker service/process, 'rm -rf /path/to/kafka-logs' (check the
> > value of your log.dir or log.dirs property in your server.properties file
> > for the path), and then start the service again. It should check in with
> > zookeeper and then start following the topic partition leaders for all the
> > topic partition replicas assigned to it.
> >
> > -- Peter
> >
> > On Oct 18, 2019, at 12:16 AM, Bart van Deenen <[email protected]> wrote:
> >
> > > Hi all
> > >
> > > We had a Kafka broker failure (too many open files, stupid), and now the
> > > partitions on that broker will no longer become part of the ISR set. It's
> > > been a few days (organizational issues), and we have significant amounts
> > > of data on the ISR partitions.
> > >
> > > In order to make the partitions on the broker become part of the ISR set
> > > again, should I:
> > >
> > > * increase `replica.lag.time.max.ms` on the broker to the number of ms
> > >   that the partitions are behind. I can guesstimate the value to about
> > >   7 days, or should I measure it somehow?
> > > * stop the broker and wipe files (which ones?) and then restart it.
> > >   Should I also do stuff on zookeeper?
> > >
> > > Is there any _official_ information on how to deal with this situation?
> > >
> > > Thanks for helping!
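
For anyone finding this thread later: Peter's reply says to check log.dir or log.dirs in server.properties before removing anything. Below is a minimal Python sketch of that lookup, assuming a plain key=value properties file with no line continuations. The function name kafka_log_dirs and the sample paths are hypothetical, not taken from the thread. The sketch only reads the setting and deletes nothing.

```python
from typing import Dict, List


def kafka_log_dirs(properties_text: str) -> List[str]:
    """Return the data directories named in a server.properties body.

    Assumption: simple `key=value` lines. `log.dirs` (comma-separated)
    is used when present; otherwise the single `log.dir` value is used.
    """
    props: Dict[str, str] = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    value = props.get("log.dirs") or props.get("log.dir") or ""
    return [d.strip() for d in value.split(",") if d.strip()]


# Hypothetical excerpt; real paths come from your own server.properties.
sample = """
# broker settings
broker.id=3
log.dirs=/data/kafka-a, /data/kafka-b
"""
print(kafka_log_dirs(sample))  # ['/data/kafka-a', '/data/kafka-b']
```

Printing the directories first, and looking at them while the broker is stopped, makes it harder to wipe the wrong path.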
