Re: Broker that stays outside of the ISR, how to recover

Bart van Deenen Sat, 26 Oct 2019 07:56:06 -0700

Hi all

Thanks for the help. Eventually restarting the broker a second time (days 
later) triggered a full repair, and the cluster is happy again. I don't know 
why the first restart didn't fix the issue.


Greetings

Bart
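
For later readers: one way to confirm that a restarted broker has rejoined the ISR is to inspect the output of `kafka-topics.sh --describe` for the affected topics (the connection flags depend on your Kafka version). Below is a minimal sketch that parses that output and lists the partitions where a given broker is an assigned replica but is missing from the ISR. The function name is hypothetical, and the parser assumes the usual `Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...` line layout; adjust the pattern if your version prints something different.

```python
import re

# Matches per-partition lines of `kafka-topics.sh --describe` output.
# Assumed layout: Topic: t  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2
_PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+"
    r"Replicas:\s*([\d,]*)\s+Isr:\s*([\d,]*)"
)


def partitions_missing_from_isr(describe_output, broker_id):
    """Return (topic, partition) pairs where broker_id is a replica but not in the ISR."""
    missing = []
    for line in describe_output.splitlines():
        match = _PARTITION_LINE.search(line)
        if not match:
            continue  # skip topic summary lines and anything unrecognized
        topic, partition, _leader, replicas, isr = match.groups()
        replica_ids = {int(x) for x in replicas.split(",") if x}
        isr_ids = {int(x) for x in isr.split(",") if x}
        if broker_id in replica_ids and broker_id not in isr_ids:
            missing.append((topic, int(partition)))
    return missing
```

An empty result for the broker's id means every partition assigned to it is back in sync; a list that shrinks between runs suggests the broker is catching up.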

On Fri, Oct 18, 2019, at 16:09, M. Manna wrote:
> In addition to what Peter said, I would recommend that you stop and delete
> all data logs (if your replication factor is set correctly). Upon restart,
> they’ll be recreated. This is of course the last thing to do if you
> cannot determine the root cause.
> 
> The measure works well for me with my k8s deployment where a pod (broker)
> is killed and recreated upon Liveness Probe failure.
> 
> Thanks,
> 
> 
> On Fri, 18 Oct 2019 at 10:06, Peter Bukowinski <[email protected]> wrote:
> 
> > Hi Bart,
> >
> > Before changing anything, I would verify whether or not the affected
> > broker is trying to catch up. Have you looked at the broker’s log? Do you
> > see any errors? Check your metrics or the partition directories themselves
> > to see if data is flowing into the broker.
> >
> > If you do want to reset the broker to have it start a fresh resync, stop
> > the kafka broker service/process, 'rm -rf /path/to/kafka-logs' — check the
> > value of your log.dir or log.dirs property in your server.properties file
> > for the path — and then start the service again. It should check in with
> > zookeeper and then start following the topic partition leaders for all the
> > topic partition replicas assigned to it.
> >
> > -- Peter
> >
> > > On Oct 18, 2019, at 12:16 AM, Bart van Deenen <[email protected]> wrote:
> > > Hi all
> > >
> > > We had a Kafka broker failure (too many open files, stupid), and now the
> > > partitions on that broker will no longer become part of the ISR set. It's
> > > been a few days (organizational issues), and we have significant amounts
> > > of data on the ISR partitions.
> > >
> > > In order to make the partitions on the broker become part of the ISR set
> > > again, should I:
> > >
> > > * increase `replica.lag.time.max.ms` on the broker to the number of ms
> > >   that the partitions are behind. I can guesstimate the value to about 7
> > >   days, or should I measure it somehow?
> > > * stop the broker and wipe files (which ones?) and then restart it.
> > >   Should I also do stuff on zookeeper?
> > >
> > > Is there any _official_ information on how to deal with this situation?
> > >
> > > Thanks for helping!
> >
>
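
Since Peter's reset procedure depends on finding the right data directories first, here is a minimal sketch of that lookup: it reads the text of `server.properties` and returns the directories a broker would use, preferring `log.dirs` over `log.dir` and falling back to Kafka's default of `/tmp/kafka-logs` when neither is set. The function name is hypothetical, and the parser is a simplification that handles only `key=value` lines and comments, not the full Java properties syntax (no line continuations or `:` separators).

```python
def kafka_log_dirs(properties_text):
    """Return the list of data directories configured in a server.properties text."""
    props = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    # log.dirs takes precedence; log.dir is used only when log.dirs is absent.
    value = props.get("log.dirs") or props.get("log.dir") or "/tmp/kafka-logs"
    return [d.strip() for d in value.split(",") if d.strip()]
```

Printing the result before deleting anything is a cheap safeguard: every directory in the list holds partition data for that broker, so each one would need to be cleared for a full resync.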
