Hi Jun,
I have updated KIP-501 with more details. Please take a look and
provide your comments.

This issue has occurred several times in multiple production
environments (Uber, Yelp, Twitter, etc.).
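
To make the scenario discussed in the thread below concrete, here is a rough
Python sketch of the two ways of deciding whether a follower is out of sync.
It is only an illustration under simplifying assumptions, not the broker code
or the exact rule in the KIP: FollowerState, out_of_sync_current and
out_of_sync_proposed are hypothetical names, the 10-second limit is just an
example value for replica.lag.time.max.ms, and a "pending fetch" here means a
fetch request that a request handler thread has already started processing.

```python
from dataclasses import dataclass
from typing import Optional

# Example value only; stands in for replica.lag.time.max.ms.
REPLICA_LAG_TIME_MAX_MS = 10_000


@dataclass
class FollowerState:
    """Hypothetical per-follower bookkeeping on the leader (not Kafka's real class)."""
    last_caught_up_ms: int                          # last time the follower was fully caught up
    pending_fetch_arrival_ms: Optional[int] = None  # arrival time of a fetch still being served


def out_of_sync_current(f, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Simplified current behaviour: only the time since last catch-up counts."""
    return now_ms - f.last_caught_up_ms > max_lag_ms


def out_of_sync_proposed(f, now_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Simplified reading of the proposal: a fetch that arrived in time keeps
    the follower in the ISR, however long the leader's disk read takes."""
    if (f.pending_fetch_arrival_ms is not None
            and f.pending_fetch_arrival_ms - f.last_caught_up_ms <= max_lag_ms):
        return False
    return out_of_sync_current(f, now_ms, max_lag_ms)


# Follower whose fetch arrived at t=1s and is still stuck on a slow disk read.
slow_disk = FollowerState(last_caught_up_ms=0, pending_fetch_arrival_ms=1_000)
# Follower whose fetch is still waiting in the request queue (nothing tracked).
queued = FollowerState(last_caught_up_ms=0)
now_ms = 15_000

print(out_of_sync_current(slow_disk, now_ms))   # True: follower is removed from the ISR
print(out_of_sync_proposed(slow_disk, now_ms))  # False: follower stays in the ISR
print(out_of_sync_proposed(queued, now_ms))     # True: a queued request does not help
```

The last case is the limitation Lucas pointed out: in this sketch a fetch
request that never leaves the request queue is not tracked, so it cannot
prevent a shrink.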


On Thu, 13 Feb 2020 at 17:04, Satish Duggana <satish.dugg...@gmail.com> wrote:
> Hi Lucas,
> Thanks for looking into the KIP and providing your comments.
> Adding to what Harsha mentioned, I do not think there is a foolproof
> solution here for cases like pending requests in the request
> queue. We also thought about the option of relinquishing the
> leadership but the followers might have been already out of ISR which
> will result in offline partitions. This was added as a rejected
> alternative in the KIP.
> The broker should try its best to keep the followers that are sending
> fetch requests in sync.
> ~Satish.
> On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Lucas,
> > Yes, the case you mentioned is true. I do understand that KIP-501
> > might not fully solve this particular use case, where there might be blocked
> > fetch requests. But the issue we have noticed multiple times, and continue to
> > notice, is:
> >    1. A fetch request comes from a follower.
> >    2. The leader tries to fetch data from disk, which takes longer than
> >    replica.lag.time.max.ms.
> >    3. The async thread on the leader side that checks the ISR marks the
> >    follower that sent the fetch request as not in the ISR.
> >    4. The leader dies during this request due to disk errors, and now we
> >    have offline partitions because the leader kicked healthy followers out
> >    of the ISR.
> >
> > Instead of considering this as a disk issue, let's look at how we maintain
> > the ISR:
> >
> >    1. Currently we do not consider a follower healthy even when it is able
> >    to send fetch requests.
> >    2. The ISR is controlled by how healthy the leader broker is, i.e. if a
> >    read takes longer than replica.lag.time.max.ms, we mark followers out of
> >    sync instead of relinquishing the leadership.
> >
> >
> > What we are proposing in this KIP is that we look at the time when a
> > follower sends a fetch request and use that as the basis for marking the
> > follower out of the ISR or keeping it in the ISR, leaving the disk read
> > time on the leader side out of this.
> >
> > Thanks,
> > Harsha
> >
> >
> >
> > On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Harsha,
> > >
> > > Is the problem you'd like addressed the following?
> > >
> > > Assume 3 replicas, L and F1 and F2.
> > >
> > > 1. F1 and F2 are alive and sending fetch requests to L.
> > > 2. L starts encountering disk issues, and any requests being processed by
> > > the request handler threads become blocked.
> > > 3. L's zookeeper connection is still alive so it remains the leader for
> > > the partition.
> > > 4. Given that F1 and F2 have not successfully fetched, L shrinks the ISR
> > > to itself.
> > >
> > > While KIP-501 may help prevent a shrink in partitions where a replica
> > > fetch request has started processing, any fetch requests in the request
> > > queue will have no effect. Generally when these slow/failing disk issues
> > > occur, all of the request handler threads end up blocked and requests
> > > queue up in the request queue. For example, all of the request handler
> > > threads may end up stuck in KafkaApis.handleProduceRequest handling
> > > produce requests, at which point all of the replica fetcher fetch
> > > requests remain queued in the request queue. If this happens, there will
> > > be no tracked fetch requests to prevent a shrink.
> > >
> > > Solving this shrinking issue is tricky. It would be better if L resigns
> > > leadership when it enters a degraded state rather than avoiding a shrink.
> > > If L is no longer the leader in this situation, it will eventually become
> > > blocked fetching from the new leader and the new leader will shrink the
> > > ISR, kicking out L.
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
