Hi Jun, I updated KIP-501 with more details. Please take a look and provide your comments.
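To make the core idea easier to evaluate, here is a rough Scala sketch of what the KIP proposes: record the time at which a follower's fetch request reaches the leader, and use that time, rather than the time the leader finishes serving the read from disk, in the out-of-sync check. The names here (FollowerFetchTracker, recordFetchReceived, isOutOfSync) are only illustrative and are not from the actual patch:

import scala.collection.concurrent.TrieMap

// Illustrative only, not the real KIP-501 change: base the out-of-sync
// decision on when a follower's fetch request was received, not on when the
// leader finished reading the data from disk.
class FollowerFetchTracker(replicaLagTimeMaxMs: Long) {

  // followerId -> timestamp (ms) at which that follower's last fetch request
  // was received by the leader.
  private val lastFetchReceivedMs = TrieMap.empty[Int, Long]

  // Called as soon as a follower fetch request arrives, i.e. before any
  // (possibly slow) disk read happens on the leader.
  def recordFetchReceived(followerId: Int, nowMs: Long): Unit = {
    lastFetchReceivedMs.put(followerId, nowMs)
  }

  // Called from the periodic ISR maintenance check. A follower that has sent
  // a fetch request within replica.lag.time.max.ms is treated as in sync,
  // even if the leader's disk is slow to serve the read.
  def isOutOfSync(followerId: Int, nowMs: Long): Boolean =
    nowMs - lastFetchReceivedMs.getOrElse(followerId, 0L) > replicaLagTimeMaxMs
}

The important point is that recordFetchReceived runs before the disk read, so a leader with a slow or failing disk does not end up kicking healthy, actively fetching followers out of the ISR.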
This issue occurred several times in multiple production environments (Uber, Yelp, Twitter, etc.).

Thanks,
Satish.

On Thu, 13 Feb 2020 at 17:04, Satish Duggana <satish.dugg...@gmail.com> wrote:
>
> Hi Lucas,
> Thanks for looking into the KIP and providing your comments.
>
> Adding to what Harsha mentioned, I do not think there is a foolproof
> solution here for cases like pending requests in the request queue. We also
> thought about the option of relinquishing leadership, but the followers
> might already be out of the ISR, which would result in offline partitions.
> This was added as a rejected alternative in the KIP.
> The broker should try its best to keep the followers (those sending fetch
> requests) in sync.
>
> ~Satish.
>
> On Tue, Feb 11, 2020 at 11:45 PM Harsha Chintalapani <ka...@harsha.io> wrote:
> >
> > Hi Lucas,
> > Yes, the case you mentioned is true. I do understand that KIP-501 might
> > not fully solve this particular use case where there might be blocked
> > fetch requests. But the issue we noticed multiple times, and continue to
> > notice, is:
> > 1. A fetch request comes from the follower.
> > 2. The leader tries to fetch data from disk, which takes longer than
> > replica.lag.time.max.ms.
> > 3. The async thread on the leader side which checks the ISR marks the
> > follower that sent the fetch request as not in the ISR.
> > 4. The leader dies during this request due to disk errors, and now we
> > have offline partitions because the leader kicked healthy followers out
> > of the ISR.
> >
> > Instead of treating this as a disk issue, let's look at how we maintain
> > the ISR:
> >
> > 1. Currently we do not consider a follower healthy even when it is able
> > to send fetch requests.
> > 2. ISR membership is based on how healthy a broker is, i.e. if a follower
> > takes longer than replica.lag.time.max.ms we mark it out of sync instead
> > of relinquishing the leadership.
> >
> > What we are proposing in this KIP is to look at the time when a follower
> > sends a fetch request, use that as the basis for marking the follower out
> > of the ISR or keeping it in the ISR, and leave the disk read time on the
> > leader side out of it.
> >
> > Thanks,
> > Harsha
> >
> >
> > On Mon, Feb 10, 2020 at 9:26 PM, Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Harsha,
> > >
> > > Is the problem you'd like addressed the following?
> > >
> > > Assume 3 replicas: L, F1, and F2.
> > >
> > > 1. F1 and F2 are alive and sending fetch requests to L.
> > > 2. L starts encountering disk issues; any requests being processed by
> > > the request handler threads become blocked.
> > > 3. L's ZooKeeper connection is still alive, so it remains the leader
> > > for the partition.
> > > 4. Given that F1 and F2 have not successfully fetched, L shrinks the
> > > ISR to itself.
> > >
> > > While KIP-501 may help prevent a shrink in partitions where a replica
> > > fetch request has started processing, any fetch requests still sitting
> > > in the request queue will have no effect. Generally, when these
> > > slow/failing disk issues occur, all of the request handler threads end
> > > up blocked and requests queue up in the request queue. For example, all
> > > of the request handler threads may end up stuck in
> > > KafkaApis.handleProduceRequest handling produce requests, at which point
> > > all of the replica fetcher fetch requests remain queued in the request
> > > queue. If this happens, there will be no tracked fetch requests to
> > > prevent a shrink.
> > >
> > > Solving this shrinking issue is tricky.
> > > It would be better if L resigns leadership when it enters a degraded
> > > state rather than avoiding a shrink. If L is no longer the leader in
> > > this situation, it will eventually become blocked fetching from the new
> > > leader, and the new leader will shrink the ISR, kicking out L.
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
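For completeness, the "resign leadership when degraded" direction that Lucas describes above, which the KIP lists as a rejected alternative, could look roughly like the sketch below. This is purely illustrative: DegradedStateMonitor and the resignLeadership callback are made-up names, not existing Kafka APIs, and reliably detecting the degraded state is exactly the hard part.

import scala.collection.mutable

// Illustrative only: the leader watches its own log read latencies and, once
// a full window of reads has been slow, asks to give up leadership instead of
// shrinking the ISR.
class DegradedStateMonitor(maxReadLatencyMs: Long,
                           windowSize: Int,
                           resignLeadership: () => Unit) {

  private val recentReadLatenciesMs = mutable.Queue.empty[Long]
  private var resigned = false

  // Called after every log read on the leader with the observed latency.
  def recordReadLatency(latencyMs: Long): Unit = {
    recentReadLatenciesMs.enqueue(latencyMs)
    if (recentReadLatenciesMs.size > windowSize) recentReadLatenciesMs.dequeue()

    // If every read in the recent window was slow, treat the disk as degraded
    // and resign leadership so a healthy replica can take over.
    if (!resigned && recentReadLatenciesMs.size == windowSize &&
        recentReadLatenciesMs.forall(_ > maxReadLatencyMs)) {
      resigned = true
      resignLeadership()
    }
  }
}

The reason this was rejected in the KIP is the one mentioned above: by the time the leader decides to resign, the followers may already have been kicked out of the ISR, leaving no eligible leader and therefore offline partitions.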