Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-07-14 Thread Satish Duggana
Hi Jun, Thanks for looking into the KIP and providing your comments. > 1. For Solution 2, we probably want to be a bit careful with letting each broker automatically relinquish leadership. The danger of doing that is if all brokers start doing the same (say due to increased data volume), the

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-07-07 Thread Jun Rao
Hi, Satish, Thanks for the KIP. 1. For Solution 2, we probably want to be a bit careful with letting each broker automatically relinquish leadership. The danger of doing that is if all brokers start doing the same (say due to increased data volume), the whole cluster could get into a state with

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-30 Thread Satish Duggana
> That clarification in the document helps. But then setting the first option to true does not necessarily mean that the condition is happening. Did you mean to say that relinquish the leadership if it is taking longer than leader.fetch.process.time.max.ms AND there are fetch requests

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-28 Thread Mohan Parthasarathy
Hi Satish, > > > >It is not clear to me whether Solution 2 can happen independently. For example, if the leader exceeds *leader.fetch.process.time.max.ms* due to a transient condition, should it relinquish leadership immediately? That might be

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-28 Thread Satish Duggana
Hi Mohan, Please find my inline comments below. >One small clarification regarding the proposal. I understand how Solution (1) enables the other replicas to be chosen as the leader. But it is possible that the other replicas may not be in sync yet and if unclean leader election is not enabled,

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-27 Thread Mohan Parthasarathy
Hi Satish, One small clarification regarding the proposal. I understand how Solution (1) enables the other replicas to be chosen as the leader. But it is possible that the other replicas may not be in sync yet and if unclean leader election is not enabled, the other replicas may not become the

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-27 Thread Satish Duggana
Hi Dhruvil, Thanks for looking into the KIP and providing your comments. There are two problems in the scenario raised in this KIP: a) The leader is slow and is not available for reads or writes. b) The leader causes the followers to fall out of sync, which makes the partitions unavailable. (a)

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-26 Thread Dhruvil Shah
Thanks for the KIP, Satish. I am trying to understand the problem we are looking to solve with this KIP. When the leader is slow in processing fetch requests from the follower (due to disk, GC, or other reasons), the primary problem is that it could impact read and write latency and at times

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-23 Thread Ryanne Dolan
Satish, we encounter this frequently and consider it a major bug. Your solution makes sense to me. Ryanne On Tue, Jun 22, 2021, 7:29 PM Satish Duggana wrote: > Hi, Bumping up the discussion thread on KIP-501 about avoiding out-of-sync or offline partitions when follower fetch requests are

[DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

2021-06-22 Thread Satish Duggana
Hi, Bumping up the discussion thread on KIP-501 about avoiding out-of-sync or offline partitions when follower fetch requests are not processed in time by the leader replica. This issue has occurred several times in multiple production environments (at Uber, Yelp, Twitter, etc.). KIP-501 is located