Zhanxiang (Patrick) Huang created KAFKA-8571:
------------------------------------------------

             Summary: Not complete delayed produce requests when processing 
StopReplicaRequest causing high produce latency for acks=all
                 Key: KAFKA-8571
                 URL: https://issues.apache.org/jira/browse/KAFKA-8571
             Project: Kafka
          Issue Type: Bug
            Reporter: Zhanxiang (Patrick) Huang
            Assignee: Zhanxiang (Patrick) Huang


Currently a broker only attempts to complete delayed requests when the high 
watermark changes or when it receives a LeaderAndIsrRequest. When a broker 
receives a StopReplicaRequest, it does not try to complete delayed operations, 
including delayed produce requests for acks=all. As a result the producer can 
time out, even though it would have retried against the new leader sooner had 
a NotLeaderForPartition error been returned.

This can happen during partition reassignment when the controller is trying to 
remove the previous leader from the replica set. In this case, the controller 
sends only a StopReplicaRequest (not a LeaderAndIsrRequest) to the previous 
leader in the replica set shrink phase. Here is an example:
{noformat}
While reassigning the replica set of partition A from [B1, B2] to [B2, B3]:
t0: Controller expands the replica set to [B1, B2, B3]

t1: B1 receives produce request PR on partition A with acks=all and timeout T. 
B1 puts PR into the DelayedProducePurgatory with timeout T.

t2: Controller elects B2 as the new leader and shrinks the replica set to [B2, 
B3]. LeaderAndIsrRequests are sent to B2 and B3. StopReplicaRequest is sent to 
B1.

t3: B1 receives StopReplicaRequest but doesn't try to complete PR.

If PR cannot be fulfilled by t3, and t1 + T > t3, PR will eventually time out 
in the purgatory and the producer will eventually time out the produce 
request.{noformat}
Since it is possible for the leader to receive only a StopReplicaRequest 
(without receiving any LeaderAndIsrRequest) when it leaves the replica set, a 
fix for this issue is to also try to complete delayed operations when 
processing StopReplicaRequest.
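
To make the effect of the proposed fix concrete, here is a minimal toy model of 
the timeline above, written in Java. It is only a sketch under stated 
assumptions: every class, method, and field name (StopReplicaSketch, Broker, 
checkAndComplete, completeOnStopReplica, and so on) is hypothetical and does 
not mirror Kafka's actual classes or signatures, replication is assumed never 
to catch up, and t3 is assumed to fall before t1 + T.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the scenario above. Every name here is hypothetical and
// does not mirror Kafka's real classes or method signatures.
final class StopReplicaSketch {

    enum Result { PENDING, NOT_LEADER_FOR_PARTITION, REQUEST_TIMED_OUT }

    // A produce request with acks=all that is waiting in the purgatory.
    static final class DelayedProduce {
        final String partition;
        final long deadlineMs;              // t1 + T
        Result result = Result.PENDING;
        long completedAtMs = -1L;

        DelayedProduce(String partition, long deadlineMs) {
            this.partition = partition;
            this.deadlineMs = deadlineMs;
        }

        // Completes the operation once; later calls are ignored.
        void complete(Result r, long nowMs) {
            if (result == Result.PENDING) {
                result = r;
                completedAtMs = nowMs;
            }
        }
    }

    static final class Broker {
        private final Set<String> ledPartitions = new HashSet<>();
        private final Map<String, List<DelayedProduce>> purgatory = new HashMap<>();
        private final boolean completeOnStopReplica;   // models the proposed fix

        Broker(boolean completeOnStopReplica) {
            this.completeOnStopReplica = completeOnStopReplica;
        }

        void becomeLeader(String partition) {
            ledPartitions.add(partition);
        }

        // t1: park the request. Replication never catches up in this sketch.
        DelayedProduce produce(String partition, long nowMs, long timeoutMs) {
            DelayedProduce op = new DelayedProduce(partition, nowMs + timeoutMs);
            purgatory.computeIfAbsent(partition, k -> new ArrayList<>()).add(op);
            return op;
        }

        // t3: this broker is removed from the replica set.
        void handleStopReplica(String partition, long nowMs) {
            ledPartitions.remove(partition);
            if (completeOnStopReplica) {
                checkAndComplete(partition, nowMs);
            }
        }

        // Fail pending operations fast once this broker no longer leads the partition.
        private void checkAndComplete(String partition, long nowMs) {
            if (ledPartitions.contains(partition)) {
                return;
            }
            List<DelayedProduce> ops = purgatory.remove(partition);
            if (ops == null) {
                return;
            }
            for (DelayedProduce op : ops) {
                op.complete(Result.NOT_LEADER_FOR_PARTITION, nowMs);
            }
        }

        // Time out whatever is still parked and has reached its deadline.
        void expire(long nowMs) {
            for (List<DelayedProduce> ops : purgatory.values()) {
                for (DelayedProduce op : ops) {
                    if (op.deadlineMs <= nowMs) {
                        op.complete(Result.REQUEST_TIMED_OUT, op.deadlineMs);
                    }
                }
            }
        }
    }

    // Replays t1 (produce) and t3 (StopReplicaRequest) on B1, then the deadline t1 + T.
    static DelayedProduce simulate(boolean withFix, long t1, long timeoutMs, long t3) {
        Broker b1 = new Broker(withFix);
        b1.becomeLeader("A");
        DelayedProduce pr = b1.produce("A", t1, timeoutMs);
        b1.handleStopReplica("A", t3);
        b1.expire(t1 + timeoutMs);
        return pr;
    }

    public static void main(String[] args) {
        for (boolean withFix : new boolean[] {false, true}) {
            DelayedProduce pr = simulate(withFix, 100L, 30_000L, 500L);
            System.out.println("fix=" + withFix + " -> " + pr.result
                    + " at t=" + pr.completedAtMs);
        }
    }
}
```

With t1 = 100, T = 30000 and t3 = 500, the model without the fix leaves PR 
parked until it times out at t = 30100, while the model with the fix completes 
PR with NOT_LEADER_FOR_PARTITION at t = 500, so the producer can retry against 
the new leader right away.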

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)