[
https://issues.apache.org/jira/browse/KAFKA-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-8571.
------------------------------------
Resolution: Fixed
> Not complete delayed produce requests when processing StopReplicaRequest
> causing high produce latency for acks=all
> ------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-8571
> URL: https://issues.apache.org/jira/browse/KAFKA-8571
> Project: Kafka
> Issue Type: Bug
> Reporter: Zhanxiang (Patrick) Huang
> Assignee: Zhanxiang (Patrick) Huang
> Priority: Major
>
> Currently a broker will only attempt to complete delayed requests upon
> high watermark changes and on receiving a LeaderAndIsrRequest. When a broker
> receives a StopReplicaRequest, it will not try to complete delayed operations,
> including delayed produce for acks=all. This can cause the producer to
> time out, even though the producer would have attempted to talk to the new
> leader sooner if a NotLeaderForPartition error had been sent.
> This can happen during partition reassignment when the controller is trying to
> kick the previous leader out of the replica set. In this case, the controller
> will only send a StopReplicaRequest (not a LeaderAndIsrRequest) to the previous
> leader in the replica set shrink phase. Here is an example:
> {noformat}
> While reassigning the replica set of partition A from [B1, B2] to [B2, B3]:
> t0: Controller expands the replica set to [B1, B2, B3]
> t1: B1 receives produce request PR on partition A with acks=all and timeout
> T. B1 puts PR into the DelayedProducePurgatory with timeout T.
> t2: Controller elects B2 as the new leader and shrinks the replica set to
> [B2, B3]. LeaderAndIsrRequests are sent to B2 and B3. StopReplicaRequest is
> sent to B1.
> t3: B1 receives StopReplicaRequest but doesn't try to complete PR.
> If PR cannot be fulfilled by t3, and t1 + T > t3, PR will eventually time
> out in the purgatory and the producer will eventually time out the produce
> request.{noformat}
> Since it is possible for the leader to receive only a StopReplicaRequest
> (without receiving any LeaderAndIsrRequest) when it leaves the replica set, a
> fix for this issue is to also try to complete delayed operations when
> processing a StopReplicaRequest.
>
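The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `BrokerSketch`, `DelayedProduce`, `produceAcksAll`, `handleStopReplica`, `advanceClock` and the `completeOnStopReplica` flag are hypothetical names invented for illustration, not Kafka's actual classes or the actual patch, and the error strings are plain labels. It shows a parked acks=all request either waiting for its timeout (old behaviour) or being completed with a not-leader error as soon as the StopReplicaRequest is handled (proposed behaviour).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; none of these types are Kafka's real classes.
final class DelayedProduce {
    final String partition;
    final long deadlineMs;
    String result; // null while the request is still parked in the purgatory

    DelayedProduce(String partition, long deadlineMs) {
        this.partition = partition;
        this.deadlineMs = deadlineMs;
    }
}

final class BrokerSketch {
    private final Set<String> ledPartitions = new HashSet<>();
    private final List<DelayedProduce> purgatory = new ArrayList<>();
    private final boolean completeOnStopReplica; // assumed switch: true = proposed fix

    BrokerSketch(boolean completeOnStopReplica) {
        this.completeOnStopReplica = completeOnStopReplica;
    }

    void becomeLeader(String partition) {
        ledPartitions.add(partition);
    }

    // Parks an acks=all request that cannot be acknowledged yet (t1 in the example).
    DelayedProduce produceAcksAll(String partition, long nowMs, long timeoutMs) {
        DelayedProduce op = new DelayedProduce(partition, nowMs + timeoutMs);
        if (ledPartitions.contains(partition)) {
            purgatory.add(op);
        } else {
            op.result = "NOT_LEADER_FOR_PARTITION";
        }
        return op;
    }

    // Models handling of a StopReplicaRequest (t3 in the example).
    void handleStopReplica(String partition) {
        ledPartitions.remove(partition);
        if (completeOnStopReplica) {
            // Proposed behaviour: fail parked requests right away so the
            // producer can refresh metadata and retry against the new leader.
            purgatory.removeIf(op -> {
                if (!op.partition.equals(partition)) {
                    return false;
                }
                op.result = "NOT_LEADER_FOR_PARTITION";
                return true;
            });
        }
    }

    // Expires every parked request whose deadline has passed.
    void advanceClock(long nowMs) {
        purgatory.removeIf(op -> {
            if (op.deadlineMs > nowMs) {
                return false;
            }
            op.result = "REQUEST_TIMED_OUT";
            return true;
        });
    }
}

final class StopReplicaSketch {
    public static void main(String[] args) {
        for (boolean fix : new boolean[] {false, true}) {
            BrokerSketch b1 = new BrokerSketch(fix);
            b1.becomeLeader("A");
            DelayedProduce pr = b1.produceAcksAll("A", 0L, 30_000L);
            b1.handleStopReplica("A");
            System.out.println("fix=" + fix + " after StopReplica: " + pr.result);
            b1.advanceClock(30_000L);
            System.out.println("fix=" + fix + " after timeout T:   " + pr.result);
        }
    }
}
```

In this model, without the flag the request stays parked after the StopReplicaRequest and only resolves when the clock reaches t1 + T; with the flag it resolves at t3, which is the latency difference the issue describes.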
--
This message was sent by Atlassian Jira
(v8.3.4#803005)