[ https://issues.apache.org/jira/browse/KAFKA-5678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112124#comment-16112124 ]

cuiyang edited comment on KAFKA-5678 at 8/3/17 3:40 AM:
--------------------------------------------------------

[~becket_qin]
Hi Jiang, thank you for your patience. I have two points of confusion about your 
explanation:
# I think the producer doesn't retry on request timeout:
{code:java}
        RecordAccumulator::abortExpiredBatches()
            -> batch.expirationDone()
                -> this.done()
                    -> thunk.callback.onCompletion()
                        -> this.userCallback.onCompletion()
                           // removes the batch from the queue and invokes the
                           // user callback without any retries
{code}
     This is why we care so much about the request timeout rather than about other 
errors: retries do not help here even when we set retries to a non-zero value (see 
the sketch after this list).
# Even if we only migrate leaders without any broker going down (say we move all 
the leaders from broker A to brokers B, C, and D), we still face the scenario 
described by [~Json Tu]:
bq. "if there are some partition leader did not change in this request but not have enough replica, then it will not satisfy such code as below."
{code:scala}
        if (!produceMetadata.produceStatus.values.exists(p => p.acksPending))
            forceComplete()
{code}
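
To make point 1 concrete, here is a minimal sketch of the producer side (the broker 
address and topic name are placeholders, not taken from this issue): even with 
retries set to a non-zero value, a batch that expires in the accumulator is handed 
straight to the user callback as a TimeoutException and never enters the retry path.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;

public class ExpiredBatchDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("acks", "-1");                          // wait for all in-sync replicas
        props.put("retries", "3");                        // does NOT cover expired batches
        props.put("request.timeout.ms", "30000");         // batches older than this expire
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"),
                (metadata, exception) -> {
                    // An expired batch arrives here directly as a TimeoutException,
                    // without going through the retry handling in the sender.
                    if (exception instanceof TimeoutException) {
                        System.err.println("batch expired: " + exception.getMessage());
                    }
                });
        }
    }
}
{code}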



> When the broker graceful shutdown occurs, the producer side sends timeout.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5678
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.9.0.0, 0.10.0.0, 0.11.0.0
>            Reporter: tuyang
>
> Test environment as follows.
> 1. Kafka version: 0.9.0.1
> 2. Cluster with 3 brokers, with broker ids A, B, and C.
> 3. Topic with 6 partitions and 2 replicas, with 2 leader partitions on each 
> broker.
> We can reproduce the problem as follows.
> 1. We send messages as quickly as possible with acks=-1.
> 2. Suppose partition p0's leader is on broker A and we gracefully shut down broker 
> A, but a message is sent to p0 before the new leader is elected. The message is 
> appended to the leader replica successfully, but if the follower replica does not 
> catch up quickly enough, the shutting-down broker creates a DelayedProduce for 
> this request that waits for completion until request.timeout.ms.
> 3. Because of the ControlledShutdown request from broker A, the leader of 
> partition p0 is re-elected, and the replica on broker A becomes a follower before 
> the broker completes its shutdown. The DelayedProduce is then not triggered to 
> complete until it expires.
> 4. If broker A's shutdown takes too long, the producer only gets its response 
> after request.timeout.ms, which increases the producer send latency while we are 
> restarting brokers one by one.
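
As a rough illustration of step 4, the send latency during a rolling restart can be 
observed with a simple probe like the one below (the broker list and topic name are 
placeholders, not taken from the report):
{code:java}
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SendLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "A:9092,B:9092,C:9092"); // placeholder brokers
        props.put("acks", "-1");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                final long start = System.nanoTime();
                producer.send(new ProducerRecord<>("test-topic", "payload"),
                    (metadata, exception) -> {
                        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                        // While the leader's broker is shutting down gracefully, sends to
                        // the affected partition only complete after request.timeout.ms.
                        System.out.printf("latency=%dms error=%s%n", ms, exception);
                    });
            }
        }
    }
}
{code}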


