Joel Koshy created KAFKA-1911:
---------------------------------

             Summary: Log deletion on stopping replicas should be async
                 Key: KAFKA-1911
                 URL: https://issues.apache.org/jira/browse/KAFKA-1911
             Project: Kafka
          Issue Type: Bug
          Components: log, replication
            Reporter: Joel Koshy
            Assignee: Jay Kreps
             Fix For: 0.8.3


If a StopReplicaRequest sets delete=true, we call file.delete on the file 
message sets. I was under the impression that this is fast, but that does not 
seem to be the case.

On a partition reassignment in our cluster, the local time for the stop-replica 
request was nearly 30 seconds.

{noformat}
Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; ClientId: ;    DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 53 from client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
{noformat}

This ties up one API thread for the duration of the request.

In our case, the queue times for other requests also went up. Producers to the 
partition that had just been deleted on the old leader took a while to refresh 
their metadata (see KAFKA-1303) and eventually ran out of retries on some 
messages, leading to data loss.

I think the log deletion in this case should be fully asynchronous, although we 
need to handle the case where a broker responds immediately to the 
StopReplicaRequest but then goes down after deleting only some of the log 
segments.
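
One possible shape for this, as a rough sketch only and not what the broker 
currently does: rename the partition directory with a marker suffix (a cheap 
metadata operation), respond to the request, delete the files on a background 
thread, and on startup finish deleting any directory that still carries the 
marker. The class name AsyncLogDeleter, the ".delete" suffix, and the method 
names below are all hypothetical, and the sketch assumes the rename stays on 
the same filesystem.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

public class AsyncLogDeleter {
    // Hypothetical suffix marking a directory as scheduled for deletion.
    static final String DELETE_SUFFIX = ".delete";

    private final ExecutorService background = Executors.newSingleThreadExecutor();

    /** Renames the partition directory, then deletes it off the request thread. */
    public Future<?> scheduleDelete(Path partitionDir) throws IOException {
        Path marked = partitionDir.resolveSibling(partitionDir.getFileName() + DELETE_SUFFIX);
        // The rename is the only work done synchronously.
        Files.move(partitionDir, marked, StandardCopyOption.ATOMIC_MOVE);
        return background.submit(() -> deleteRecursively(marked));
    }

    /** On startup, finishes any deletion that a crash interrupted. */
    public void recover(Path logRoot) throws IOException {
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(logRoot, "*" + DELETE_SUFFIX)) {
            for (Path dir : dirs) {
                deleteRecursively(dir);
            }
        }
    }

    /** Deletes files first, then the directory that contained them. */
    static void deleteRecursively(Path dir) {
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void shutdown() {
        background.shutdown();
    }
}
```

With this approach the request handler would only pay for the rename, and a 
broker that dies partway through leaves behind a directory that recover() can 
identify by its suffix and clean up on restart.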



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
