Joel Koshy created KAFKA-1911:
---------------------------------

             Summary: Log deletion on stopping replicas should be async
                 Key: KAFKA-1911
                 URL: https://issues.apache.org/jira/browse/KAFKA-1911
             Project: Kafka
          Issue Type: Bug
          Components: log, replication
            Reporter: Joel Koshy
            Assignee: Jay Kreps
             Fix For: 0.8.3

If a StopReplicaRequest sets delete=true then we do a file.delete on the file message sets. I was under the impression that this is fast but it does not seem to be the case.

On a partition reassignment in our cluster the local time for stop replica took nearly 30 seconds.

{noformat}
Completed request:Name: StopReplicaRequest; Version: 0; CorrelationId: 467; ClientId: ; DeletePartitions: true; ControllerId: 1212; ControllerEpoch: 53 from client/...:45964;totalTime:29191,requestQueueTime:1,localTime:29190,remoteTime:0,responseQueueTime:0,sendTime:0
{noformat}

This ties up one API thread for the duration of the request.

Specifically in our case, the queue times for other requests also went up and producers to the partition that was just deleted on the old leader took a while to refresh their metadata (see KAFKA-1303) and eventually ran out of retries on some messages leading to data loss.

I think the log deletion in this case should be fully asynchronous although we need to handle the case when a broker may respond immediately to the stop-replica-request but then go down after deleting only some of the log segments.
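
One possible shape for the asynchronous approach is sketched below. This is not Kafka's code and is not a proposed patch. The class name AsyncLogDeleter, the ".deleted" suffix and both method names are hypothetical, chosen only for illustration. The idea is to do a cheap rename of the partition directory on the request thread, hand the actual file deletion to a background thread, and sweep for leftover renamed directories on startup so that a broker that went down mid-delete finishes the job. It assumes the rename stays on one filesystem, so that it can be atomic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical helper, not part of Kafka: illustrates rename-then-delete-in-background.
class AsyncLogDeleter {
    // Assumed marker suffix for directories whose deletion has been requested.
    static final String DELETED_SUFFIX = ".deleted";

    // Single daemon thread so slow deletes never occupy a request handler thread.
    private final ExecutorService background = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "async-log-deleter");
        t.setDaemon(true);
        return t;
    });

    // Called on the request path: only a rename happens here, so it returns quickly.
    Future<?> stopAndDelete(Path partitionDir) throws IOException {
        Path marked = partitionDir.resolveSibling(partitionDir.getFileName() + DELETED_SUFFIX);
        Files.move(partitionDir, marked, StandardCopyOption.ATOMIC_MOVE);
        return background.submit(() -> deleteRecursively(marked));
    }

    // Called on startup: finishes deletions that were interrupted by a crash.
    List<Future<?>> resumePendingDeletes(Path logRoot) throws IOException {
        List<Future<?>> pending = new ArrayList<>();
        try (DirectoryStream<Path> marked =
                 Files.newDirectoryStream(logRoot, "*" + DELETED_SUFFIX)) {
            for (Path dir : marked) {
                pending.add(background.submit(() -> deleteRecursively(dir)));
            }
        }
        return pending;
    }

    // Deletes children before parents by visiting paths in reverse sorted order.
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, the stop-replica handler would call stopAndDelete and respond right away, and broker startup would call resumePendingDeletes on each log directory before loading logs. The rename is what makes a partial delete safe to detect: any directory still carrying the suffix after a restart was on its way out and can be removed without further coordination.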