Hi Dong, thanks for this patch.
I happen to dig into this test, and it seems this test is flaky and can fail if 
the following sequence of events happens:
1. broker 0 creates the two partitions, topic-0, and topic-1, under different 
log dirs, say topic-0 under dir A, and topic-1 under dir B.
2. broker0 handles the log dir failure for dirB, as instructed by the test code.
3. The 1st StopReplicaRequest with delete=true is sent by the test code, and 
handled by broker 0, which removes topic-0 from the currentLogs.
4. a LogDirEventNotification is triggered on the controller through ZK because 
of the change in step 2, and the controller calls onBrokerLogDirFailure, which 
in turn calls ReplicaStateMachine#handleStateChanges for topic-0 [replica 0], 
and topic-1 [replica 0] with a target state of OnlineReplica. The 
ReplicaStateMachine#handleStateChanges call will trigger LeaderAndIsr requests 
to be sent to broker 0 with partition state isNew set to false for both topic-0 
and topic-1.
5. While handling the LeaderAndIsr request, broker0 eventually calls 
LogManager#getOrCreateLog for topic-0, and executes the logic
       if (!isNew && offlineLogDirs.nonEmpty)
          throw new KafkaStorageException(s"Can not create log for 
$topicPartition because log directories ${offlineLogDirs.mkString(",")} are 
offline")
The KafkaStorageException will be converted into a Kafka_STORAGE_ERROR, which 
will be returned to the controller.
6. When the controller processes LeaderAndIsrResponse and detects the error, it 
will eventually try to send a StopReplica, and a LeaderAndIsr request to 
broker0.
7. broker0 handles the StopReplica request as a Noop since topic-0 is already 
deleted. But it handles the LeaderAndISR request by marking topic-0 as an 
OfflinePartition in the LogManager.allPartitions
8. Since the state of topic-0 is OfflinePartition now, the 2nd round of 
StopReplica request sent by the test in the for loop will receive an error 
KAFKA_STORAGE_ERROR for topic-0, thus failing the test.

It seems the divergence starts happening when broker0 receives a StopReplica 
request from the test code instead of from the controller. This causes broker0 
to delete the replica without the controller knowing about the change. Please 
let me know if the description above makes sense, thanks!

[ Full content available at: https://github.com/apache/kafka/pull/5533 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to