Hi Dong, thanks for this patch.
I happened to dig into this test, and it seems to be flaky: it can fail if
the following sequence of events happens:
1. broker0 creates the two partitions, topic-0 and topic-1, under different
log dirs, say topic-0 under dir A and topic-1 under dir B.
2. broker0 handles the log dir failure for dir B, as instructed by the test code.
3. The 1st StopReplicaRequest with delete=true is sent by the test code and
handled by broker0, which removes topic-0 from currentLogs.
4. A LogDirEventNotification is triggered on the controller through ZK because
of the change in step 2, and the controller calls onBrokerLogDirFailure, which
in turn calls ReplicaStateMachine#handleStateChanges for topic-0 [replica 0]
and topic-1 [replica 0] with a target state of OnlineReplica. The
ReplicaStateMachine#handleStateChanges call triggers LeaderAndIsr requests
to be sent to broker0 with the partition state isNew set to false for both
topic-0 and topic-1.
5. While handling the LeaderAndIsr request, broker0 eventually calls
LogManager#getOrCreateLog for topic-0, and executes this check:
    if (!isNew && offlineLogDirs.nonEmpty)
      throw new KafkaStorageException(s"Can not create log for $topicPartition because log directories ${offlineLogDirs.mkString(",")} are offline")
The KafkaStorageException is converted into a KAFKA_STORAGE_ERROR, which
is returned to the controller.
6. When the controller processes the LeaderAndIsrResponse and detects the
error, it will eventually try to send a StopReplica request and a LeaderAndIsr
request to broker0.
7. broker0 handles the StopReplica request as a no-op since topic-0 is already
deleted. But it handles the LeaderAndIsr request by marking topic-0 as an
OfflinePartition in ReplicaManager.allPartitions.
8. Since the state of topic-0 is now OfflinePartition, the 2nd round of
StopReplica requests sent by the test in the for loop will receive a
KAFKA_STORAGE_ERROR for topic-0, thus failing the test.
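To make the interleaving easier to reason about, here is a rough toy model of the sequence above. It is only a sketch: ToyBroker, FlakySequence and their methods are hypothetical names invented for illustration, not Kafka's real classes, and the model collapses steps 4 to 7 (both LeaderAndIsr rounds from the controller) into a single leaderAndIsr call.

```scala
import scala.collection.mutable

// Toy model only: hypothetical names, NOT Kafka's real classes or APIs.
sealed trait ReplicaState
case object Online extends ReplicaState
case object Offline extends ReplicaState

final class ToyBroker {
  val offlineLogDirs = mutable.Set.empty[String]
  val currentLogs    = mutable.Map.empty[String, String]       // partition -> log dir
  val allPartitions  = mutable.Map.empty[String, ReplicaState] // partition -> state

  def createLog(tp: String, dir: String): Unit = {
    currentLogs(tp) = dir
    allPartitions(tp) = Online
  }

  // Step 2: every partition on the failed dir goes offline.
  def failLogDir(dir: String): Unit = {
    offlineLogDirs += dir
    currentLogs.filter(_._2 == dir).keys.toList.foreach { tp =>
      currentLogs -= tp
      allPartitions(tp) = Offline
    }
  }

  // Returns an error name; an offline partition yields KAFKA_STORAGE_ERROR.
  def stopReplica(tp: String, delete: Boolean): String =
    allPartitions.get(tp) match {
      case Some(Offline) => "KAFKA_STORAGE_ERROR"
      case _ =>
        if (delete) { currentLogs -= tp; allPartitions -= tp }
        "NONE"
    }

  // Simplification: the storage error and the offline marking happen in one call.
  def leaderAndIsr(tp: String, isNew: Boolean): String =
    if (currentLogs.contains(tp)) "NONE"
    else if (!isNew && offlineLogDirs.nonEmpty) {
      allPartitions(tp) = Offline
      "KAFKA_STORAGE_ERROR"
    } else {
      createLog(tp, "dirA")
      "NONE"
    }
}

object FlakySequence {
  // Returns the results of the 1st and 2nd StopReplica for topic-0.
  def run(controllerInterleaves: Boolean): (String, String) = {
    val broker = new ToyBroker
    broker.createLog("topic-0", "dirA")                        // step 1
    broker.createLog("topic-1", "dirB")
    broker.failLogDir("dirB")                                  // step 2
    val first = broker.stopReplica("topic-0", delete = true)   // step 3
    if (controllerInterleaves)
      broker.leaderAndIsr("topic-0", isNew = false)            // steps 4-7
    val second = broker.stopReplica("topic-0", delete = true)  // step 8
    (first, second)
  }
}
```

In this model, FlakySequence.run(controllerInterleaves = true) gives ("NONE", "KAFKA_STORAGE_ERROR"), while run(controllerInterleaves = false) gives ("NONE", "NONE"), so the outcome depends only on whether the controller's LeaderAndIsr lands between the two StopReplica rounds.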
It seems the divergence starts happening when broker0 receives a StopReplica
request from the test code instead of from the controller. This causes broker0
to delete the replica without the controller knowing about the change. Please
let me know if the description above makes sense, thanks!
[ Full content available at: https://github.com/apache/kafka/pull/5533 ]