[ https://issues.apache.org/jira/browse/KAFKA-14824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703449#comment-17703449 ]

hudeqi edited comment on KAFKA-14824 at 3/22/23 2:35 AM:
---------------------------------------------------------

For "potential exceptions may be throw", I did such an experiment: IOException 
was artificially injected into "processPartitionData" to simulate that 
ReplicaAlterLogDirsThread encountered a disk failure during the fetch process, 
in which the future log is in /data2 and fetching from /data1 (normal data 
retention time is 1h).
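
A minimal sketch of the kind of fault injection used (simplified from 
kafka.server.ReplicaAlterLogDirsThread; the exact method signature differs 
between versions, and the injection flag is a hypothetical test hook added only 
for this experiment):

{code:scala}
// Sketch only, not the real code: processPartitionData with an injected
// IOException to simulate a disk failure during the log dir move.
override def processPartitionData(topicPartition: TopicPartition,
                                  fetchOffset: Long,
                                  partitionData: FetchData): Option[LogAppendInfo] = {
  // Hypothetical flag, enabled only for this experiment.
  if (injectDiskFailure)
    throw new java.io.IOException("simulated disk failure while appending to the future log")

  // ... original logic: append the records fetched from the current log on
  // /data1 to the future log on /data2 ...
}
{code}
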
The result was the same serious impact as the earlier "offset mismatch error" 
(KAFKA-9087): once the exception thrown by "processPartitionData" was caught, 
the partition was marked as failed and log cleanup for it was never resumed, so 
compared with the other disks the log on /data1 grew without limit. I therefore 
think the defensive measure of resuming log cleanup for the source partition on 
failure is necessary.

The screenshot of the experiment can be found in the attachment.


> ReplicaAlterLogDirsThread may cause serious disk usage in case of unknown 
> exception
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-14824
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14824
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: hudeqi
>            Priority: Blocker
>         Attachments: 1.png, 2.png, 3.png, 4.png
>
>
> For ReplicaAlterLogDirsThread, if a partition is marked as failed due to an 
> unknown exception and its fetch is suspended, the paused log cleanup of that 
> partition needs to be resumed, otherwise it leads to serious, unexpected 
> growth in disk usage.
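>
> As context, a paraphrased sketch (not the exact source) of why the cleanup 
> stays paused once the fetcher gives up on the partition:
> {code:scala}
> // When the log dir move starts, cleaning of the current (source) log is
> // paused before the partition is handed to the ReplicaAlterLogDirsThread.
> logManager.abortAndPauseCleaning(topicPartition)
> replicaAlterLogDirsManager.addFetcherForPartitions(
>   Map(topicPartition -> initialFetchState))
> // Cleaning is resumed only when the future log successfully replaces the
> // current log. If the partition is instead marked as failed because of an
> // unknown exception, nothing resumes cleaning, so retention/compaction never
> // runs on the source log again.
> {code}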
>  
> For example, in our production environment (Kafka version 2.5.1) we hit the 
> following case: a log dir balance was performed on the partition's leader 
> broker. After the future log was created successfully and fetching started, 
> the future replica first reset and truncated to the leader's log start offset 
> because the fetch offset was out of range. At the same time, the partition 
> leader was processing a leaderAndIsrRequest and the leader epoch was updated, 
> so the ReplicaAlterLogDirsThread hit FENCED_LEADER_EPOCH and the partition was 
> removed from its 'partitionStates'. Meanwhile, the thread processing the 
> leaderAndIsrRequest was re-adding the partition to the 
> ReplicaAlterLogDirsThread, and the offset it set in InitialFetchState was the 
> leader's hw. When the ReplicaAlterLogDirsThread then executed 
> processFetchRequest, it threw "java.lang.IllegalStateException: Offset 
> mismatch for the future replica anti_fraud.data_collector.anticrawler_live-54: 
> fetched offset = 4979659327, log end offset = 4918576434.", with the result 
> that the ReplicaAlterLogDirsThread no longer fetched the partition. Because 
> log cleanup of the partition had been paused earlier, the disk usage of the 
> corresponding broker grew without limit, causing serious problems.
>  
> I found that trunk already fixes this particular bug in KAFKA-9087, i.e. the 
> "Offset mismatch" error that caused the ReplicaAlterLogDirsThread to stop 
> fetching. But might there be other unknown exceptions that, with the current 
> logic, lead to the same problem of log cleanup never being resumed?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
