[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1713803121

Hi, are you still following this PR? @viktorsomogyi @jolshan
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1706100541

> Sorry I'm late to the party but I had a question on this:
>
> > It seems like @clolov is right, I tested it both in quorum and zk mode, Kafka successfully reconciles the questionable case (when X-1 on B comes back after A has compacted the logs), so I think it's fine to merge in this PR.
>
> I was also thinking of creating some integration test for this but it's hard to simulate disk errors in Java and we can't have any assumptions about where the tests run, so I think that should be a separate task as it's out of scope for this one. If you folks know a good fault injection framework, I'm all ears.
>
> Did we confirm that if B comes back it is cleaned and or resumes correctly?

@jolshan Sorry it took me so long to get back to this PR. I have verified this recently, and there are two cases:

1. When B restarts and resumes fetching, if the fetch offset is not out of range (i.e. recovery was timely and A has not yet deleted so many logs that its log start offset exceeds B's fetch offset), B simply continues fetching from its original fetch offset.
2. When B restarts and resumes fetching, if the fetch offset is out of range (B's fetch may not resume until long after A resumed cleaning, by which time A's log start offset is greater than B's fetch offset), B performs a `truncateFully`, deletes all the logs it previously fetched, and restarts from A's log start offset.

So I think it behaves as expected.
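To make the two recovery cases concrete, here is a minimal standalone Scala sketch. All names are hypothetical; this is not Kafka's actual fetcher code, only the decision as described above.

```scala
// Standalone sketch of the two recovery cases above (hypothetical names).
object ResumeFetchSketch {

  final case class FollowerState(fetchOffset: Long)

  /** What the future replica B does when its fetcher resumes against A. */
  def resumeFetch(follower: FollowerState,
                  leaderLogStartOffset: Long,
                  leaderLogEndOffset: Long): String = {
    if (follower.fetchOffset >= leaderLogStartOffset && follower.fetchOffset <= leaderLogEndOffset)
      // Case 1: A still has B's fetch offset, so B continues where it stopped.
      s"continue fetching from offset ${follower.fetchOffset}"
    else
      // Case 2: A has already deleted past B's offset (out of range), so B
      // truncates fully, drops everything fetched so far, and restarts from
      // A's log start offset.
      s"truncate fully and restart from offset $leaderLogStartOffset"
  }

  def main(args: Array[String]): Unit = {
    println(resumeFetch(FollowerState(100L), leaderLogStartOffset = 50L, leaderLogEndOffset = 500L))  // case 1
    println(resumeFetch(FollowerState(100L), leaderLogStartOffset = 300L, leaderLogEndOffset = 500L)) // case 2
  }
}
```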
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1706059124

@viktorsomogyi Sorry, I have been focusing on other PRs recently. I looked at Trogdor and found Kibosh, a disk fault injection tool. To simulate the scenario described in this PR, I did the following:

1. First follow this link (https://github.com/confluentinc/kibosh) to install Kibosh.
2. Then submit an inter-disk replica reassignment task; for example, in figure 1, partition 0 of test_alter_log is migrated from /disk12 to /disk2.
3. Inject a fault into the Kafka log directory configured on /disk2: ./kibosh /data2/kafka/logs/kafka_trunk_test --target /data2/kafka/logs/kafka_trunk_test_mirror -o nonempty
4. Wait a while and verify that the trunk version of Kafka does not resume cleaning the log on /disk12, as shown in figures 2 and 3.
5. Finally, switch to the fixed version and repeat the above. After the fault is injected into /disk2, the log on /disk12 continues to be cleaned, as shown in figure 4. The log produced when injecting the fault into /disk2 is shown in figure 5.

figures:
figure1: ![SeaTalk_IMG_20230905_111732](https://github.com/apache/kafka/assets/16536770/4810fa4e-5832-4f5a-b091-b9f05c10cb48)
figure2: ![SeaTalk_IMG_20230905_111741](https://github.com/apache/kafka/assets/16536770/fa672037-c252-4a2b-b283-64b7e6d8e335)
figure3: ![SeaTalk_IMG_20230905_111747](https://github.com/apache/kafka/assets/16536770/df97dde3-6f4d-49e7-b451-d4221dc661e3)
figure4: ![SeaTalk_IMG_20230905_141229](https://github.com/apache/kafka/assets/16536770/5f4dd7d4-e912-4604-a52f-96619111de13)
figure5: ![SeaTalk_IMG_20230905_150641](https://github.com/apache/kafka/assets/16536770/d9437c97-cad4-433a-982b-fff73782aae9)
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1595613347

> Sorry I'm late to the party but I had a question on this:
>
> > It seems like @clolov is right, I tested it both in quorum and zk mode, Kafka successfully reconciles the questionable case (when X-1 on B comes back after A has compacted the logs), so I think it's fine to merge in this PR.
>
> I was also thinking of creating some integration test for this but it's hard to simulate disk errors in Java and we can't have any assumptions about where the tests run, so I think that should be a separate task as it's out of scope for this one. If you folks know a good fault injection framework, I'm all ears.
>
> Did we confirm that if B comes back it is cleaned and or resumes correctly?

Yes, I found this issue in production. I applied this PR to our production code and it solved the problem very well. However, the production issue involved the delete policy; I'll test the compact policy next week. @jolshan
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1595608359

> @viktorsomogyi Trogdor has some fault injection capabilities. I'm not sure if disk errors are among them, but it could probably be added. See https://github.com/apache/kafka/tree/trunk/trogdor/src/main/java/org/apache/kafka/trogdor/fault for the types of faults we currently support.

Thanks @mumrah, I'm going to look into Trogdor to see how to use it. Also, if we add integration tests, should they go into this PR, or should I open another one? @viktorsomogyi
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1594893813

Thanks for your comment. I have merged the latest trunk to see whether it passes the CI checks. @viktorsomogyi
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1585730961

> So I have some context with the replica fetcher area (mostly by reading and debugging), I hope I can help.
>
> First, since the conversation is a bit long, let me summarize what I understand:
>
> * The problem is disk A reaches its capacity limits
> * The solution is to move partition X-1 to disk B
> * During the reassignment, log cleaning is disabled on X-1 (which can therefore fill disk A)
> * The reassignment of X-1 fails, it is left failed there on B and X-1 on A keeps growing
>
> Is this correct?
>
> If it is, we may need to separate the deletion and compaction cases. I think resuming deletion is safe, however resuming compaction might not be, since compaction alters the log. If an operator somehow resumes B and lets replication continue, then the history of X-1 in A and B might be different (I'm still working on a local test case that reproduces this). What do you think?

Thank you for your review, @viktorsomogyi. Your summary and understanding are correct. Thinking it through, distinguishing between the compact and delete cases seems logically sound: with compaction enabled (regardless of whether delete is also enabled), if we resume cleaning on A after B has failed and B later resumes replicating, A and B can indeed end up inconsistent (I don't know whether you have reproduced this on your side). One thought: for compact, when we resume cleaning on A, would it be acceptable to clean B completely (clear all of B's segments and start over, as sketched below)? Or is there a better way?
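For illustration, a minimal standalone sketch of the "clean B completely and start over" idea. The file layout and names here are hypothetical; this is not how Kafka actually removes a future replica.

```scala
import java.io.File

// Standalone sketch (hypothetical names) of wiping the future replica B so
// that it re-replicates from A's current (already compacted) state and the
// two histories cannot diverge.
object WipeFutureReplicaSketch {

  /** Delete every segment/index file under the future replica's directory. */
  def truncateFullyAndStartOver(futureReplicaDir: File): Unit = {
    val segmentSuffixes = Seq(".log", ".index", ".timeindex")
    Option(futureReplicaDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => segmentSuffixes.exists(suffix => f.getName.endsWith(suffix)))
      .foreach(_.delete()) // afterwards B fetches again from A's log start offset
  }
}
```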
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1578890756

> @hudeqi I added myself as a reviewer, I may not have time to review this today but will get to it this week.

OK, thanks for your time.
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1578277690

Hello, would you have time to help review this PR? @mimaison
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1572025066

> No, you are correct, I had another look at the neighbouring code after I wrote my previous comment and I agree that this appears to not cause any leadership movements. Okay, I am happy to give my approval. Can you rebase on the latest trunk so test can be reran?

Hi, I have merged the latest trunk into the PR branch and re-run the unit tests. The remaining failures appear to be unrelated to this change. @clolov
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1569908827

> Apologies about the delay in reviewing this. I need another week to look into this part of the code base.

Ping again @divijvaidya
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1569882954

> Okay, in principle I agree that cleaning should be restarted, but I do not understand why you want to mark the topicPartition as failed
>
> ```
> override def markPartitionFailed(topicPartition: TopicPartition): Unit = {
>   super.markPartitionFailed(topicPartition) <- I DO NOT UNDERSTAND THE PURPOSE OF THIS
>   info(s"For ReplicaAlterLogDirsThread, may also need to resume log cleaner for partition $topicPartition")
>
>   replicaMgr.logManager.resumeCleaning(topicPartition)
> }
> ```
>
> My overarching question is that if we mark a partition as failed regardless of whether it or its ReplicaAlterLogDirsThread fails does this not mean that leadership will change and no traffic will be served by the replica on this broker anyway?

Hi, the `markPartitionFailed` method in ReplicaAlterLogDirsThread overrides the one in AbstractFetcherThread. In other words, even without this PR, a KafkaStorageException, an unexpected error, or certain fenced errors would still trigger `AbstractFetcherThread.markPartitionFailed`. From a quick look, that base method does only two things: it adds the partition to `failedPartitions` (which seems to be used only for computing a metric) and removes the partition from `partitionStates`, so the partition is no longer fetched by the ReplicaAlterLogDirsThread. So this PR only adds the resumeCleaning step to `markPartitionFailed` in ReplicaAlterLogDirsThread; it does not introduce any new markPartitionFailed logic. Moreover, marking a partition as failed has no other "side effects" of the kind you describe, beyond removing that partition's fetch from the ReplicaAlterLogDirsThread. This is my understanding; if anything is wrong, please correct me. Thank you! @clolov
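Paraphrasing the explanation above as a standalone Scala sketch (simplified stand-ins, not Kafka's real classes): the base method has exactly two effects, and the override in this PR adds one extra step.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the real Kafka classes.
final case class TopicPartitionSketch(topic: String, partition: Int)

class AbstractFetcherThreadSketch {
  protected val failedPartitions = mutable.Set.empty[TopicPartitionSketch] // only feeds a metric
  protected val partitionStates  = mutable.Set.empty[TopicPartitionSketch] // partitions still being fetched

  def markPartitionFailed(tp: TopicPartitionSketch): Unit = {
    failedPartitions += tp // effect 1: counted by the failed-partitions metric
    partitionStates  -= tp // effect 2: this thread stops fetching the partition
  }
}

class ReplicaAlterLogDirsThreadSketch(resumeCleaning: TopicPartitionSketch => Unit)
    extends AbstractFetcherThreadSketch {
  override def markPartitionFailed(tp: TopicPartitionSketch): Unit = {
    super.markPartitionFailed(tp) // same two effects as the base class, nothing more
    resumeCleaning(tp)            // the fix: let the source replica A clean again
  }
}
```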
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1563021307

> Heya @hudeqi, could you give more detailed explanation on what the problem you are trying to solve here because I do not understand? May I suggest you distinguish between a partition replica and a partition future replica in some way, because otherwise it is quite difficult to understand which replica you are referring to when it is just called "partition"?
>
> Let's say we have an original replica in log directory (backed by one disk) A, and let's say we have a future replica in log directory (backed by another disk) B on the same broker. I have confirmed that it is the case that **compaction** is paused on A when B is first created (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L910). It is only resumed (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L1099). I can think of two situations happening now.
>
> 1. A fails and B doesn't know what to do.
>    In this situation, if A has failed then cleaning should not be resumed on A - an operator intervention is required to understand what went wrong with the log directory. Cleaning should not be started on B either. Since A has failed the amount of data that B has does not grow because it doesn't have a source to keep copying from.
> 2. B fails and A doesn't know what to do.
>    In this situation B's size cannot grow because it is no longer copying from A. A should resume cleaning as for all intents and purposes it acts as a normal replica to a topic partition.
>
> As far as I understand what you try to do is solve situation number 2 - am I correct? If I am correct, what is the reasoning behind marking A as a failed partition?

Hi @clolov, thank you for your review! I'm sorry I didn't express the issue clearly. Your scenario and description are exactly right: the problem I want to solve is indeed situation number 2. In terms of your scenario: assume B fails due to a disk problem. Under the current logic, the partition is simply marked as failed in the corresponding ReplicaAlterLogDirsThread, and nothing more happens. However, A, the replica that is still serving normally, does not know that B has failed and remains in the clean-paused state, so the disk holding A will eventually fill up. That is the problem I want to solve, and my fix is to resume cleaning of the partition when B fails, so that A can continue to clean up.
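The lifecycle described above, reduced to a minimal standalone sketch. The names are hypothetical; in the real broker the pause/resume state lives in LogManager/LogCleaner.

```scala
// Hypothetical stand-in for the cleaner's pause/resume bookkeeping.
class CleanerStateSketch {
  private var cleanPaused = Set.empty[String]
  def pauseCleaning(tp: String): Unit  = cleanPaused += tp
  def resumeCleaning(tp: String): Unit = cleanPaused -= tp
  def canClean(tp: String): Boolean    = !cleanPaused.contains(tp)
}

object AlterLogDirLifecycleSketch {
  def main(args: Array[String]): Unit = {
    val cleaner = new CleanerStateSketch
    cleaner.pauseCleaning("X-1")     // future replica created on B: cleaning on A is paused
    assert(!cleaner.canClean("X-1")) // from here on, A's log can only grow

    // Without the fix, B failing never reaches this line, so A stays paused
    // and its disk fills up. With the fix, markPartitionFailed calls it:
    cleaner.resumeCleaning("X-1")
    assert(cleaner.canClean("X-1"))
  }
}
```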
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1535663733

There are indeed ways to recover from this issue, such as directly deleting the future log from the disk and restarting the corresponding broker. That said, if the failed source partition does not resume deletion immediately when it is marked as failed, should we add a monitoring metric for this situation so it can be detected promptly (see the sketch below)? @divijvaidya @chia7712
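A dependency-free sketch of the kind of metric being suggested. The metric name and the helper are purely hypothetical; no such metric exists in Kafka today.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical gauge over partitions whose log-dir move has failed.
object FailedLogDirMoveMetricSketch {
  private val failedMoves = ConcurrentHashMap.newKeySet[String]()

  def markFailed(topicPartition: String): Unit  = failedMoves.add(topicPartition)
  def clearFailed(topicPartition: String): Unit = failedMoves.remove(topicPartition)

  // Exported as, say, "FailedFutureReplicaCount": a non-zero value warns the
  // operator that a source replica may be stuck in the clean-paused state.
  def failedFutureReplicaCount: Int = failedMoves.size
}
```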
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1488811902

> Hi, thanks for your attention to this issue. I think there is a difference between this problem and the log-directory scenario you mentioned: this happens when a partition is being rebalanced between disks of the same broker. Due to a problem with the target disk, the partition may be marked as failed, which in turn means the corresponding partition logs on the source disk are never cleaned up. When this happens, the failed task is hard to notice unless the reassignment task is checked carefully. If it is not handled, the usage of the otherwise healthy source disk keeps growing (of course, once we do notice it, we can restart the broker to trigger the LeaderAndIsrRequest handling and re-add the fetcher, or delete the partition's future directory on the target disk). I think this is rather unreasonable.

@divijvaidya
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1488179126

Note that when UnifiedLog.append throws an IOException, it is caught by the outer layer and re-thrown as a KafkaStorageException. Therefore, if there is a disk error in the target dir, `processPartitionData` still throws a KafkaStorageException, causing the partition to be marked as failed, which in turn lets the space used on the source disk grow without bound (see the sketch below).
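A standalone sketch of the wrapping behaviour described above, with hypothetical names: an IOException thrown during the append is caught by an outer layer and re-thrown as a storage exception, which is what the fetcher thread's error handling then sees before marking the partition as failed.

```scala
import java.io.IOException

// Hypothetical stand-in for KafkaStorageException.
class StorageExceptionSketch(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

object StorageErrorWrappingSketch {
  def maybeHandleIOException[T](errorMsg: => String)(fn: => T): T =
    try fn
    catch {
      case e: IOException =>
        // In the real broker the log dir would also be marked offline here.
        throw new StorageExceptionSketch(errorMsg, e)
    }

  def main(args: Array[String]): Unit = {
    try maybeHandleIOException("append to the future log failed") {
      throw new IOException("simulated disk error on the target dir")
    } catch {
      case e: StorageExceptionSketch => println(s"caught: ${e.getMessage}")
    }
  }
}
```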
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1486291886

Hello, would you be interested in taking a look at this issue? I think it's a serious bug. @hachikuji @mumrah
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1484043884

> Hello, for "potential exceptions", I did an experiment to simulate a disk failure, which eventually led to unexpected disk growth. For details, please refer to the corresponding comment in [jira](https://issues.apache.org/jira/browse/KAFKA-14824). @chia7712

Hello, are you still following this issue? @chia7712
[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
hudeqi commented on PR #13421: URL: https://github.com/apache/kafka/pull/13421#issuecomment-1480536455

Hello, would you be interested in taking a look at this issue? I think it's a serious bug. @showuon @dengziming