[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-09-11 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1713803121

   Hi, are you still following this PR? @viktorsomogyi @jolshan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-09-05 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1706100541

   > Sorry I'm late to the party but I had a question on this:
   > 
   > > It seems like @clolov is right, I tested it both in quorum and zk mode, Kafka successfully reconciles the questionable case (when X-1 on B comes back after A has compacted the logs), so I think it's fine to merge in this PR.
   > > I was also thinking of creating some integration test for this but it's hard to simulate disk errors in Java and we can't have any assumptions about where the tests run, so I think that should be a separate task as it's out of scope for this one. If you folks know a good fault injection framework, I'm all ears.
   > 
   > Did we confirm that if B comes back it is cleaned and or resumes correctly?
   
   @jolshan Sorry it took so long to pick this PR back up. I have verified this recently, and there are two cases:
   1. When B restarts and resumes fetching, if the fetch offset is not out of range (i.e. a timely recovery: A has not yet deleted so many logs that its log start offset exceeds B's fetch offset), B simply continues fetching from its original fetch offset. 
   2. When B restarts and resumes fetching, if the fetch offset is out of range (B's fetch may not be restored until long after A resumes cleaning, by which time A's log start offset is greater than B's fetch offset), B performs a `truncateFully`, restarts at A's log start offset, and deletes all the logs it previously fetched.
   So I think it behaves as expected.
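   To make the two cases above concrete, here is a minimal, purely illustrative Scala sketch of that decision. The names (`FutureLog`, `nextFetchOffset`) are hypothetical stand-ins, not the actual ReplicaAlterLogDirsThread code:

   ```scala
   // Hypothetical sketch of the two recovery cases described above, not Kafka's real code.
   object ResumeFetchSketch {
     trait FutureLog {
       def truncateFullyAndStartAt(offset: Long): Unit
     }

     def nextFetchOffset(fetchOffset: Long, sourceLogStartOffset: Long, futureLog: FutureLog): Long =
       if (fetchOffset >= sourceLogStartOffset) {
         // Case 1: timely recovery, A has not yet deleted past B's fetch offset,
         // so B keeps fetching from where it stopped.
         fetchOffset
       } else {
         // Case 2: out of range, B drops everything it copied so far and
         // restarts from A's current log start offset.
         futureLog.truncateFullyAndStartAt(sourceLogStartOffset)
         sourceLogStartOffset
       }
   }
   ```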





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-09-05 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1706059124

   @viktorsomogyi Sorry, I have been focusing on other PRs recently. I looked at trogdor and found kibosh, a disk fault injection tool. To simulate the scenario described in this PR, I did the following:
   1. First, follow this link (https://github.com/confluentinc/kibosh) to install kibosh.
   2. Then submit the inter-disk reassignment task; for example, in figure1, partition 0 of test_alter_log is migrated from /disk12 to /disk2.
   3. Perform fault injection on the Kafka log directory configured on /disk2: ./kibosh /data2/kafka/logs/kafka_trunk_test --target /data2/kafka/logs/kafka_trunk_test_mirror -o nonempty
   4. Wait for a while and verify that the trunk version of the Kafka service does not continue to clean the log on /disk12, as shown in figure2 and figure3.
   5. Finally, switch to the fixed version and perform the above steps again. After the fault is injected into /disk2, the log on /disk12 continues to be cleaned, as shown in figure4.
   
   The log produced when injecting a fault into /disk2 is shown in figure5.
   
   figures:
   
   figure1:
   ![SeaTalk_IMG_20230905_111732](https://github.com/apache/kafka/assets/16536770/4810fa4e-5832-4f5a-b091-b9f05c10cb48)
   figure2:
   ![SeaTalk_IMG_20230905_111741](https://github.com/apache/kafka/assets/16536770/fa672037-c252-4a2b-b283-64b7e6d8e335)
   figure3:
   ![SeaTalk_IMG_20230905_111747](https://github.com/apache/kafka/assets/16536770/df97dde3-6f4d-49e7-b451-d4221dc661e3)
   figure4:
   ![SeaTalk_IMG_20230905_141229](https://github.com/apache/kafka/assets/16536770/5f4dd7d4-e912-4604-a52f-96619111de13)
   figure5:
   ![SeaTalk_IMG_20230905_150641](https://github.com/apache/kafka/assets/16536770/d9437c97-cad4-433a-982b-fff73782aae9)
   
   
   
   





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-16 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1595613347

   > Sorry I'm late to the party but I had a question on this:
   > 
   > > It seems like @clolov is right, I tested it both in quorum and zk mode, Kafka successfully reconciles the questionable case (when X-1 on B comes back after A has compacted the logs), so I think it's fine to merge in this PR.
   > > I was also thinking of creating some integration test for this but it's hard to simulate disk errors in Java and we can't have any assumptions about where the tests run, so I think that should be a separate task as it's out of scope for this one. If you folks know a good fault injection framework, I'm all ears.
   > 
   > Did we confirm that if B comes back it is cleaned and or resumes correctly?
   
   Yes, I ran into this issue in production. 
   I applied this PR to our production code and found that it solved the problem very well. 
   However, the production issue involved the delete policy; for the compact policy, I'll test it next week. @jolshan 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-16 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1595608359

   > @viktorsomogyi Trogdor has some fault injection capabilities. I'm not sure if disk errors are among them, but it could probably be added. See https://github.com/apache/kafka/tree/trunk/trogdor/src/main/java/org/apache/kafka/trogdor/fault for the types of faults we currently support.
   
   Thanks @mumrah, I'm going to look into trogdor to see how to use it. 
   In addition, if we add integration tests, should they go into this PR or into a separate one? @viktorsomogyi 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-16 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1594893813

   Thanks for your comment. I have merged the latest trunk to see whether it can pass the CI check. @viktorsomogyi 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-10 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1585730961

   > So I have some context with the replica fetcher area (mostly by reading and debugging), I hope I can help.
   > 
   > First, since the conversation is a bit long, let me summarize what I understand:
   > 
   > * The problem is disk A reaches its capacity limits
   > * The solution is to move partition X-1 to disk B
   > * During the reassignment, log cleaning is disabled on X-1 (which can therefore fill disk A)
   > * The reassignment of X-1 fails, it is left failed there on B and X-1 on A keeps growing
   >   Is this correct?
   > 
   > If it is, we may need to separate the deletion and compaction cases. I think resuming deletion is safe, however resuming compaction might not be, since compaction alters the log. If an operator somehow resumes B and lets replication continue, then the history of X-1 in A and B might be different (I'm still working on a local test case that reproduces this). What do you think?
   
   Thank you for your review, @viktorsomogyi. 
   Your summary and understanding are correct. Thinking it over carefully, distinguishing between compact and delete handling seems very reasonable logically: when compaction is enabled (regardless of whether delete is also enabled), if B fails and we resume cleaning on A, then once B resumes replicating, A and B can easily become inconsistent (I don't know whether you have managed to reproduce this on your side). 
   One thought: for the compact case, when we resume cleaning on A, would it be acceptable to clean B completely (clear all of B's segments and start over)? I don't know if there's a better way.





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-06 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1578890756

   > @hudeqi I added myself as a reviewer, I may not have time to review this today but will get to it this week.
   
   OK, thanks for your time.





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-06 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1578277690

   Hello, are you free to help review this PR? @mimaison 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-06-01 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1572025066

   > No, you are correct, I had another look at the neighbouring code after I wrote my previous comment and I agree that this appears to not cause any leadership movements. Okay, I am happy to give my approval. Can you rebase on the latest trunk so test can be reran?
   
   Hi, I have merged the latest trunk into the PR branch and re-run the unit tests. The failures appear to be unrelated to this change. @clolov 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-05-31 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1569908827

   > Apologies about the delay in reviewing this. I need another week to look into this part of the code base.
   
   ping again @divijvaidya 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-05-31 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1569882954

   > Okay, in principle I agree that cleaning should be restarted, but I do not understand why you want to mark the topicPartition as failed
   > 
   > ```
   > override def markPartitionFailed(topicPartition: TopicPartition): Unit = {
   >   super.markPartitionFailed(topicPartition) <- I DO NOT UNDERSTAND THE PURPOSE OF THIS
   >   info(s"For ReplicaAlterLogDirsThread, may also need to resume log cleaner for partition $topicPartition")
   > 
   >   replicaMgr.logManager.resumeCleaning(topicPartition)
   > }
   > ```
   > 
   > My overarching question is that if we mark a partition as failed regardless of whether it or its ReplicaAlterLogDirsThread fails does this not mean that leadership will change and no traffic will be served by the replica on this broker anyway?
   
   Hi, the `markPartitionFailed` method that ReplicaAlterLogDirsThread overrides is defined in AbstractFetcherThread. That is to say, even without this PR, when a KafkaStorageException, an unexpected error, or certain fenced errors is encountered, `markPartitionFailed` of AbstractFetcherThread is still executed. From what I can see, it only does two things:
   1. Add the partition to failedPartitions (which seems to be used only for computing a metric).
   2. Remove the partition from partitionStates, so that the partition is no longer fetched by the ReplicaAlterLogDirsThread.
   So this PR only adds resumeCleaning logic to `markPartitionFailed` in ReplicaAlterLogDirsThread; it does not add any extra `markPartitionFailed` logic. 
   In addition, when `markPartitionFailed` is executed for a partition, there are no other "side effects" of the kind you mention beyond removing the partition's fetch from the ReplicaAlterLogDirsThread. This is my understanding; if there is any mistake, please correct me. Thank you! @clolov 
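   For illustration, here is a hedged, self-contained Scala sketch of the behaviour described above (simplified, hypothetical types; not the actual Kafka classes):

   ```scala
   // Hypothetical sketch: what the base markPartitionFailed does, plus the override this PR adds.
   object MarkPartitionFailedSketch {
     final case class TopicPartition(topic: String, partition: Int)

     class AbstractFetcherThreadSketch {
       protected val failedPartitions = scala.collection.mutable.Set.empty[TopicPartition]
       protected val partitionStates  = scala.collection.mutable.Map.empty[TopicPartition, Long]

       // Base behaviour: record the failure (used only for a metric) and stop fetching the partition.
       def markPartitionFailed(tp: TopicPartition): Unit = {
         failedPartitions += tp
         partitionStates -= tp
       }
     }

     // The PR's addition: when the future replica's fetch fails, also resume log cleaning
     // on the source replica so its disk does not keep growing.
     class ReplicaAlterLogDirsThreadSketch(resumeCleaning: TopicPartition => Unit)
         extends AbstractFetcherThreadSketch {
       override def markPartitionFailed(tp: TopicPartition): Unit = {
         super.markPartitionFailed(tp)
         resumeCleaning(tp)
       }
     }
   }
   ```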





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-05-25 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1563021307

   > Heya @hudeqi, could you give more detailed explanation on what the problem you are trying to solve here because I do not understand? May I suggest you distinguish between a partition replica and a partition future replica in some way, because otherwise it is quite difficult to understand which replica you are referring to when it is just called "partition"?
   > 
   > Let's say we have an original replica in log directory (backed by one disk) A, and let's say we have a future replica in log directory (backed by another disk) B on the same broker. I have confirmed that it is the case that **compaction** is paused on A when B is first created (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L910). It is only resumed (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L1099). I can think of two situations happening now.
   > 
   > 1. A fails and B doesn't know what to do.
   >    In this situation, if A has failed then cleaning should not be resumed on A - an operator intervention is required to understand what went wrong with the log directory. Cleaning should not be started on B either. Since A has failed the amount of data that B has does not grow because it doesn't have a source to keep copying from.
   > 2. B fails and A doesn't know what to do.
   >    In this situation B's size cannot grow because it is no longer copying from A. A should resume cleaning as for all intents and purposes it acts as a normal replica to a topic partition.
   > 
   > As far as I understand what you try to do is solve situation number 2 - am I correct? If I am correct, what is the reasoning behind marking A as a failed partition?
   
   Hi @clolov, thank you for your review! I'm sorry I didn't express the issue clearly. Your scenario and description are exactly right, and the problem I want to solve is indeed situation number 2. Describing it in terms of your scenario:
   
   Suppose B fails due to a disk problem. With the current logic, the partition is simply marked as failed in the corresponding ReplicaAlterLogDirsThread, and nothing more happens. However, A, the replica that is still serving normally, does not know that B has failed and remains in a clean-paused state, which will eventually fill up the disk where A lives. That is the problem I want to solve, and my solution is to resume cleaning for the partition when B fails, so that A can continue to be cleaned. @clolov 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-05-04 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1535663733

   There are indeed recovery methods for this issue, such as deleting the future log from the disk directly and restarting the corresponding broker. So, if the failed source partition does not immediately resume deletion when it is marked as failed, should we add a monitoring metric for this situation so it can be detected in time? @divijvaidya @chia7712 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-03-29 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1488811902

   > 
   
   Hi, thanks for your attention to this issue. I think this problem is different from the LogDirectory scenario you mentioned: here, while a partition is being rebalanced between disks of the same broker, a problem with the target disk can cause the partition to be marked as failed, which in turn means the logs of the corresponding partition on the source disk are no longer cleaned up. When this happens, unless the reassignment tasks are checked carefully, the failed task is hard to notice, and if it is not handled, the usage of the healthy source disk will keep growing (of course, once we do notice it, we can restart the broker to trigger the leaderAndIsrRequest handling that re-adds the fetcher, or delete the partition's future directory on the target disk). I think this is rather unreasonable. @divijvaidya 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-03-29 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1488179126

   Note that when UnifiedLog.append throws an IOException, it is caught by the outer layer and re-thrown as a KafkaStorageException. Therefore, if there is a disk error in the target dir, `processPartitionData` will still throw a KafkaStorageException, causing the partition to be marked as failed, which in turn lets the usage of the source disk grow without bound.
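   As a rough illustration of that exception wrapping (simplified, hypothetical code, not the actual Kafka classes):

   ```scala
   // Hypothetical sketch: an IOException from the append path is rethrown as a
   // KafkaStorageException, which is what ends up marking the partition as failed.
   object StorageExceptionSketch {
     class KafkaStorageException(msg: String, cause: Throwable) extends RuntimeException(msg, cause)

     def appendToFutureLog(write: () => Unit, logDir: String): Unit =
       try write()
       catch {
         case e: java.io.IOException =>
           throw new KafkaStorageException(s"Disk error while appending to the future log in $logDir", e)
       }
   }
   ```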





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-03-27 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1486291886

   Hello, would you be interested in taking a look at this issue? I think it's a serious bug. @hachikuji @mumrah 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-03-26 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1484043884

   
   > Hello, for "potential exceptions", I did an experiment to simulate a disk failure, which eventually lead to the unexpected disk growing. For details, please refer to the corresponding comment in [jira](https://issues.apache.org/jira/browse/KAFKA-14824). @chia7712
   
   Hello, are you still following this issue? @chia7712 





[GitHub] [kafka] hudeqi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

2023-03-22 Thread via GitHub


hudeqi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1480536455

   Hello, would you be interested in taking a look at this issue? I think it's a serious bug. @showuon @dengziming 

