lengyuexuexuan opened a new pull request, #2243:
URL: https://github.com/apache/incubator-pegasus/pull/2243
### What problem does this PR solve? <!--add issue link with summary if
exists-->
#2211
### What is changed and how does it work?
1. Background
During duplicationof a table, if the commands dup remove/pause are
executed or a balance operation is performed at the same time, there is a
chance that a node may core dump with signal ID 11. The core dump locations
vary, but they all have one thing in common: they occur during memory
allocation or deallocation.
2. Analysis
Based on extensive testing, the following conclusions can be drawn:
a. The issue only reproduces when there is write traffic. The difference
between having and not having write traffic is: It adds the ship and
load_private_log tasks.
b. The core dump occurs during the execution of cancel_all().
c. The issue occurs with low probability (approximately 1 in 100).
Through analysis using ASAN (AddressSanitizer):
[dup_remove_asan.txt](https://github.com/user-attachments/files/19956043/dup_remove_asan.txt)
Based on ASAN analysis, the following conclusions can be drawn:
a. The memory corruption occurs during the ship process. The mutations
obtained from replaying the plog are passed to ship, leading to the issue.
b. _load_mutations is captured by a lambda expression and then passed to
a std::function. Since std::move is used, the lifetime of _load_mutations is
tied to that of the std::function.
c. The cancel_all() function is executed in the default thread pool. At
this point, the following function is called. When the std::function is set to
nullptr, it will release the memory it
manages.https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.h#L341
d. However, each task executes exec_internal() in its own thread pool,
and eventually calls release_ref(), which results in delete this.
https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.cpp#L224
3. Conclusion
Both task.cancel() and task.exec_internal() destruct the std::function
member inside the task object. These two operations are executed in different
threads, and there is no mechanism in place to prevent race conditions between
them. As a result, it is possible for both threads to attempt to destruct the
same std::function, which can lead to a double deletion of the memory
associated with _load_mutations. This ultimately causes memory corruption.
4. Solution
Replace cancel_all() with wait_all().
##### Tests <!-- At least one of them must be included. -->
- Manual test
Based on 20 test runs, the issue was successfully avoided, confirming the
effectiveness of the solution.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]