lengyuexuexuan opened a new pull request, #2243:
URL: https://github.com/apache/incubator-pegasus/pull/2243

   ### What problem does this PR solve? <!--add issue link with summary if 
exists-->
   #2211 
   
   ### What is changed and how does it work?
   1. Background   
       During duplicationof a table, if the commands dup remove/pause are 
executed or a balance operation is performed at the same time, there is a 
chance that a node may core dump with signal ID 11. The core dump locations 
vary, but they all have one thing in common: they occur during memory 
allocation or deallocation.
   2. Analysis 
       Based on extensive testing, the following conclusions can be drawn:
       a. The issue only reproduces when there is write traffic. The difference 
between having and not having write traffic is: It adds the ship and 
load_private_log tasks.  
       b. The core dump occurs during the execution of cancel_all().
       c. The issue occurs with low probability (approximately 1 in 100).
      
       Through analysis using ASAN (AddressSanitizer): 
   
[dup_remove_asan.txt](https://github.com/user-attachments/files/19956043/dup_remove_asan.txt)
       Based on ASAN analysis, the following conclusions can be drawn:
       a. The memory corruption occurs during the ship process. The mutations 
obtained from replaying the plog are passed to ship, leading to the issue.
       b. _load_mutations is captured by a lambda expression and then passed to 
a std::function. Since std::move is used, the lifetime of _load_mutations is 
tied to that of the std::function.
       c. The cancel_all() function is executed in the default thread pool. At 
this point, the following function is called. When the std::function is set to 
nullptr, it will release the memory it 
manages.https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.h#L341
       d. However, each task executes exec_internal() in its own thread pool, 
and eventually calls release_ref(), which results in delete this. 
https://github.com/apache/incubator-pegasus/blob/e64faa7d03f836e5212874bbbfb8f900341897d7/src/task/task.cpp#L224
   
   3. Conclusion
   Both task.cancel() and task.exec_internal() destruct the std::function 
member inside the task object. These two operations are executed in different 
threads, and there is no mechanism in place to prevent race conditions between 
them. As a result, it is possible for both threads to attempt to destruct the 
same std::function, which can lead to a double deletion of the memory 
associated with _load_mutations. This ultimately causes memory corruption.
   
   4. Solution
   Replace cancel_all() with wait_all().
   
   
     
   ##### Tests <!-- At least one of them must be included. -->
   
   
   - Manual test 
   Based on 20 test runs, the issue was successfully avoided, confirming the 
effectiveness of the solution.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to