ninsmiracle opened a new pull request, #1590:
URL: https://github.com/apache/incubator-pegasus/pull/1590

   ### What problem does this PR solve?
   #1589 
   
   ### What is changed and how does it work?
   Two conflicts may occur when duplication and balance are enabled at the
   same time:
   1. Load (one stage of duplication) conflicts with replica close.
     Add an atomic variable and, when the replica is being closed, check
     whether it is still loading logs.
     So how does the load end so that the replica close can continue? There
     are two ways this can happen:
   - duplication steps to the next stage (shipping)
   - duplication runs **update_duplication_map**, which calls
     `remove_all_duplications` when the replica loses its primary identity.
   
   In `remove_all_duplications`, duplication uses a map named `_replica` to
   check the replica identity, so I protect it while the replica is closing.
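
   The atomic check from item 1 could look roughly like the sketch below. The
   names `replica_sketch`, `load_guard`, and `try_close` are hypothetical and
   are not taken from the Pegasus code; only the idea (an atomic marker that
   the close path consults before tearing the replica down) comes from the
   description above.

```cpp
#include <atomic>

// Hypothetical stand-in for the replica; NOT the real Pegasus class.
class replica_sketch {
public:
    // RAII helper: marks "log load in progress" for the lifetime of the guard.
    class load_guard {
    public:
        explicit load_guard(replica_sketch &r) : _r(r) { _r._loading.fetch_add(1); }
        ~load_guard() { _r._loading.fetch_sub(1); }
        load_guard(const load_guard &) = delete;
        load_guard &operator=(const load_guard &) = delete;

    private:
        replica_sketch &_r;
    };

    // Returns true if the replica was closed, false if the close has to be
    // retried later because a duplication load is still running.
    bool try_close() {
        if (_loading.load() > 0) {
            return false;
        }
        _closed.store(true);
        return true;
    }

    bool closed() const { return _closed.load(); }

private:
    std::atomic<int> _loading{0};
    std::atomic<bool> _closed{false};
};
```

   A load stage would hold a `load_guard` while it reads logs, and the close
   path would call `try_close()`. The sketch leaves out one thing a real
   implementation needs: preventing a new load from starting between the check
   and the close.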
   
   2. GC of a useless replica conflicts with replica close (one replica is
     closed twice).
     After a replica connects to meta, meta requests config from the replica
     every 500 ms, and the replica server updates its own config when meta
     replies. In this logic, `on_node_query_reply_scatter2` GCs useless
     replicas (those whose status is PS_INACTIVE); to be precise, it sets the
     status from PS_INACTIVE to PS_ERROR.
     When a replica runs `update_local_configuration`, it executes the close
     action if the status changed and the new status is PS_INACTIVE or
     PS_ERROR.
     To deal with this, before executing a replica close I check whether the
     replica is already closed or is being closed.
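
   The "already closed or closing" check from item 2 can be illustrated with a
   small close-once guard. This is a sketch under assumptions: the
   `close_state` enum and the `close_once_sketch` class are made up for
   illustration and do not reflect how the PR actually stores this state.

```cpp
#include <atomic>

// Hypothetical lifecycle states; the real replica tracks more than this.
enum class close_state { OPEN, CLOSING, CLOSED };

class close_once_sketch {
public:
    // Returns true only for the first caller. Later callers find the replica
    // already closing or closed and must not enqueue a second close.
    bool begin_close() {
        close_state expected = close_state::OPEN;
        return _state.compare_exchange_strong(expected, close_state::CLOSING);
    }

    // Called once the close task has actually finished.
    void finish_close() { _state.store(close_state::CLOSED); }

    close_state state() const { return _state.load(); }

private:
    std::atomic<close_state> _state{close_state::OPEN};
};
```

   With such a guard, both the PS_INACTIVE path and the later PS_ERROR path
   may ask for a close, but only the first request is scheduled.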
   
   
   ##### Tests
   
   - Cluster test (mentioned in issue #1589)
   
   
   ### In summary:
     The earliest zlock coredump occurred because the 'dup' load stage needs
     to acquire a lock that is a member variable of the 'replica' class.
     However, at this point the 'replica' had already been closed and the
     related members had been destructed, so zlock got an invalid `_lock`
     value.
   
     The replica was closed because of the logic where meta requests
     configuration from the 'replica', which triggers a 'replica gc' process
     that sets the replica itself to 'PS_ERROR'. Based on the existing logic,
     both 'PS_INACTIVE' and 'PS_ERROR' trigger the 'begin_close' logic:
     'inactive' is closed after a 10-minute delay and 'error' is closed
     immediately.
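
     The status-to-delay rule in the paragraph above can be written down as a
     small function. This is only a sketch: `partition_status_sketch`,
     `close_plan`, and `plan_close` are hypothetical names, and the enum lists
     just enough statuses to show the rule.

```cpp
#include <chrono>

// Hypothetical subset of partition statuses, for illustration only.
enum class partition_status_sketch { PS_INACTIVE, PS_ERROR, PS_PRIMARY, PS_SECONDARY };

struct close_plan {
    bool should_close;          // whether this status triggers a close at all
    std::chrono::seconds delay; // how long to wait before closing
};

// Maps a new status to the close behaviour described in the text:
// inactive closes after 10 minutes, error closes immediately.
inline close_plan plan_close(partition_status_sketch s) {
    switch (s) {
    case partition_status_sketch::PS_INACTIVE:
        return {true, std::chrono::minutes(10)};
    case partition_status_sketch::PS_ERROR:
        return {true, std::chrono::seconds(0)};
    default:
        return {false, std::chrono::seconds(0)};
    }
}
```

     Because the two statuses use different delays, a replica that turns
     inactive and is then marked as error can end up with two pending closes,
     which is the double close this PR guards against.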
   
     The fix adds a double safeguard. First, when the close operation is
     enqueued, it checks whether the replica has already been closed or is in
     the process of closing. Second, when the close operation executes and a
     'dup load' is still in progress, the close is put back into the task
     queue with a delay of one minute.
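
     The second safeguard (retry the close one minute later while a 'dup load'
     is running) is sketched below with a tiny single-threaded queue driven by
     a virtual clock. Everything here (`delayed_queue_sketch`,
     `closing_replica_sketch`, `close_or_retry`) is hypothetical; the real code
     uses the Pegasus task framework, which is not modelled.

```cpp
#include <chrono>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Minimal delayed task queue with a virtual clock (illustration only).
class delayed_queue_sketch {
public:
    void enqueue(std::chrono::seconds delay, std::function<void()> fn) {
        _tasks.push(task{_now + delay, _seq++, std::move(fn)});
    }

    // Advances the virtual clock to `until`, running every task due by then.
    void run_until(std::chrono::seconds until) {
        while (!_tasks.empty() && _tasks.top().due <= until) {
            task t = _tasks.top();
            _tasks.pop();
            _now = t.due; // tasks observe their own due time as "now"
            t.fn();
        }
        _now = until;
    }

private:
    struct task {
        std::chrono::seconds due;
        unsigned long seq; // keeps FIFO order among tasks with equal due time
        std::function<void()> fn;
        bool operator>(const task &o) const {
            return due != o.due ? due > o.due : seq > o.seq;
        }
    };

    std::priority_queue<task, std::vector<task>, std::greater<task>> _tasks;
    std::chrono::seconds _now{0};
    unsigned long _seq = 0;
};

// Hypothetical replica state relevant to closing.
struct closing_replica_sketch {
    bool dup_loading = false;
    bool closed = false;
    int attempts = 0;
};

// Tries to close; while a dup load is running, re-enqueues itself with a
// one-minute delay instead of destroying state the loader still uses.
inline void close_or_retry(delayed_queue_sketch &q, closing_replica_sketch &r) {
    ++r.attempts;
    if (r.dup_loading) {
        q.enqueue(std::chrono::minutes(1), [&q, &r] { close_or_retry(q, r); });
        return;
    }
    r.closed = true;
}
```

     Retrying instead of blocking keeps the thread that runs the close task
     free while the load finishes or the duplication is removed.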
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

