ninsmiracle opened a new pull request, #1590:
URL: https://github.com/apache/incubator-pegasus/pull/1590
### What problem does this PR solve?
#1589
### What is changed and how does it work?
Two conflicts may occur when duplication and balancing are both enabled:
1. Load (one stage of duplication) racing with replica close
Add an atomic flag and, when a replica is being closed, check whether it is
still loading logs.
How does the load end so that the replica close can continue? There are two
ways:
- duplication moves on to the next stage (shipping)
- wait for duplication to run **update_duplication_map**, which calls
`remove_all_duplications` once the replica loses its primary identity.
In `remove_all_duplications`, duplication uses a map named `_replica` to
check the replica's identity, so I protect it while the replica is closing.
2. GC of a useless replica racing with replica close (closing one replica twice)
After a replica server connects to meta, meta requests the configuration from
the replica server every 500ms, and the replica server updates its own
configuration when meta replies. In this logic, `on_node_query_reply_scatter2`
garbage-collects useless replicas (those whose status is PS_INACTIVE); to be
precise, it sets the status from PS_INACTIVE to PS_ERROR.
When a replica runs `update_local_configuration`, it executes the close action
if the status changed and the new status is PS_INACTIVE or PS_ERROR.
To deal with this, when a replica is about to be closed I check whether it is
already closed or is already being closed.
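For the first conflict above, the idea of an atomic flag that the close path consults can be sketched as follows. This is only an illustration: `replica_sketch`, `load_guard`, `dup_loading`, and `try_close` are hypothetical names chosen for this example and are not taken from the Pegasus code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a replica; not the real Pegasus class.
struct replica_sketch
{
    // True while a duplication load still uses this replica's members.
    std::atomic<bool> dup_loading{false};
    bool closed{false};

    // Returns true if the replica was closed, false if the caller must
    // retry later because a duplication load is still running.
    bool try_close()
    {
        if (dup_loading.load(std::memory_order_acquire)) {
            return false;
        }
        closed = true;
        return true;
    }
};

// RAII guard: marks the load stage as running for the guard's lifetime,
// so the flag is cleared even if the load exits early.
class load_guard
{
public:
    explicit load_guard(replica_sketch &r) : _r(r)
    {
        _r.dup_loading.store(true, std::memory_order_release);
    }
    ~load_guard() { _r.dup_loading.store(false, std::memory_order_release); }

    load_guard(const load_guard &) = delete;
    load_guard &operator=(const load_guard &) = delete;

private:
    replica_sketch &_r;
};
```

While a `load_guard` is alive, `try_close()` returns false and leaves the replica open; once the guard goes out of scope (for example when duplication moves on to shipping), the next `try_close()` succeeds.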
##### Tests
- Cluster test (mentioned in issue #1589)
### In summary:
The earliest zlock coredump happened because the 'dup' load stage needs to
acquire a lock that is a member variable of the 'replica' class. However, at
that point the 'replica' had already been closed and its members had been
destructed, so zlock read an invalid `_lock` value.
The root cause of the replica being closed actually stems from the logic
where the meta requests 'replica' for configuration, triggering a 'replica gc'
process that sets the replica itself to 'PS_ERROR'. Based on the existing
logic, both 'PS_INACTIVE' and 'PS_ERROR' trigger the 'begin_close' logic, with
'inactive' having a 10-minute delay and 'error' being executed immediately.
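The status-to-close mapping described here can be written down as a small function. This is a sketch under assumptions: `status_sketch` and `close_delay` are hypothetical names for illustration, `OTHER` stands in for every status that does not trigger a close, and only the two delays stated above come from the description.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Illustrative subset of statuses; OTHER is a placeholder, not a real value.
enum class status_sketch { PS_INACTIVE, PS_ERROR, OTHER };

// Returns the delay before the close runs, or std::nullopt when the
// status does not trigger a close at all.
std::optional<std::chrono::seconds> close_delay(status_sketch s)
{
    switch (s) {
    case status_sketch::PS_INACTIVE:
        // 'inactive' replicas are closed after a 10-minute delay.
        return std::chrono::seconds(std::chrono::minutes(10));
    case status_sketch::PS_ERROR:
        // 'error' replicas are closed immediately.
        return std::chrono::seconds(0);
    default:
        return std::nullopt;
    }
}
```

Because a replica can move from 'PS_INACTIVE' to 'PS_ERROR' while the delayed close is still pending, both branches can end up scheduling a close for the same replica, which is the double close this PR guards against.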
A double safeguard was implemented. First, when the close operation is
enqueued, it checks whether the replica has already been closed or is in the
process of closing. Second, if the close operation encounters a 'dup load'
while it executes, it is put back into the task queue with a delay of one
minute.
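The two safeguards can be modeled with a toy task queue. This is a minimal sketch, not the PR's implementation: `closable_replica`, `close_queue`, `enqueue_close`, and `run_one` are hypothetical names, and the delay is only recorded on the task rather than actually waited for.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <deque>

enum class close_state { OPEN, CLOSING, CLOSED };

// Hypothetical replica model with just the state the safeguards need.
struct closable_replica
{
    std::atomic<close_state> state{close_state::OPEN};
    std::atomic<bool> dup_loading{false};
};

struct close_task
{
    closable_replica *rep;
    std::chrono::seconds delay; // recorded only; this toy queue never sleeps
};

struct close_queue
{
    std::deque<close_task> tasks;

    // Safeguard 1: refuse to enqueue a close for a replica that is
    // already closing or closed, so it is never closed twice.
    bool enqueue_close(closable_replica &r)
    {
        close_state expected = close_state::OPEN;
        if (!r.state.compare_exchange_strong(expected, close_state::CLOSING)) {
            return false;
        }
        tasks.push_back({&r, std::chrono::seconds(0)});
        return true;
    }

    // Runs the task at the front of the queue.
    // Safeguard 2: if a duplication load is still running, put the task
    // back with a one-minute delay instead of closing.
    // Returns true only if a replica was actually closed.
    bool run_one()
    {
        if (tasks.empty()) {
            return false;
        }
        close_task t = tasks.front();
        tasks.pop_front();
        if (t.rep->dup_loading.load()) {
            tasks.push_back({t.rep, std::chrono::seconds(60)});
            return false;
        }
        t.rep->state.store(close_state::CLOSED);
        return true;
    }
};
```

The compare-and-swap in `enqueue_close` makes the "already closing or closed" check and the state change a single atomic step, so two concurrent callers cannot both enqueue a close for the same replica.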
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]