Hi Lennart, Thanks for your comments.
I think it is good idea to pick the lgs_state & lga_state out of lga_cb, and use the mutex cb_lock to protect changes in lga control block. I reflect these things in attached file. And to make the code a bit clean, I have defined set/get functions for lgs_state & lga_state. Also, I declare the mutex cb_lock as recursive one to avoid possible recursively lock in one thread. Could you have a look and give your comments? Regards, Vu. >-----Original Message----- >From: Lennart Lund [mailto:[email protected]] >Sent: Tuesday, April 05, 2016 6:59 PM >To: Vu Minh Nguyen >Cc: [email protected]; [email protected]; Lennart >Lund >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in >log agent [#1705] > >Hi Vu > >See my comments inline [Lennart] > >Note. My comments does not give the complete solution. For that I need to >use more time but may give you some ideas. > >/Lennart > >> -----Original Message----- >> From: Vu Minh Nguyen [mailto:[email protected]] >> Sent: den 5 april 2016 10:48 >> To: Lennart Lund >> Cc: [email protected]; [email protected] >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource >> in log agent [#1705] >> >> Hi Lennart, >> >> Thanks for your proposal. >> >> Still, there is race condition b/w mds thread and recover/client thread, I >> think. >> 1) Headless while recover thread 2 is running >> 2) Headless while the client thread is accessing/modifying to client list. >> >[Lennart] In the mds thread the lga_no_server_state_set() is called if server >down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped >and recovery state is set to state LGA_NO_SERVER. This means that we start all >over again with recovery state LGA_RECOVERY1 when the server comes back >again. >Within the thread the cb lock is taken while the client list is deleted if >needed. >With a new mutex for lga_state this mutex should also be used in all other >places where lga_state is handled not the cb_lock. This means that the name >"recovery2" mutex is not so good instead there should be a lga_state mutex. >The lga_state variable may also be moved outside the lga_cb structure. > >> I would like to propose other one, using only one mutex cb_lock @lga_cb >> to protect common `lga_cb` resource, including `state`, `client` list, etc. >> in >> three threads. >> >[Lennart] The cb_lock is used to protect variables in the cb structure but the >lga_state mutex prevent recovery or API operations from being done at the >same time. > >> 1) Client thread >> - Lock mutex before getting the state >> - Unlock right before the API operation is finished >> >[Lennart] Yes but the lga_state mutex > >> 2) Recovery2 thread >> - Lock mutex after timeout >> - Unlock before ending the thread >> >[Lennart] No this will cause the API to be locked as long as the recovery 2 >thread is running and this is not wanted. The API shall work also while the >recovery thread is running and most of the APIs shall return with >SA_AIS_TRY_AGAIN during recovery 2 state. > >> 3) MDS thread >> - Lock/unlock before/after changing state >> In case of headless: >> - Lock/unlock mutex before modifying `client_list` in >> lga_no_server_state_set() >> >[Lennart] lga_state is not changed directly in the mds thread but indirectly >when the lga_serv_recov1state_set() and lga_no_server_state_set() functions >are called. I these functions the recovery thread and state change is handled >and the mutex used shall be changed to the lga_state mutex. > >The client data should be protected by the cb_lock mutex in all places which is >not the case. There is no protection in the mds trhead in the >lga_lgs_msg_proc() where client data is used via the >ga_find_hdl_rec_by_regid() function. This function should lock the lga_cb lock >while handling the lga_cb client data. >Also there are two functions in lga_state.c that are handling client data >locking >a lga_recov_mutex. This does not protect client data handled in the mds >thread. > >> What is your opinion? Do you think there is impact to log client if they call >> LOG API in multi threads? >> >[Lennart] The mutex handling should be cleaned up. > - Recovery state should be protected by its own mutex so that APIs can work >in respective state without locking. > - Client data should be protected in all threads by the lga_cb lock. > - The lga_recov_mutex can probably be removed > >> Regards, Vu. >> >> >> >-----Original Message----- >> >From: Lennart Lund [mailto:[email protected]] >> >Sent: Monday, April 04, 2016 3:51 PM >> >To: Vu Minh Nguyen >> >Cc: [email protected]; [email protected]; >> Lennart >> >Lund >> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource >> in >> >log agent [#1705] >> > >> >Hi Vu, >> > >> >A simpler solution than described below, maybe: >> > >> >Still use a separate "recovery2" mutex but a "normal one". >> > - In the API, lock the mutex just before checking if we are in recovery >> > state >> 2 >> >(before "if (lga_state == LGA_RECOVERY2) {") >> > - In the API, do not unlock until the API operation is finished >> > - In the recovery thread lock the mutex just before changing the recovery >> >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2 >> mutex) >> >(protect the lga_state variable) >> >- In the recovery thread the mutex shall be unlocked just after lga_state is >> set >> >to recovery state 2. >> > >> >This means that the recovery thread cannot set recovery state2 or execute >> >until an eventual ongoing API operation is done. >> >If the thread starts and set recovery state just before the API checks >> recovery >> >state the API will correctly see that we are in recovery state 2 and handle >> thing >> >accordingly. >> > >> >Regards >> >Lennart >> > >> > >> >> -----Original Message----- >> >> From: Lennart Lund >> >> Sent: den 1 april 2016 16:51 >> >> To: Vu Minh Nguyen >> >> Cc: [email protected]; [email protected]; >> >> Lennart Lund >> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common >> resource >> >> in log agent [#1705] >> >> >> >> Hi Vu, >> >> >> >> You are right, good finding, this may happen since the recovery2 thread is >> >> started and is waiting for a timeout. If this timeout happen at the "wrong >> >> time" bad things may happen. >> >> However the actual problem should be solved. >> >> >> >> This is how it could be done (maybe you can find a better way?): >> >> When timeout the recovery thread shall set recovery state 2 as is but >> >> shall >> >> not start any recovery if there is any ongoing client activity. >> >> This could be solved by using a "recovery2" mutex (not the existing >> >> lga_cb.cb_lock). >> >> - The recovery thread lock this mutex after setting recovery state 2. >> >> - During recovery state 1 all APIs check if the mutex is taken and if so >> return >> >> TRY AGAIN (this means that we are in recovery state 2 but the state >> change >> >> may have happened after the check in the API). Note: "trylock" cannot be >> >> used for this purpose since a sequence trylock -> lock is not atomic. >> Maybe a >> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used. >> >> - If the mutex is not taken it shall be locked by the API function and be >> >> unlocked just before the API function returns preventing the recovery >> >> thread from starting recovery during an ongoing API operation. >> >> The next time an API is called recovery state 2 is set and the API >> >> function >> will >> >> operate accordingly. >> >> >> >> Thanks >> >> Lennart >> >> >> >> > -----Original Message----- >> >> > From: Vu Minh Nguyen [mailto:[email protected]] >> >> > Sent: den 1 april 2016 16:02 >> >> > To: Lennart Lund >> >> > Cc: [email protected]; [email protected] >> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >> >> resource >> >> > in log agent [#1705] >> >> > >> >> > Hi Lennart, >> >> > >> >> > Is it possible for following case happens? >> >> > >> >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG >> >> > service up, recovery thread starts and >> >> > comes to the point right after "while (pclient != NULL)" in >> >> > recovery2_thread @lga_state.c >> >> > Then, the context is switching to the client thread and it deletes the >> >> > resource while recovery2_thread is running? >> >> > >> >> > Regards, Vu. >> >> > >> >> > Quoting Lennart Lund <[email protected]>: >> >> > >> >> > > Hi Vu, >> >> > > >> >> > > /* Recover clients one at a time */ >> >> > > rc = recover_one_client(p_client); >> >> > > + if (p_client == NULL) goto done; >> >> > > + >> >> > > [Lennart] Why is this needed? This check is done in "while >> >> > > (p_client != NULL) { " >> >> > > and p_client cannot become NULL until "p_client = p_client->next; " >> >> > > is done as >> >> > > the last line in the while loop. >> >> > > >> >> > > [Vu] It is my intention. I am afraid if the context is switched to >> >> > > client thread >> >> > > right after the recovery thread enters the while loop. >> >> > > If it is the case, p_client will be NULL after done >> >> > > recover_one_client(). >> >> > > >> >> > > [Lennart] I don't really understand your explanation. What is >> >> > > happening in the client thread when we are in recovery state 2 that >> >> > > can set p_client == NULL? The client thread should not be able to do >> >> > > anything with the data involved when recovering client in the >> >> > > recovery thread? >> >> > > >> >> > > Regards >> >> > > Lennart >> >> > > >> >> > >> -----Original Message----- >> >> > >> From: Vu Minh Nguyen [mailto:[email protected]] >> >> > >> Sent: den 1 april 2016 14:14 >> >> > >> To: Lennart Lund >> >> > >> Cc: [email protected]; [email protected] >> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common >> >> > resource >> >> > >> in log agent [#1705] >> >> > >> >> >> > >> Hi Lennart, >> >> > >> >> >> > >> Please see my comment inline. >> >> > >> >> >> > >> Regards, Vu >> >> > >> >> >> > >> Quoting Lennart Lund <[email protected]>: >> >> > >> >> >> > >> > Hi Vu >> >> > >> > >> >> > >> > Ack with comments >> >> > >> > >> >> > >> > See my comments inline [Lennart]: >> >> > >> > >> >> > >> > Thanks >> >> > >> > Lennart >> >> > >> > >> >> > >> >> -----Original Message----- >> >> > >> >> From: Vu Minh Nguyen [mailto:[email protected]] >> >> > >> >> Sent: den 23 mars 2016 08:16 >> >> > >> >> To: [email protected]; Lennart Lund >> >> > >> >> Cc: [email protected] >> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common >> >> > resource >> >> > >> in >> >> > >> >> log agent [#1705] >> >> > >> >> >> >> > >> >> osaf/libs/agents/saf/lga/lga_state.c | 41 >> >> > >> >> +++++++++++++++++++++++++++++++---- >> >> > >> >> osaf/libs/agents/saf/lga/lga_state.h | 1 + >> >> > >> >> osaf/libs/agents/saf/lga/lga_util.c | 30 >> >> > >> >> +++++++++++------------- >> - >> >> > >> >> 3 files changed, 50 insertions(+), 22 deletions(-) >> >> > >> >> >> >> > >> >> >> >> > >> >> There was an race condition between client thread and recovery >> >> > thread >> >> > >> >> accessing to common resource `client_list`. >> >> > >> >> >> >> > >> >> Add mutex protection. >> >> > >> >> >> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c >> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c >> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c >> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c >> >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void >> *dumm >> >> > >> >> /* We have been signaled to exit >> >> > >> >> */ >> >> > >> >> goto done; >> >> > >> >> } >> >> > >> >> + >> >> > >> >> /* Recover clients one at a time */ >> >> > >> >> rc = recover_one_client(p_client); >> >> > >> >> + if (p_client == NULL) goto done; >> >> > >> >> + >> >> > >> > [Lennart] Why is this needed? This check is done in "while >> >> > >> > (p_client >> >> > >> > != NULL) { " and p_client cannot become NULL until "p_client = >> >> > >> > p_client->next; " is done as the last line in the while loop. >> >> > >> > >> >> > >> [Vu] It is my intention. I am afraid if the context is switched to >> >> > >> client thread >> >> > >> right after the recovery thread enters the while loop. >> >> > >> If it is the case, p_client will be NULL after done >> recover_one_client(). >> >> > >> >> >> > >> >> TRACE("\t Client %d is recovered", p_client- >> >> > >> >> >lgs_client_id); >> >> > >> >> if (rc == -1) { >> >> > >> >> TRACE("%s recover_one_client >> >> > >> >> Fail Deleting cllient (id %d)", >> >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re >> >> > >> >> >> >> > >> >> TRACE_ENTER(); >> >> > >> >> >> >> > >> >> + if (p_client == NULL) { >> >> > >> >> + /** >> >> > >> >> + * Probably the client_list has been deleted by >> >> > >> >> client thread >> >> > >> >> + * in context calling saLogFinalize(). So, give up >> >> > >> >> recovering. >> >> > >> >> + */ >> >> > >> >> + rc = -1; >> >> > >> >> + goto done; >> >> > >> >> + } >> >> > >> >> + >> >> > >> >> /* This function may be called both in the recovery thread and >> >> > >> >> in the >> >> > >> >> * client thread. A possible scenario is that the recovery 2 >> >> > >> >> thread >> >> > >> >> * times out and calls this function in recovery mode 2 while >> >> > >> >> there >> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client( >> >> > >> >> lga_client_hdl_rec_t *rm_node >> >> > >> >> ) >> >> > >> >> { >> >> > >> >> - TRACE_ENTER2(); >> >> > >> >> - uint32_t ncs_rc; >> >> > >> >> + return lga_hdl_rec_del(list_head, rm_node); >> >> > >> >> +} >> >> > >> >> >> >> > >> >> +/** >> >> > >> >> + * Free the memory allocated for one client handle with mutext >> >> > >> protection. >> >> > >> >> + * >> >> > >> >> + * @param p_client_hdl >> >> > >> >> + * @return void >> >> > >> >> + */ >> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl) >> >> > >> >> +{ >> >> > >> >> + lga_client_hdl_rec_t *client_hdl; >> >> > >> >> + >> >> > >> >> + /** >> >> > >> >> + * To prevent race condition b/w client thread and recovery >> >> > >> >> thread >> >> > >> >> + * accessing to common resource `client_list`. [#1705] >> >> > >> >> + */ >> >> > >> >> osaf_mutex_lock_ordie(&lga_recov_mutex); >> >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node); >> >> > >> >> + >> >> > >> >> + client_hdl = *p_client_hdl; >> >> > >> >> + if (client_hdl == NULL) goto done; >> >> > >> >> + >> >> > >> >> + free(client_hdl); >> >> > >> >> + client_hdl = NULL; >> >> > >> >> + >> >> > >> >> +done: >> >> > >> >> osaf_mutex_unlock_ordie(&lga_recov_mutex); >> >> > >> >> >> >> > >> >> - TRACE_LEAVE(); >> >> > >> >> - return ncs_rc; >> >> > >> >> } >> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h >> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h >> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h >> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h >> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client( >> >> > >> >> lga_client_hdl_rec_t **list_head, >> >> > >> >> lga_client_hdl_rec_t *rm_node >> >> > >> >> ); >> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl); >> >> > >> >> >> >> > >> >> #ifdef __cplusplus >> >> > >> >> } >> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c >> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c >> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c >> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c >> >> > >> >> @@ -19,6 +19,7 @@ >> >> > >> >> #include <syslog.h> >> >> > >> >> #include "lga.h" >> >> > >> >> #include "osaf_poll.h" >> >> > >> >> +#include "lga_state.h" >> >> > >> >> >> >> > >> >> /* Variables used during startup/shutdown only */ >> >> > >> >> static pthread_mutex_t lga_lock = >> PTHREAD_MUTEX_INITIALIZER; >> >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec >> >> > >> >> /** clean up the channel records for this lga-client >> >> > >> >> **/ >> >> > >> >> lga_log_stream_hdl_rec_list_del(&client_hdl- >> >> > >> >> >stream_list); >> >> > >> >> - /** remove the association with hdl-mngr >> >> > >> >> - **/ >> >> > >> >> - free(client_hdl); >> >> > >> >> - client_hdl = 0; >> >> > >> >> + free_client_hdl(&client_hdl); >> >> > >> >> } >> >> > >> >> + >> >> > >> >> TRACE_LEAVE(); >> >> > >> >> } >> >> > >> >> >> >> > >> >> @@ -536,38 +535,35 @@ uint32_t >> lga_hdl_rec_del(lga_client_hdl_ >> >> > >> >> if (list_iter == rm_node) { >> >> > >> >> *list_head = rm_node->next; >> >> > >> >> >> >> > >> >> - /** detach & release the IPC >> >> > >> >> - **/ >> >> > >> >> + /* detach & release the IPC */ >> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, >> >> > >> >> lga_clear_mbx, NULL); >> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >> >> > >> >> >> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, >> >> > >> >> rm_node->local_hdl); >> >> > >> >> - /** Free the channel records off this hdl >> >> > >> >> - **/ >> >> > >> >> + >> >> > >> >> + /* Free the channel records off this hdl */ >> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node- >> >> > >> >> >stream_list); >> >> > >> >> + free_client_hdl(&rm_node); >> >> > >> >> >> >> > >> >> - /** free the hdl rec >> >> > >> >> - **/ >> >> > >> >> - free(rm_node); >> >> > >> >> rc = NCSCC_RC_SUCCESS; >> >> > >> >> goto out; >> >> > >> >> - } else { /* find the rec */ >> >> > >> >> + } else { >> >> > >> >> + /* find the rec */ >> >> > >> >> while (NULL != list_iter) { >> >> > >> >> if (list_iter->next == rm_node) { >> >> > >> >> list_iter->next = >> >> > >> >> rm_node->next; >> >> > >> >> >> >> > >> >> - /** detach & release the IPC */ >> >> > >> >> + /* detach & release >> >> > >> >> the IPC */ >> >> > >> >> >> >> > >> >> m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx, >> >> > >> NULL); >> >> > >> >> >> >> > >> >> m_NCS_IPC_RELEASE(&rm_node->mbx, NULL); >> >> > >> >> >> >> > >> >> >> >> > >> >> ncshm_give_hdl(rm_node->local_hdl); >> >> > >> >> >> >> > >> >> ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node- >> >> > >> >> >local_hdl); >> >> > >> >> - /** Free the channel records off this lga_hdl */ >> >> > >> >> + >> >> > >> >> + /* Free the channel >> >> > >> >> records off this lga_hdl */ >> >> > >> >> >> >> > >> >> lga_log_stream_hdl_rec_list_del(&rm_node->stream_list); >> >> > >> >> - >> >> > >> >> - /** free the hdl rec */ >> >> > >> >> - free(rm_node); >> >> > >> >> + >> >> > >> >> free_client_hdl(&rm_node); >> >> > >> >> >> >> > >> >> rc = >> >> > >> >> NCSCC_RC_SUCCESS; >> >> > >> >> goto out; >> >> > >> >> >> > >>
lgsv_missMutexProtection_1705_r2.patch
Description: Binary data
------------------------------------------------------------------------------
_______________________________________________ Opensaf-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/opensaf-devel
