Hi Lennart,

Thanks for your comments.

I think it is good idea to pick the lgs_state & lga_state out of lga_cb,
and use the mutex cb_lock to protect changes in lga control block.

I reflect these things in attached file.

And to make the code a bit clean, I have defined set/get functions for 
lgs_state & lga_state.
Also, I declare the mutex cb_lock as recursive one to avoid possible 
recursively lock in one thread.

Could you have a look and give your comments?

Regards, Vu.

>-----Original Message-----
>From: Lennart Lund [mailto:lennart.l...@ericsson.com]
>Sent: Tuesday, April 05, 2016 6:59 PM
>To: Vu Minh Nguyen
>Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net; Lennart
>Lund
>Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource in
>log agent [#1705]
>
>Hi Vu
>
>See my comments inline [Lennart]
>
>Note. My comments does not give the complete solution. For that I need to
>use more time but may give you some ideas.
>
>/Lennart
>
>> -----Original Message-----
>> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> Sent: den 5 april 2016 10:48
>> To: Lennart Lund
>> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
>> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
>> in log agent [#1705]
>>
>> Hi Lennart,
>>
>> Thanks for your proposal.
>>
>> Still, there is race condition b/w mds thread and recover/client thread, I
>> think.
>> 1) Headless while recover thread 2 is running
>> 2) Headless while the client thread is accessing/modifying to client list.
>>
>[Lennart] In the mds thread the lga_no_server_state_set() is called if server
>down is detected. If recovery state is LGA_RECOVERY2 the thread is stopped
>and recovery state is set to state LGA_NO_SERVER. This means that we start all
>over again with recovery state LGA_RECOVERY1 when the server comes back
>again.
>Within the thread the cb lock is taken while the client list is deleted if 
>needed.
>With a new mutex for lga_state this mutex should also be used in all other
>places where lga_state is handled not the cb_lock. This means that the name
>"recovery2" mutex is not so good instead there should be a lga_state mutex.
>The lga_state variable may also be moved outside the lga_cb structure.
>
>> I would like to propose other one, using only one mutex cb_lock @lga_cb
>> to protect common `lga_cb` resource, including `state`, `client` list, etc. 
>> in
>> three threads.
>>
>[Lennart] The cb_lock is used to protect variables in the cb structure but the
>lga_state mutex prevent recovery or API operations from being done at the
>same time.
>
>> 1) Client thread
>> - Lock mutex before getting the state
>> - Unlock right before the API operation is finished
>>
>[Lennart] Yes but the lga_state mutex
>
>> 2) Recovery2 thread
>> - Lock mutex after timeout
>> - Unlock before ending the thread
>>
>[Lennart] No this will cause the API to be locked as long as the recovery 2
>thread is running and this is not wanted. The API shall work also while the
>recovery thread is running and most of the APIs shall return with
>SA_AIS_TRY_AGAIN during recovery 2 state.
>
>> 3) MDS thread
>> - Lock/unlock before/after changing state
>> In case of headless:
>> - Lock/unlock mutex before modifying `client_list` in
>> lga_no_server_state_set()
>>
>[Lennart] lga_state is not changed directly in the mds thread but indirectly
>when the lga_serv_recov1state_set() and lga_no_server_state_set() functions
>are called. I these functions the recovery thread and state change is handled
>and the mutex used shall be changed to the lga_state mutex.
>
>The client data should be protected by the cb_lock mutex in all places which is
>not the case. There is no protection in the mds trhead in the
>lga_lgs_msg_proc() where client data is used via the
>ga_find_hdl_rec_by_regid() function. This function should lock the lga_cb lock
>while handling the lga_cb client data.
>Also there are two functions in lga_state.c that are handling client data 
>locking
>a lga_recov_mutex. This does not protect client data handled in the mds
>thread.
>
>> What is your opinion? Do you think there is impact to log client if they call
>> LOG API in multi threads?
>>
>[Lennart] The mutex handling should be cleaned up.
> - Recovery state should be protected by its own mutex so that APIs can work
>in respective state without locking.
> - Client data should be protected in all threads by the lga_cb lock.
> - The lga_recov_mutex can probably be removed
>
>> Regards, Vu.
>>
>>
>> >-----Original Message-----
>> >From: Lennart Lund [mailto:lennart.l...@ericsson.com]
>> >Sent: Monday, April 04, 2016 3:51 PM
>> >To: Vu Minh Nguyen
>> >Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net;
>> Lennart
>> >Lund
>> >Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common resource
>> in
>> >log agent [#1705]
>> >
>> >Hi Vu,
>> >
>> >A simpler solution than described below, maybe:
>> >
>> >Still use a separate "recovery2" mutex but a "normal one".
>> > - In the API, lock the mutex just before checking if we are in recovery 
>> > state
>> 2
>> >(before "if (lga_state == LGA_RECOVERY2) {")
>> > - In the API, do not unlock until the API operation is finished
>> > - In the recovery thread lock the mutex just before  changing the recovery
>> >state to recovery state 2 (replace lock of lga_cb.cb_lock with recovery2
>> mutex)
>> >(protect the lga_state variable)
>> >- In the recovery thread the mutex shall be unlocked just after lga_state is
>> set
>> >to recovery state 2.
>> >
>> >This means that the recovery thread cannot set recovery state2 or execute
>> >until an eventual ongoing API operation is done.
>> >If the thread starts and set recovery state just before the API checks
>> recovery
>> >state the API will correctly see that we are in recovery state 2 and handle
>> thing
>> >accordingly.
>> >
>> >Regards
>> >Lennart
>> >
>> >
>> >> -----Original Message-----
>> >> From: Lennart Lund
>> >> Sent: den 1 april 2016 16:51
>> >> To: Vu Minh Nguyen
>> >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net;
>> >> Lennart Lund
>> >> Subject: RE: [PATCH 1 of 1] log: miss mutex protection for common
>> resource
>> >> in log agent [#1705]
>> >>
>> >> Hi Vu,
>> >>
>> >> You are right, good finding, this may happen since the recovery2 thread is
>> >> started and is waiting for a timeout. If this timeout happen at the "wrong
>> >> time" bad things may happen.
>> >> However the actual problem should be solved.
>> >>
>> >> This is how it could be done (maybe you can find a better way?):
>> >> When timeout the recovery thread shall set recovery state 2 as is but 
>> >> shall
>> >> not start any recovery if there is any ongoing client activity.
>> >> This could be solved by using a "recovery2" mutex (not the existing
>> >> lga_cb.cb_lock).
>> >>  - The recovery thread lock this mutex after setting recovery state 2.
>> >>  - During recovery state 1 all APIs check if the mutex is taken and if so
>> return
>> >> TRY AGAIN (this means that we are in recovery state 2 but the state
>> change
>> >> may have happened after the check in the API). Note: "trylock" cannot be
>> >> used for this purpose since a sequence trylock -> lock is not atomic.
>> Maybe a
>> >> PTHREAD_MUTEX_ERRORCHECK type of mutex can be used.
>> >>  - If the mutex is not taken it shall be locked by the API function and be
>> >> unlocked just before the API function returns preventing the recovery
>> >> thread from starting recovery during an ongoing API operation.
>> >> The next time an API is called recovery state 2 is set and the API 
>> >> function
>> will
>> >> operate accordingly.
>> >>
>> >> Thanks
>> >> Lennart
>> >>
>> >> > -----Original Message-----
>> >> > From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> >> > Sent: den 1 april 2016 16:02
>> >> > To: Lennart Lund
>> >> > Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
>> >> > Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>> >> resource
>> >> > in log agent [#1705]
>> >> >
>> >> > Hi Lennart,
>> >> >
>> >> > Is it possible for following case happens?
>> >> >
>> >> > Client thread comes to point send_Finalize_msg() @lga_api, then LOG
>> >> > service up, recovery thread starts and
>> >> > comes to the point right after "while (pclient != NULL)" in
>> >> > recovery2_thread @lga_state.c
>> >> > Then, the context is switching to the client thread and it deletes the
>> >> > resource while recovery2_thread is running?
>> >> >
>> >> > Regards, Vu.
>> >> >
>> >> > Quoting Lennart Lund <lennart.l...@ericsson.com>:
>> >> >
>> >> > > Hi Vu,
>> >> > >
>> >> > >               /* Recover clients one at a time */
>> >> > >               rc = recover_one_client(p_client);
>> >> > >  +            if (p_client == NULL) goto done;
>> >> > >  +
>> >> > >  [Lennart] Why is this needed? This check is done in "while
>> >> > > (p_client != NULL) { "
>> >> > > and p_client cannot become NULL until "p_client = p_client->next; "
>> >> > > is done as
>> >> > > the last line in the while loop.
>> >> > >
>> >> > >  [Vu] It is my intention. I am afraid if the context is switched to
>> >> > >  client thread
>> >> > >  right after the recovery thread enters the while loop.
>> >> > >  If it is the case, p_client will be NULL after done 
>> >> > > recover_one_client().
>> >> > >
>> >> > > [Lennart] I don't really understand your explanation. What is
>> >> > > happening in the client thread when we are in recovery state 2 that
>> >> > > can set p_client == NULL? The client thread should not be able to do
>> >> > > anything with the data involved when recovering client in the
>> >> > > recovery thread?
>> >> > >
>> >> > > Regards
>> >> > > Lennart
>> >> > >
>> >> > >> -----Original Message-----
>> >> > >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> >> > >> Sent: den 1 april 2016 14:14
>> >> > >> To: Lennart Lund
>> >> > >> Cc: mathi.naic...@oracle.com; opensaf-devel@lists.sourceforge.net
>> >> > >> Subject: Re: [PATCH 1 of 1] log: miss mutex protection for common
>> >> > resource
>> >> > >> in log agent [#1705]
>> >> > >>
>> >> > >> Hi Lennart,
>> >> > >>
>> >> > >> Please see my comment inline.
>> >> > >>
>> >> > >> Regards, Vu
>> >> > >>
>> >> > >> Quoting Lennart Lund <lennart.l...@ericsson.com>:
>> >> > >>
>> >> > >> > Hi Vu
>> >> > >> >
>> >> > >> > Ack with comments
>> >> > >> >
>> >> > >> > See my comments inline [Lennart]:
>> >> > >> >
>> >> > >> > Thanks
>> >> > >> > Lennart
>> >> > >> >
>> >> > >> >> -----Original Message-----
>> >> > >> >> From: Vu Minh Nguyen [mailto:vu.m.ngu...@dektech.com.au]
>> >> > >> >> Sent: den 23 mars 2016 08:16
>> >> > >> >> To: mathi.naic...@oracle.com; Lennart Lund
>> >> > >> >> Cc: opensaf-devel@lists.sourceforge.net
>> >> > >> >> Subject: [PATCH 1 of 1] log: miss mutex protection for common
>> >> > resource
>> >> > >> in
>> >> > >> >> log agent [#1705]
>> >> > >> >>
>> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.c |  41
>> >> > >> >> +++++++++++++++++++++++++++++++----
>> >> > >> >>  osaf/libs/agents/saf/lga/lga_state.h |   1 +
>> >> > >> >>  osaf/libs/agents/saf/lga/lga_util.c  |  30 
>> >> > >> >> +++++++++++-------------
>> -
>> >> > >> >>  3 files changed, 50 insertions(+), 22 deletions(-)
>> >> > >> >>
>> >> > >> >>
>> >> > >> >> There was an race condition between client thread and recovery
>> >> > thread
>> >> > >> >> accessing to common resource `client_list`.
>> >> > >> >>
>> >> > >> >> Add mutex protection.
>> >> > >> >>
>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.c
>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.c
>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.c
>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.c
>> >> > >> >> @@ -355,8 +355,11 @@ static void *recovery2_thread(void
>> *dumm
>> >> > >> >>                   /* We have been signaled to exit
>> >> > >> >> */
>> >> > >> >>                   goto done;
>> >> > >> >>           }
>> >> > >> >> +
>> >> > >> >>           /* Recover clients one at a time */
>> >> > >> >>           rc = recover_one_client(p_client);
>> >> > >> >> +         if (p_client == NULL) goto done;
>> >> > >> >> +
>> >> > >> > [Lennart] Why is this needed? This check is done in "while 
>> >> > >> > (p_client
>> >> > >> > != NULL) { " and p_client cannot become NULL until "p_client =
>> >> > >> > p_client->next; " is done as the last line in the while loop.
>> >> > >> >
>> >> > >> [Vu] It is my intention. I am afraid if the context is switched to
>> >> > >> client thread
>> >> > >> right after the recovery thread enters the while loop.
>> >> > >> If it is the case, p_client will be NULL after done
>> recover_one_client().
>> >> > >>
>> >> > >> >>           TRACE("\t Client %d is recovered", p_client-
>> >> > >> >> >lgs_client_id);
>> >> > >> >>           if (rc == -1) {
>> >> > >> >>                   TRACE("%s recover_one_client
>> >> > >> >> Fail Deleting cllient (id %d)",
>> >> > >> >> @@ -593,6 +596,15 @@ int recover_one_client(lga_client_hdl_re
>> >> > >> >>
>> >> > >> >>   TRACE_ENTER();
>> >> > >> >>
>> >> > >> >> + if (p_client == NULL) {
>> >> > >> >> +         /**
>> >> > >> >> +          * Probably the client_list has been deleted by
>> >> > >> >> client thread
>> >> > >> >> +          * in context calling saLogFinalize(). So, give up
>> >> > >> >> recovering.
>> >> > >> >> +          */
>> >> > >> >> +         rc = -1;
>> >> > >> >> +         goto done;
>> >> > >> >> + }
>> >> > >> >> +
>> >> > >> >>   /* This function may be called both in the recovery thread and
>> >> > >> >> in the
>> >> > >> >>    * client thread. A possible scenario is that the recovery 2
>> >> > >> >> thread
>> >> > >> >>    * times out and calls this function in recovery mode 2 while
>> >> > >> >> there
>> >> > >> >> @@ -658,13 +670,32 @@ uint32_t delete_one_client(
>> >> > >> >>   lga_client_hdl_rec_t *rm_node
>> >> > >> >>   )
>> >> > >> >>  {
>> >> > >> >> - TRACE_ENTER2();
>> >> > >> >> - uint32_t ncs_rc;
>> >> > >> >> + return lga_hdl_rec_del(list_head, rm_node);
>> >> > >> >> +}
>> >> > >> >>
>> >> > >> >> +/**
>> >> > >> >> + * Free the memory allocated for one client handle with mutext
>> >> > >> protection.
>> >> > >> >> + *
>> >> > >> >> + * @param p_client_hdl
>> >> > >> >> + * @return void
>> >> > >> >> + */
>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl)
>> >> > >> >> +{
>> >> > >> >> + lga_client_hdl_rec_t *client_hdl;
>> >> > >> >> +
>> >> > >> >> + /**
>> >> > >> >> +  * To prevent race condition b/w client thread and recovery
>> >> > >> >> thread
>> >> > >> >> +  * accessing to common resource `client_list`. [#1705]
>> >> > >> >> +  */
>> >> > >> >>   osaf_mutex_lock_ordie(&lga_recov_mutex);
>> >> > >> >> - ncs_rc = lga_hdl_rec_del(list_head, rm_node);
>> >> > >> >> +
>> >> > >> >> + client_hdl = *p_client_hdl;
>> >> > >> >> + if (client_hdl == NULL) goto done;
>> >> > >> >> +
>> >> > >> >> + free(client_hdl);
>> >> > >> >> + client_hdl = NULL;
>> >> > >> >> +
>> >> > >> >> +done:
>> >> > >> >>   osaf_mutex_unlock_ordie(&lga_recov_mutex);
>> >> > >> >>
>> >> > >> >> - TRACE_LEAVE();
>> >> > >> >> - return ncs_rc;
>> >> > >> >>  }
>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_state.h
>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_state.h
>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_state.h
>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_state.h
>> >> > >> >> @@ -32,6 +32,7 @@ uint32_t delete_one_client(
>> >> > >> >>   lga_client_hdl_rec_t **list_head,
>> >> > >> >>   lga_client_hdl_rec_t *rm_node
>> >> > >> >>   );
>> >> > >> >> +void free_client_hdl(lga_client_hdl_rec_t **p_client_hdl);
>> >> > >> >>
>> >> > >> >>  #ifdef   __cplusplus
>> >> > >> >>  }
>> >> > >> >> diff --git a/osaf/libs/agents/saf/lga/lga_util.c
>> >> > >> >> b/osaf/libs/agents/saf/lga/lga_util.c
>> >> > >> >> --- a/osaf/libs/agents/saf/lga/lga_util.c
>> >> > >> >> +++ b/osaf/libs/agents/saf/lga/lga_util.c
>> >> > >> >> @@ -19,6 +19,7 @@
>> >> > >> >>  #include <syslog.h>
>> >> > >> >>  #include "lga.h"
>> >> > >> >>  #include "osaf_poll.h"
>> >> > >> >> +#include "lga_state.h"
>> >> > >> >>
>> >> > >> >>  /* Variables used during startup/shutdown only */
>> >> > >> >>  static pthread_mutex_t lga_lock =
>> PTHREAD_MUTEX_INITIALIZER;
>> >> > >> >> @@ -449,11 +450,9 @@ void lga_hdl_list_del(lga_client_hdl_rec
>> >> > >> >>   /** clean up the channel records for this lga-client
>> >> > >> >>           **/
>> >> > >> >>           lga_log_stream_hdl_rec_list_del(&client_hdl-
>> >> > >> >> >stream_list);
>> >> > >> >> - /** remove the association with hdl-mngr
>> >> > >> >> -         **/
>> >> > >> >> -         free(client_hdl);
>> >> > >> >> -         client_hdl = 0;
>> >> > >> >> +         free_client_hdl(&client_hdl);
>> >> > >> >>   }
>> >> > >> >> +
>> >> > >> >>   TRACE_LEAVE();
>> >> > >> >>  }
>> >> > >> >>
>> >> > >> >> @@ -536,38 +535,35 @@ uint32_t
>> lga_hdl_rec_del(lga_client_hdl_
>> >> > >> >>   if (list_iter == rm_node) {
>> >> > >> >>           *list_head = rm_node->next;
>> >> > >> >>
>> >> > >> >> - /** detach & release the IPC
>> >> > >> >> -         **/
>> >> > >> >> +         /* detach & release the IPC */
>> >> > >> >>           m_NCS_IPC_DETACH(&rm_node->mbx,
>> >> > >> >> lga_clear_mbx, NULL);
>> >> > >> >>           m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>> >> > >> >>
>> >> > >> >>           ncshm_give_hdl(rm_node->local_hdl);
>> >> > >> >>           ncshm_destroy_hdl(NCS_SERVICE_ID_LGA,
>> >> > >> >> rm_node->local_hdl);
>> >> > >> >> - /** Free the channel records off this hdl
>> >> > >> >> -         **/
>> >> > >> >> +
>> >> > >> >> +         /* Free the channel records off this hdl */
>> >> > >> >>           lga_log_stream_hdl_rec_list_del(&rm_node-
>> >> > >> >> >stream_list);
>> >> > >> >> +         free_client_hdl(&rm_node);
>> >> > >> >>
>> >> > >> >> - /** free the hdl rec
>> >> > >> >> -         **/
>> >> > >> >> -         free(rm_node);
>> >> > >> >>           rc = NCSCC_RC_SUCCESS;
>> >> > >> >>           goto out;
>> >> > >> >> - } else {                /* find the rec */
>> >> > >> >> + } else {
>> >> > >> >> +         /* find the rec */
>> >> > >> >>           while (NULL != list_iter) {
>> >> > >> >>                   if (list_iter->next == rm_node) {
>> >> > >> >>                           list_iter->next =
>> >> > >> >> rm_node->next;
>> >> > >> >>
>> >> > >> >> -         /** detach & release the IPC */
>> >> > >> >> +                         /* detach & release
>> >> > >> >> the IPC */
>> >> > >> >>
>> >> > >> >>   m_NCS_IPC_DETACH(&rm_node->mbx, lga_clear_mbx,
>> >> > >> NULL);
>> >> > >> >>
>> >> > >> >>   m_NCS_IPC_RELEASE(&rm_node->mbx, NULL);
>> >> > >> >>
>> >> > >> >>
>> >> > >> >>   ncshm_give_hdl(rm_node->local_hdl);
>> >> > >> >>
>> >> > >> >>   ncshm_destroy_hdl(NCS_SERVICE_ID_LGA, rm_node-
>> >> > >> >> >local_hdl);
>> >> > >> >> -         /** Free the channel records off this lga_hdl  */
>> >> > >> >> +
>> >> > >> >> +                         /* Free the channel
>> >> > >> >> records off this lga_hdl  */
>> >> > >> >>
>> >> > >> >>   lga_log_stream_hdl_rec_list_del(&rm_node->stream_list);
>> >> > >> >> -
>> >> > >> >> -         /** free the hdl rec */
>> >> > >> >> -                         free(rm_node);
>> >> > >> >> +
>> >> > >> >>   free_client_hdl(&rm_node);
>> >> > >> >>
>> >> > >> >>                           rc =
>> >> > >> >> NCSCC_RC_SUCCESS;
>> >> > >> >>                           goto out;
>> >> > >>
>> >> >
>>

Attachment: lgsv_missMutexProtection_1705_r2.patch
Description: Binary data

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Reply via email to