Re: [devel] [PATCH 1 of 1] leap: ncs_os_process_execute_timed child process takes too long time before exec (#514)

Hans Nordebäck Wed, 07 Aug 2013 01:13:54 -0700

Hi, what is the conclusion, can this patch be acked? /BR HansN

-----Original Message-----
From: Hans Nordebäck 
Sent: den 6 augusti 2013 16:41
To: praveen malviya
Cc: Nagendra Kumar; Ramesh Babu Betham; [email protected]; 
Hans Feldt; Anders Widell; Hans Nordebäck
Subject: RE: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
process takes too long time before exec (#514)


the fault described below is related to ticket #532.   /HansN

-----Original Message-----
From: Hans Nordebäck
Sent: den 6 augusti 2013 16:28
To: praveen malviya
Cc: Nagendra Kumar; Ramesh Babu Betham; [email protected]; 
Hans Feldt; Anders Widell; Hans Nordebäck
Subject: RE: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
process takes too long time before exec (#514)

Hi, Yes, it seems most likely that the child is "hanging" somewhere before 
exec, if it had crashed it should be a sigchld  sent, which not seems to be the 
case here.
Then the alarm patch needs to be applied I guess to detect where it is 
"hanging". 

 /BR HansN 

-----Original Message-----
From: praveen malviya [mailto:[email protected]]
Sent: den 6 augusti 2013 09:34
To: Hans Nordebäck
Cc: Nagendra Kumar; Ramesh Babu Betham; [email protected]; 
Hans Feldt; Anders Widell
Subject: Re: SV: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
process takes too long time before exec (#514)

Hi,
         I added a new syslog entry(to debug) before execv and was able to 
reproduce the problem.
The reason for the time out seems to be that syslog was *hung*, but
*after* writing the new log message into /var/log/messages.
So, there is a high possibility that the child process gets hung in syslog or 
other file operations.
See bt below:
Thread 1 (Thread 0x7f030a16f700 (LWP 16331)):

#0  0x00007f0308de4b1e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f0308dd4398 in _L_lock_634 () from /lib64/libc.so.6
#2  0x00007f0308dd3e62 in __vsyslog_chk () from /lib64/libc.so.6
#3  0x00007f0308dd4320 in syslog () from /lib64/libc.so.6
#4  0x00007f0309d38feb in ncs_os_process_execute_timed (req=0x7fffdcc52770, 
flag=false) at os_defs.c:1059
#5  0x0000000000416bd6 in avnd_comp_clc_cmd_execute (cb=0x6578a0, 
comp=0x673840, cmd_type=AVND_COMP_CLC_CMD_TYPE_INSTANTIATE)
     at avnd_clc.c:2687
#6  0x000000000041a630 in avnd_comp_clc_uninst_inst_hdler (cb=0x6578a0,
comp=0x673840) at avnd_clc.c:1384
#7  0x000000000041865b in avnd_comp_clc_fsm_run (cb=0x6578a0, comp=0x673840, 
ev=AVND_COMP_CLC_PRES_FSM_EV_INST)
     at avnd_clc.c:874
#8  0x0000000000433f5f in avnd_su_pres_insting_compinst_hdler
(cb=0x6578a0, su=0x66fc70, comp=0x674e10) at avnd_susm.c:1759
#9  0x0000000000435764 in avnd_su_pres_fsm_run (cb=0x6578a0, su=0x66fc70, 
comp=0x674e10,
     ev=AVND_SU_PRES_FSM_EV_COMP_INSTANTIATED) at avnd_susm.c:1293
#10 0x000000000041748b in avnd_comp_clc_st_chng_prc (cb=0x6578a0, 
comp=0x674e10, prv_st=SA_AMF_PRESENCE_INSTANTIATING,
     final_st=SA_AMF_PRESENCE_INSTANTIATED) at avnd_clc.c:1313
#11 0x00000000004186ef in avnd_comp_clc_fsm_run (cb=0x6578a0, comp=0x674e10, 
ev=AVND_COMP_CLC_PRES_FSM_EV_INST_SUCC)
     at avnd_clc.c:897
#12 0x000000000041f400 in avnd_comp_reg_prc (cb=0x6578a0, comp=0x674e10, 
pxy_comp=<optimized out>, reg=0x677528,
     dest=0x677518) at avnd_comp.c:709
#13 0x000000000041fbd7 in avnd_evt_ava_comp_reg_evh (cb=0x6578a0,
evt=0x683530) at avnd_comp.c:212
#14 0x000000000042fcc8 in avnd_evt_process (evt=<optimized out>) at
avnd_proc.c:278
#15 avnd_main_process () at avnd_proc.c:219
#16 0x0000000000408835 in main (argc=2, argv=0x7fffdcc53588) at
amfnd_main.c:53


Thanks,
Praveen.


On 03-Aug-13 9:47 PM, Hans Nordebäck wrote:
> Hi,
>
> I think the problem with "child before exec" "hangs" and the problem where 
> "child" at cleanup times out instantly, even though a 10 sec. timeout is set, 
> are related.
>
> Some reasons why the child seems to hang could be:
>
> 1. Not run at all
> 2. "Hangs" in e.g. close.
> 3. Crashes, but then whe should have core dumps.
>
> but the problem with "hanging" and instantly times out could be:
>
> 4. When parent process adds the pid in the patricia tree, 
> add_new_req_pid_in_list, ncs_exec_mod_hdlr is able to kill this child pid 
> before it has started.
> A e.g. race condtion in this code could explain both these faults.
> I looked briefly in this code and one thing is the delivery of the 
> SIGCHLD signal, any thread can get that signal, e.g amfnd, mds, timer or 
> ncs_exec_mod_hdlr. I think a dedicated thread should handle SIGCHLD.
> I guess we should look closer to this part.
>
> /BR HansN
> ________________________________________
> Från: Hans Nordebäck
> Skickat: den 30 juli 2013 13:05
> Till: Nagendra Kumar
> Cc: Ramesh Babu Betham; [email protected]; Praveen 
> Malviya; Hans Feldt
> Ämne: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
> process takes too long time before exec (#514)
>
> Hi Nagu, I have tested the patch and simulated "hanging" in the child part, 
> the core dump produced points out where.
> The system works as before, exept the core dump if it hangs in the 
> child part. We have this patch running at a site waiting for it to trigger, 
> so we don't know yet what is causing this and if the patch will help.
>
>   /BR HansN
>
> -----Original Message-----
> From: Nagendra Kumar [mailto:[email protected]]
> Sent: den 30 juli 2013 13:05
> To: Hans Nordebäck
> Cc: Ramesh Babu Betham; [email protected]; Praveen 
> Malviya; Hans Feldt
> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
> process takes too long time before exec (#514)
>
> Hi Hans N,
>
> Close is not a sync call. I am still not convinced, these calls may get 
> blocked because of NFS system issue.
> Have you tested by keeping abort in child process and see how amfnd or other 
> services behaves(just to simulate abort)?
>
> Thanks
> -Nagu
>
> -----Original Message-----
> From: Hans Nordebäck [mailto:[email protected]]
> Sent: 30 July 2013 16:16
> To: Nagendra Kumar
> Cc: Ramesh Babu Betham; [email protected]; Praveen 
> Malviya; Hans Feldt
> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
> process takes too long time before exec (#514)
>
> Hi Nagu, I agree with Ramesh, e.g. in the child part:
>
>       /* By default we close all inherited file descriptors in the child */
>           if (getenv("OPENSAF_KEEP_FD_OPEN_AFTER_FORK") == NULL) {
>               /* Close all inherited file descriptors */
>               int i = sysconf(_SC_OPEN_MAX);
>               if (i == -1) {
>                   syslog(LOG_ERR, "%s: sysconf failed - %s", __FUNCTION__, 
> strerror(errno));
>                   exit(EXIT_FAILURE);
>               }
>               for (i--; i >= 0; --i)
>                   (void) close(i); /* close all descriptors */
>
>               /* Redirect standard files to /dev/null */
>               if (freopen("/dev/null", "r", stdin) == NULL)
>                   syslog(LOG_ERR, "%s: freopen stdin failed - %s", 
> __FUNCTION__, strerror(errno));
>               if (freopen("/dev/null", "w", stdout) == NULL)
>                   syslog(LOG_ERR, "%s: freopen stdout failed - %s", 
> __FUNCTION__, strerror(errno));
>               if (freopen("/dev/null", "w", stderr) == NULL)
>                   syslog(LOG_ERR, "%s: freopen stderr failed - %s", 
> __FUNCTION__, strerror(errno));
>           }
>
> close are issued.
>
> /BR HansN
>
> On 07/30/13 12:34, Nagendra Kumar wrote:
>> But where are file operations coming during fork, can you please explain ?
>>
>> Thanks
>> -Nagu
>> -----Original Message-----
>> From: Ramesh Betham
>> Sent: 30 July 2013 15:53
>> To: Nagendra Kumar
>> Cc: Hans Nordebäck; [email protected]; Praveen 
>> Malviya; Hans Feldt
>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
>> process takes too long time before exec (#514)
>>
>> Nagu: Hans N might be pointing to the chances of hung of file-operation 
>> calls (esp.  when some inconsistency happens with NFS).  Just a guess, let 
>> Hans N confirm it.
>>
>> Thanks,
>> Ramesh.
>>
>> On 7/30/2013 3:27 PM, Nagendra Kumar wrote:
>>> Hi,
>>>
>>>>> regarding what can "hang" in the child part, e.g close of file 
>>>>> descriptors close of file descriptors.
>>> When this can happen? After fork is successful, this shouldn't happen. Can 
>>> you please provide any example.
>>>
>>> Thanks
>>> -Nagu
>>>
>>> -----Original Message-----
>>> From: Hans Nordebäck [mailto:[email protected]]
>>> Sent: 30 July 2013 12:53
>>> To: Nagendra Kumar
>>> Cc: Hans Nordebäck; [email protected]; Praveen 
>>> Malviya; Ramesh Babu Betham; Hans Feldt
>>> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
>>> process takes too long time before exec (#514)
>>>
>>> Hi Nagu, regarding what can "hang" in the child part, e.g close of file 
>>> descriptors. /BR HansN On 07/30/13 09:01, Hans Nordebäck wrote:
>>>> Hi Nagu,
>>>>
>>>> On 07/30/13 08:54, Nagendra Kumar wrote:
>>>>> Hi Hans N,
>>>>>
>>>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new 
>>>>>>> environment variable where value is used as input to alarm,  if 
>>>>>>> not set it is default 2 seconds.
>>>>> Do we have some place holder for this variable for configuration 
>>>>> and are we going to add it in README for information.
>>>> Perhaps the name isn't the best, but it should be handled as the 
>>>> other env variable I guess, e.g. "AVND_PM_MONITORING_RATE", etc.
>>>>>>> if the child  "hangs" before exec this extra coredump should 
>>>>>>> give information  where/what is wrong.
>>>>> This means that fork hangs, am I right ? If yes, then dump is not 
>>>>> going to provide any information as it is a system call, it can 
>>>>> only show, ithangs in fork.
>>>> I don't think fork hangs as the parent part continues and later, 
>>>> with the help of ncs_exec_mod_hdlr,  the parent detects that the 
>>>> child or "exec" has timed out,
>>>> 10 sec in this case. But in this case the exec has not been performed.
>>>>>>> After exec, it will work as usual
>>>>> This confirms that we are only targeting fork to debug.
>>>> Yes, the extra core dump will help troubleshooting.
>>>> /BR HansN
>>>>> Thanks
>>>>> -Nagu
>>>>>
>>>>> -----Original Message-----
>>>>> From: Hans Nordebäck [mailto:[email protected]]
>>>>> Sent: 30 July 2013 11:57
>>>>> To: Nagendra Kumar
>>>>> Cc: [email protected]; Praveen Malviya; Ramesh 
>>>>> Babu Betham; Hans Feldt; Hans Nordebäck
>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed 
>>>>> child process takes too long time before exec (#514)
>>>>>
>>>>> Hi Nagu,
>>>>>
>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new 
>>>>> environment variable where value is used as input to alarm,  if 
>>>>> not set it is default 2 seconds.
>>>>> 2. Yes you are right, in this particular case it is set to 10 sec, 
>>>>> that's why the env. variable above can be set.
>>>>> 3. This alarm is just an additional precaution, at no extra cost, 
>>>>> to check the child part before the exec.  After exec
>>>>>          it will work as usual but if the child  "hangs" before 
>>>>> exec this extra coredump should give information  where/what is wrong.
>>>>>
>>>>> /BR HansN
>>>>>
>>>>> -----Original Message-----
>>>>> From: Nagendra Kumar [mailto:[email protected]]
>>>>> Sent: den 30 juli 2013 07:11
>>>>> To: Hans Nordebäck; Praveen Malviya; Hans Feldt; Ramesh Babu 
>>>>> Betham
>>>>> Cc: [email protected]
>>>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed 
>>>>> child process takes too long time before exec (#514)
>>>>>
>>>>> Hi Hans N,
>>>>>            For my understanding, can you please provide the below
>>>>> information:
>>>>>
>>>>> 1.    I can't find OPENSAF_CHILD_EXEC_TIME_TOLERANCE in opensaf
>>>>> source code.
>>>>> 2.    I hope the child process is hung for more than
>>>>> saAmfCtDefClcCliTimeout resulting in CLC time out. Am I right?
>>>>> 3.    Even we add assert in child process and we get core dump, but
>>>>> it may not give any information as it got delayed because of
>>>>>        system issue. Are we targeting, which system call the child 
>>>>> process is hung?
>>>>>
>>>>> Thanks
>>>>> -Nagu
>>>>>
>>>>> -----Original Message-----
>>>>> From: Hans Nordeback [mailto:[email protected]]
>>>>> Sent: 22 July 2013 17:07
>>>>> To: Nagendra Kumar; Praveen Malviya; [email protected]; 
>>>>> Ramesh Babu Betham
>>>>> Cc: [email protected]
>>>>> Subject: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child 
>>>>> process takes too long time before exec (#514)
>>>>>
>>>>>      osaf/libs/core/leap/os_defs.c |  27 +++++++++++++++++++++++++++
>>>>>      1 files changed, 27 insertions(+), 0 deletions(-)
>>>>>
>>>>>
>>>>> amfnd calls ncs_os_process_execute_timed and the child process 
>>>>> takes too long time before exec, (10 sec timeout). An alarm is set 
>>>>> in the ncs_os_process_execute_timed child process. If timed out a 
>>>>> core dump will be produced to be able to trouble shoot.
>>>>>
>>>>> diff --git a/osaf/libs/core/leap/os_defs.c 
>>>>> b/osaf/libs/core/leap/os_defs.c
>>>>> --- a/osaf/libs/core/leap/os_defs.c
>>>>> +++ b/osaf/libs/core/leap/os_defs.c
>>>>> @@ -65,6 +65,15 @@ bool gl_ncs_atomic_mtx_initialise = fals
>>>>>       * description of SOCK_CLOEXEC. */
>>>>>      static pthread_mutex_t s_cloexec_mutex = PTHREAD_MUTEX_INITIALIZER;
>>>>>      +/*
>>>>> + * ALRM signal is used to detect if child process takes too long
>>>>> time before exec.
>>>>> + *
>>>>> + * @param sig
>>>>> + */
>>>>> +static void sigalrm_handler(int sig) {
>>>>> +    abort();
>>>>> +}
>>>>> /***************************************************************************
>>>>>       *
>>>>>       * uns64
>>>>> @@ -999,6 +1008,22 @@ uint32_t ncs_os_process_execute_timed(NC
>>>>>          osaf_mutex_lock_ordie(&s_cloexec_mutex);
>>>>>            if ((pid = fork()) == 0) {
>>>>> +                unsigned int alarm_time_sec;
>>>>> +                char* alarm_time;
>>>>> +
>>>>> +                if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) {
>>>>> +                        LOG_ER("signal ALRM failed: %s",
>>>>> strerror(errno));
>>>>> +                }
>>>>> +                if ((alarm_time =
>>>>> getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) != NULL) {
>>>>> +                        alarm_time_sec = strtol(alarm_time, NULL, 0);
>>>>> +                }
>>>>> +                else {
>>>>> +                        // default alarm timeout 2 seconds
>>>>> +                        alarm_time_sec = 2;
>>>>> +                }
>>>>> +
>>>>> +                alarm(alarm_time_sec);
>>>>> +
>>>>>              /*
>>>>>               ** Make sure forked processes have default scheduling class
>>>>>               ** independent of the callers scheduling class.
>>>>> @@ -1054,6 +1079,8 @@ uint32_t ncs_os_process_execute_timed(NC
>>>>>              }
>>>>>      #endif
>>>>>      +                alarm(0);
>>>>> +
>>>>>              /* child part */
>>>>>              if (execvp(req->i_script, req->i_argv) == -1) {
>>>>>                  syslog(LOG_ERR, "%s: execvp '%s' failed - %s", 
>>>>> __FUNCTION__, req->i_script, strerror(errno));


------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead. 
Download for free and get started troubleshooting in minutes. 
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel

Re: [devel] [PATCH 1 of 1] leap: ncs_os_process_execute_timed child process takes too long time before exec (#514)

Reply via email to