[ 
https://issues.apache.org/jira/browse/MYNEWT-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Collins updated MYNEWT-745:
---------------------------------------
    Description: 
The problem appears to occur when a system call is interrupted by a sim context 
switch.  Because a sim context switch is implemented as a signal handler that 
never returns (it calls longjmp()), the system call is left unfinished.  In 
some cases, it seems the system call acquired some resources that it never got 
a chance to release, leading to deadlock on a subsequent system call. For 
whatever reason, when the original system call is resumed (i.e., when Mynewt 
switches back to the original task), the syscall is unable to recover.
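
The hazard can be reproduced in isolation. What follows is a minimal sketch, 
not Mynewt code: the names sched_ctx, lock_held, ctxsw_sketch, 
locked_call_sketch and lock_leaked_after_switch are hypothetical, and a plain 
flag stands in for whatever lock the interrupted call holds internally.  
Unlike sim, the sketch never resumes the interrupted call; it only shows that 
the release step is skipped when the handler jumps away.

```c
#define _XOPEN_SOURCE 700
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf sched_ctx;             /* hypothetical "other task" context */
static volatile sig_atomic_t lock_held;  /* stands in for a libc-internal lock */

/* Signal handler that never returns to the interrupted code. */
static void ctxsw_sketch(int sig)
{
    (void)sig;
    siglongjmp(sched_ctx, 1);
}

/* Stand-in for a call that takes a lock, does work, then releases it. */
static void locked_call_sketch(void)
{
    lock_held = 1;
    raise(SIGURG);   /* the "context switch" arrives in the middle of the call */
    lock_held = 0;   /* skipped: the handler jumped away */
}

/* Returns 1 if the lock is still held after the switch, -1 on setup error. */
int lock_leaked_after_switch(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ctxsw_sketch;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGURG, &sa, NULL) != 0) {
        return -1;
    }

    lock_held = 0;
    if (sigsetjmp(sched_ctx, 1) == 0) {
        locked_call_sketch();
    }
    /* Reached via siglongjmp(); the release in locked_call_sketch() never ran. */
    return lock_held;
}
```

Calling lock_leaked_after_switch() returns 1: the flag is still set, so any 
later caller that waits for it would block forever.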

In sim, a context switch is triggered by delivery of a SIGURG signal. Two 
events generate this signal:
# A task calls an OS function with the potential to switch tasks (e.g., 
os_eventq_get(), os_mutex_release(), etc.).
# An OS tick occurs.

The problem appears to occur when an OS tick generates the SIGURG signal.  The 
OS ticker is implemented via an itimer, which generates the SIGALRM signal on 
each tick.  The SIGALRM handler advances the OS time, and then calls 
os_sched(), potentially generating a SIGURG signal.  If the current task 
happened to be in the middle of a syscall when the tick timer expired, the 
SIGURG signal gets handled before the syscall returns.
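
The tick-to-switch chain can be sketched outside of Mynewt as follows.  This 
is an illustration under stated assumptions, not the sim implementation: 
tick_sketch, switch_sketch and switches_after_ticks are hypothetical names, 
the 1 ms period is arbitrary, and the "context switch" handler only counts 
invocations instead of calling longjmp().  Both handlers block all signals 
while they run, so the SIGURG raised from the tick handler is delivered once 
the tick handler returns, which is in the middle of whatever the interrupted 
code was doing.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t tick_count;    /* stand-in for the OS time */
static volatile sig_atomic_t switch_count;  /* SIGURG deliveries seen */

/* Tick handler: advance time, then request a context switch. */
static void tick_sketch(int sig)
{
    (void)sig;
    tick_count++;
    raise(SIGURG);   /* stays pending until this handler returns */
}

/* Stand-in for the context switch handler; it only counts. */
static void switch_sketch(int sig)
{
    (void)sig;
    switch_count++;
}

/* Runs the timer for at least n ticks; returns the number of SIGURG
 * deliveries observed, or -1 on setup error. */
int switches_after_ticks(int n)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof sa);
    sigfillset(&sa.sa_mask);          /* block all signals inside handlers */
    sa.sa_handler = switch_sketch;
    if (sigaction(SIGURG, &sa, NULL) != 0) {
        return -1;
    }
    sa.sa_handler = tick_sketch;
    if (sigaction(SIGALRM, &sa, NULL) != 0) {
        return -1;
    }

    tick_count = 0;
    switch_count = 0;
    memset(&it, 0, sizeof it);
    it.it_interval.tv_usec = 1000;    /* periodic tick, 1 ms */
    it.it_value.tv_usec = 1000;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0) {
        return -1;
    }

    while (tick_count < n) {
        pause();                      /* woken by each tick */
    }

    memset(&it, 0, sizeof it);
    setitimer(ITIMER_REAL, &it, NULL);  /* disarm the timer */
    return switch_count;
}
```

For example, switches_after_ticks(5) returns once five ticks have elapsed; 
the exact count can vary, since a SIGURG that is already pending is not 
queued a second time.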

Here is a stack trace showing a context switch in the middle of a system call:

{noformat}
(gdb) whe
#0  0x0804a3bd in ctxsw_handler (sig=23)
    at kernel/os/src/arch/sim/os_arch_sim.c:150
#1  <signal handler called>
#2  0xf7ffdbe7 in __kernel_vsyscall ()
#3  0x08097630 in __lll_lock_wait_private ()
#4  0x080923b0 in __tz_convert ()
#5  0x08091673 in localtime ()
#6  0x0809162c in ctime ()
#7  0x08048a5a in task1_handler (arg=0x0) at apps/slinky/src/main.c:162
#8  0x0804a2c8 in os_arch_task_start (sf=0x8160314, rc=1)
    at kernel/os/src/arch/sim/os_arch_sim.c:88
#9  0x0804ad90 in os_arch_frame_init ()
    at kernel/os/src/arch/sim/os_arch_stack_frame.s:98
#10 0x0804ad90 in os_arch_frame_init ()
    at kernel/os/src/arch/sim/os_arch_stack_frame.s:98
{noformat}

Attached is a simple Mynewt app that can be used to replicate this issue 
(main.c).
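
One possible way to keep a call such as ctime() from being interrupted is to 
block the tick and context switch signals around it.  This is only a sketch 
of the idea and not necessarily the fix adopted for this issue; 
ctime_signals_blocked is a hypothetical helper, and it assumes SIGALRM and 
SIGURG are the only signals that can lead to a context switch.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <time.h>

/* Formats t into buf with SIGALRM and SIGURG blocked for the duration of
 * the libc call.  Returns 0 on success, -1 on failure. */
int ctime_signals_blocked(time_t t, char *buf, size_t len)
{
    sigset_t block;
    sigset_t saved;
    const char *s;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    sigaddset(&block, SIGURG);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0) {
        return -1;
    }

    s = ctime(&t);   /* cannot be interrupted by a tick or a switch here */
    if (s != NULL && len > 0) {
        strncpy(buf, s, len - 1);
        buf[len - 1] = '\0';
    }

    /* Restore the previous mask; a tick that arrived meanwhile is
     * delivered now, after the libc call has finished. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return s != NULL ? 0 : -1;
}
```

The cost is that a tick arriving during the call is delayed until the mask is 
restored, so this only suits short calls.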

  was:
The problem appears to occur when a system call is interrupted by a sim context 
switch.  Because a sim context switch is implemented as a signal handler that 
never returns (it calls longjmp()), the system call is left unfinished.  In 
some cases, it seems the system call acquired some resources that it never got 
a chance to release, leading to deadlock on a subsequent system call.

Sim has protections in place to prevent this problem from happening.  
Specifically, a context switch is triggered by delivery of a SIGURG signal, and 
SIGURG is only sent from within the SIGALRM signal handler.  These handlers 
are configured such that all signals are blocked until the handlers complete (I 
am not sure how this works for the SIGURG handler, considering it never 
returns).

My initial guess was that a pending SIGURG signal does not get delivered as 
soon as it is unblocked at the end of the SIGALRM handler.  However, a simple 
test using sigpending() and sleep proves that this is not the case.

Here is a stack trace showing a context switch in the middle of a system call:

{noformat}
(gdb) whe
#0  0x0804a3bd in ctxsw_handler (sig=23)
    at kernel/os/src/arch/sim/os_arch_sim.c:150
#1  <signal handler called>
#2  0xf7ffdbe7 in __kernel_vsyscall ()
#3  0x08097630 in __lll_lock_wait_private ()
#4  0x080923b0 in __tz_convert ()
#5  0x08091673 in localtime ()
#6  0x0809162c in ctime ()
#7  0x08048a5a in task1_handler (arg=0x0) at apps/slinky/src/main.c:162
#8  0x0804a2c8 in os_arch_task_start (sf=0x8160314, rc=1)
    at kernel/os/src/arch/sim/os_arch_sim.c:88
#9  0x0804ad90 in os_arch_frame_init ()
    at kernel/os/src/arch/sim/os_arch_stack_frame.s:98
#10 0x0804ad90 in os_arch_frame_init ()
    at kernel/os/src/arch/sim/os_arch_stack_frame.s:98
{noformat}

Attached is a simple Mynewt app that can be used to replicate this issue 
(main.c).


> Sim - deadlock involving system calls
> -------------------------------------
>
>                 Key: MYNEWT-745
>                 URL: https://issues.apache.org/jira/browse/MYNEWT-745
>             Project: Mynewt
>          Issue Type: Bug
>            Reporter: Christopher Collins
>             Fix For: v1_1_0_rel
>
>         Attachments: main.c
>



