Is a utrace engine with .report_jctl enabled supposed to handle the
do_notify_parent_cldstop(current, notify) processing for the last
stopping task? Or should it muck with task->ptrace to force
tracehook_notify_jctl() to return a non-zero value?
I ask because I have a simple multi-threaded process with a utrace engine
attached to the process group leader; .report_jctl is enabled.
If I SIGTSTP the process, occasionally control is not returned to the
shell.
On my 2.6.32-44.2.el6 kernel this happens because utrace_report_jctl()
drops task->sighand->siglock with spin_unlock_irq(), which breaks the
serialization on sig->group_stop_count that do_signal_stop() relies on
to decide whether to call do_notify_parent_cldstop(current, notify).
Let me explain:
Consider the following code fragment from do_signal_stop() in
kernel/signal.c, as shipped in the RHEL 6 beta2 kernel
2.6.32-44.2.el6:
1707  /*
1708   * If there are no other threads in the group, or if there is
1709   * a group stop in progress and we are the last to stop, report
1710   * to the parent. When ptraced, every thread reports itself.
1711   */
1712  notify = sig->group_stop_count == 1 ? CLD_STOPPED : 0;
1713  notify = tracehook_notify_jctl(notify, CLD_STOPPED);
1714  /*
1715   * tracehook_notify_jctl() can drop and reacquire siglock, so
1716   * we keep ->group_stop_count != 0 before the call. If SIGCONT
1717   * or SIGKILL comes in between ->group_stop_count == 0.
1718   */
1719  if (sig->group_stop_count) {
1720          if (!--sig->group_stop_count)
1721                  sig->flags = SIGNAL_STOP_STOPPED;
1722          current->exit_code = sig->group_exit_code;
1723          __set_current_state(TASK_STOPPED);
1724  }
1725  spin_unlock_irq(&current->sighand->siglock);
1726
1727  if (notify) {
1728          read_lock(&tasklist_lock);
1729          do_notify_parent_cldstop(current, notify);
1730          read_unlock(&tasklist_lock);
1731  }
1732
1733  /* Now we don't run again until woken by SIGCONT or SIGKILL */
1734  do {
1735          schedule();
1736  } while (try_to_freeze());
1737
1738  tracehook_finish_jctl();
1739  current->exit_code = 0;
1740
1741  return 1;
For the sake of discussion:
* Let the task group have 2 tasks; therefore initially
  sig->group_stop_count == 2.
* For both tasks, task_ptrace(current) returns zero
  (see tracehook_notify_jctl() for why this matters).
* Let task1 be the process group leader, and let it be the first task to
  execute do_signal_stop().
* Let task1 have a utrace engine attached with .report_jctl enabled, and
  let all engine ops be no-ops that simply return UTRACE_RESUME.
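To spell out why the task_ptrace(current) == 0 assumption matters, here is a minimal user-space sketch of how I understand the return value of tracehook_notify_jctl() to be computed. It is a model, not the kernel source: model_notify_jctl() and MODEL_CLD_STOPPED are hypothetical names, and the ptraced flag stands in for task_ptrace(current) being non-zero.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's CLD_STOPPED; only non-zero matters. */
enum { MODEL_CLD_STOPPED = 5 };

/*
 * Model (an assumption, not kernel code) of the value handed back to
 * do_signal_stop(): the caller's notify is passed through unchanged,
 * and only a ptraced task gets "why" substituted for a zero notify.
 */
static int model_notify_jctl(int notify, int why, bool ptraced)
{
    if (notify)
        return notify;
    return ptraced ? why : 0;
}
```

Under this model, a task that is not ptraced and passes in notify == 0 always gets zero back, which is the case both tasks hit in the scenario.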
Now when I send a SIGTSTP via Ctrl-Z on the terminal of this
multi-threaded process, the following can happen:
* At line 1713, task1 calls tracehook_notify_jctl() with notify == 0
  because sig->group_stop_count == 2.
* Because task1 has a utrace engine with .report_jctl enabled, it releases
  task->sighand->siglock in utrace_report_jctl().
* Now task2 may enter do_signal_stop() with task->sighand->siglock held.
* For task2, sig->group_stop_count == 2 is still true: task1 is either
  off executing utrace code or waiting on the siglock held by task2, so
  task1 has not yet executed line 1720.
* For task2, because sig->group_stop_count == 2 and because
  tracehook_notify_jctl(notify, CLD_STOPPED) returns zero, notify == 0.
* Therefore, when task2 executes line 1727, do_notify_parent_cldstop() is
  not called.
* After task2 releases the lock, task1 continues. Unfortunately, when task1
  set its "notify" cookie, sig->group_stop_count was still 2, and
  tracehook_notify_jctl(notify, CLD_STOPPED) returned zero because notify
  was initially zero and task_ptrace(current) returned zero.
* Therefore, for task1, notify == 0 after tracehook_notify_jctl().
* Finally, when task1 executes line 1727, do_notify_parent_cldstop() is
  not called either.
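The interleaving above can be replayed deterministically in user space. The following is only a sketch of the bookkeeping, assuming neither task is ptraced; struct model_signal, sample_notify(), finish_stop() and replay_race() are hypothetical names I made up for illustration, and no real locking or scheduling is involved.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's CLD_STOPPED. */
enum { MODEL_CLD_STOPPED = 5 };

struct model_signal {
    int group_stop_count;      /* mirrors sig->group_stop_count */
    bool stop_stopped;         /* mirrors SIGNAL_STOP_STOPPED being set */
    int parent_notifications;  /* times do_notify_parent_cldstop() would run */
};

/* Lines 1712-1713: sample the count; not ptraced, so notify passes through. */
static int sample_notify(const struct model_signal *sig)
{
    return sig->group_stop_count == 1 ? MODEL_CLD_STOPPED : 0;
}

/* Lines 1719-1731: decrement the count, then notify only if asked to. */
static void finish_stop(struct model_signal *sig, int notify)
{
    if (sig->group_stop_count) {
        if (--sig->group_stop_count == 0)
            sig->stop_stopped = true;
    }
    if (notify)
        sig->parent_notifications++;
}

/* task1 samples, drops the lock, task2 runs to completion, task1 resumes. */
static struct model_signal replay_race(void)
{
    struct model_signal sig = { .group_stop_count = 2 };

    int notify1 = sample_notify(&sig);  /* task1: count == 2, so 0 */
    /* task1 drops siglock inside utrace_report_jctl(); task2 gets in. */
    int notify2 = sample_notify(&sig);  /* task2: count still 2, so 0 */
    finish_stop(&sig, notify2);         /* task2: count 2 -> 1, no notify */
    /* task1 reacquires siglock and carries on with its stale notify1. */
    finish_stop(&sig, notify1);         /* task1: count 1 -> 0, no notify */

    return sig;
}
```

In this model replay_race() ends with group_stop_count at 0 and the stopped flag set, yet parent_notifications is still 0, which matches the hang I see.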
The result is a Ctrl-Z that does not return control to the parent,
because line 1729 is never executed. One possible fix is to re-examine
sig->group_stop_count after tracehook_notify_jctl() returns, with
something like:
    notify = notify ?: sig->group_stop_count == 1 ? CLD_STOPPED : 0;
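For anyone not used to the GNU "?:" shorthand, here is the same re-check written out as a plain C function that can be exercised in user space. This is a sketch of the proposed expression only, not a tested kernel patch; recheck_notify() and MODEL_CLD_STOPPED are hypothetical names.

```c
/* Hypothetical stand-in for the kernel's CLD_STOPPED. */
enum { MODEL_CLD_STOPPED = 5 };

/*
 * Portable spelling of the proposed re-check: keep a non-zero notify
 * from the tracehook, otherwise look at the count again now that the
 * siglock has been reacquired.
 */
static int recheck_notify(int notify, int group_stop_count)
{
    if (notify)
        return notify;
    return group_stop_count == 1 ? MODEL_CLD_STOPPED : 0;
}
```

In the scenario above, task1 comes back from the tracehook with notify == 0 and finds the count at 1, so the re-check yields CLD_STOPPED and the parent would be notified. If SIGCONT or SIGKILL has already zeroed the count, the result stays 0.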