On 3/9/2017 4:59 PM, Bryan Drewery wrote:
> On 3/9/2017 3:57 PM, Bryan Drewery wrote:
>> On 3/9/2017 3:47 PM, Bryan Drewery wrote:
>>> On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
>>>> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>>>>> Yes, there is a race, apparently, with the child zombie still not 
>>>>> finishing
>>>>> sending the SIGCHLD to the parent and parent exiting.  The following 
>>>>> should
>>>>> fix the issue, but I do not think that reproducing the problem is easy.
>>>>
>>>>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>>>>> index c524fe5df37..ba5ff84e9de 100644
>>>>> --- a/sys/kern/kern_exit.c
>>>>> +++ b/sys/kern/kern_exit.c
>>>>> @@ -189,6 +189,7 @@ exit1(struct thread *td, int rval, int signo)
>>>>>  {
>>>>>   struct proc *p, *nq, *q, *t;
>>>>>   struct thread *tdt;
>>>>> + ksiginfo_t ksi;
>>>>>  
>>>>>   mtx_assert(&Giant, MA_NOTOWNED);
>>>>>   KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo));
>>>>> @@ -456,7 +457,12 @@ exit1(struct thread *td, int rval, int signo)
>>>>>                   proc_reparent(q, q->p_reaper);
>>>>>                   if (q->p_state == PRS_ZOMBIE) {
>>>>>                           PROC_LOCK(q->p_reaper);
>>>>> -                         pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>>>>> +                         if (q->p_ksi != NULL) {
>>>>> +                                 ksiginfo_init(&ksi);
>>>>> +                                 ksiginfo_copy(q->p_ksi, &ksi);
>>>>> +                         }
>>>>> +                         pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=
>>>>> +                             NULL ? &ksi : NULL);
>>>>>                           PROC_UNLOCK(q->p_reaper);
>>>>>                   }
>>>>>           } else {
>>>
>>> I just got something weird with this patch that wasn't happening before:
>>>
>>> /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j
>>> exp-10amd64 -p commit -z test devel/ccache
>>> [poudriere runs and completes with exit status 0]
>>>> time: command terminated abnormally                                        
>>>>                  
>>>>        28.08 real         9.92 user        10.38 sys                       
>>>>                  
>>>>      23464  maximum resident set size                                      
>>>>                  
>>>>       4996  average shared memory size                                     
>>>>                  
>>>>         88  average unshared data size                                     
>>>>                  
>>>>        127  average unshared stack size                                    
>>>>                  
>>>>     282705  page reclaims                                                  
>>>>                  
>>>>       5623  page faults                                                    
>>>>                  
>>>>          0  swaps                                                          
>>>>                  
>>>>       2673  block input operations                                         
>>>>                  
>>>>       4836  block output operations                                        
>>>>                  
>>>>         33  messages sent                                                  
>>>>                  
>>>>          0  messages received                                              
>>>>                  
>>>>         37  signals received                                               
>>>>                  
>>>>      11226  voluntary context switches                                     
>>>>                  
>>>>        780  involuntary context switches                                   
>>>>                  
>>>> zsh: alarm      /usr/bin/time -l src/bin/poudriere -e /usr/local/etc 
>>>> testport -j exp-10amd64
>>> exit status: 142 (SIGALRM).
>>>
>>> I don't see time(1) using SIGALRM or proc reaper at all.
>>>
>>> Rerunning it, and trying other simpler test cases, does not produce the
>>> same result.  It may be some race unrelated to this patch, dunno.
>>>
>>
>> I'm consistently getting foreground processes getting the wrong signals
>> now. I'm removing this patch for now.
> 
> It wasn't this patch doing this. Something else is very wrong with
> signal handling right now.
> 

False alarm.  This spurious SIGALRM issue is purely my fault. Ignore
those reports.

>>
>>>
>>>>
>>>> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo
>>>> should always be the zombie's p_ksi; otherwise, the siginfo may be lost
>>>> if there are too many signals pending for the target process or in the
>>>> system. If the siginfo is lost and the reaper normally passes si_pid to
>>>> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie
>>>> will remain until the reaper terminates.
>>>>
>>>> Conceptually the siginfo is sent to one process at a time only, so the
>>>> bug is an artifact of the implementation. Perhaps the piece of code
>>>> added in r309886 can be moved or the ksiginfo can be removed from the
>>>> parent's queue.
>>>>
>>>> If such a fix is not possible, it may be better to send a bare SIGCHLD
>>>> (si_code is SI_KERNEL or 0, depending on how many signals are pending)
>>>> in this situation and document that reapers must use WAIT_ANY or P_ALL.
>>>> (However, compared to the pre-r309886 situation they can still use
>>>> SIGCHLD to get notified when to call waitpid() or similar.)
>>>>
>>>
>>>
>>
>>
> 
> 


-- 
Regards,
Bryan Drewery

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to