On Thu, Sep 10, 2020 at 5:09 PM Kurtis Rader <kra...@skepticism.us> wrote:
>
> A defunct process is a process that has terminated but whose parent
> process has not called wait() or one of its variants. I don't know why
> lsof still reports open files. It shouldn't, since a dead process should
> have its resources, such as its file descriptor table, freed by the
> kernel even if the parent hasn't called wait(). You didn't tell us the
> details of the OS you're using, so I would simply assume it's a quirk of
> your OS. It might be more productive to look into why your program is
> panicking at map_faststr.go:275. A likely explanation is that you have a
> race in your program that is causing it to attempt to mutate a map
> concurrently, or you're trying to insert into a nil map.
That's a good point. What OS are you using? I don't think you said.

Ian

> On Thu, Sep 10, 2020 at 4:43 PM Uday Kiran Jonnala <judayki...@gmail.com> wrote:
>>
>> Hi Ian,
>>
>> Again, thanks for the reply. The problem here is that we see the Go
>> process as a defunct process, and the parent process definitely did not
>> get SIGCHLD. Looking deeper, I see a thread in futex_wait_queue_me. If
>> we are just looking at a leftover stack trace and the Go process
>> actually got killed, why would I still see the associated fds in the
>> file table? The fd table is still intact (see the lsof output).
>>
>> The process that panicked and is now in the defunct state is 87548.
>> Checking for threads within it:
>>
>> bash-4.2# cat /proc/87548/status
>> Name: replicator
>> State: Z (zombie)
>>
>> bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
>> l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
>> l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
>>
>> Listing the threads:
>>
>> bash-4.2# ps -aefT | grep 87548
>> root 87548 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>> root 87548 87561 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>> root 112448 112448 42566 0 17:13 pts/0 00:00:00 grep 87548
>>
>> bash-4.2# lsof | grep 606649
>> replicato 87548 87561 root 1w FIFO 0,11 0t0 606649 pipe
>> replicato 87548 87561 root 2w FIFO 0,11 0t0 606649 pipe
>>
>> Why does lsof show an entry for the FIFO of this process?
>>
>> So I suspect a scenario where the thread sleeping in
>> futex_wait_queue_me was not cleaned up during panic(): the main thread
>> exited, leaving behind a detached thread that is still waiting in
>> futex_wait_queue_me.
>>
>> The main issue is that I am not able to reproduce this, since this Go
>> process is very large.
>>
>> Is there any way to verify this or take it further?
>>
>> Thanks & Regards,
>> Uday Kiran
>>
>> On Monday, September 7, 2020 at 12:05:05 PM UTC-7 Ian Lance Taylor wrote:
>>>
>>> On Mon, Sep 7, 2020 at 12:03 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
>>> >
>>> > Thanks for the reply. I get the point about zombies, but I do not
>>> > think the issue here is the parent not reaping the child. It seems
>>> > like the Go process has not finished executing some internal threads
>>> > (waiting on some futex), which is preventing SIGCHLD from being sent
>>> > to the parent.
>>> >
>>> > The Go process named <replicator> hit a panic, and I see it went into
>>> > the zombie state:
>>> >
>>> > $ ps -ef | grep replicator
>>> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>>> >
>>> > Looking at the tasks within the process, I see the stack traces of
>>> > its threads are still stuck as follows:
>>> >
>>> > bash-4.2# cat /proc/87548/task/87561/stack
>>> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
>>> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
>>> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
>>> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
>>> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
>>> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >
>>> > Going by the example from my earlier mail: if we create some internal
>>> > threads and the main thread exits due to a panic, leaving some
>>> > detached threads behind, the process will stay in the zombie state
>>> > until the threads within the process complete.
>>> >
>>> > It appears some runaway-thread hung-state scenario is causing this. I
>>> > am not able to reproduce it with an explicit panic in the main
>>> > goroutine while another goroutine is still executing.
>>> >
>>> > Does the above stack trace sound familiar wrt internal threads of the
>>> > Go runtime?
>>>
>>> If the process is defunct, then none of the thread stacks matter.
>>> They are just where the thread happened to be when the process exited.
>>>
>>> What is the real problem you are seeing?
>>>
>>> Ian
>>>
>>> > On Thursday, August 27, 2020 at 1:43:39 PM UTC-7 Ian Lance Taylor wrote:
>>> >>
>>> >> On Thu, Aug 27, 2020 at 10:01 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
>>> >> >
>>> >> > I have run into a zombie-process scenario with a Go program.
>>> >> >
>>> >> > A process (in this case, replicator) has many goroutines internally.
>>> >> >
>>> >> > We hit a panic(), and I see the replicator process is in the zombie state:
>>> >> >
>>> >> > <<>>>:~$ ps -ef | grep replicator
>>> >> >
>>> >> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>>> >> >
>>> >> > The main goroutine (or its supporting P) exited, but the panic left
>>> >> > another thread behind in a blocked state (the main thread could be
>>> >> > 87548, and the supporting thread 87561 is still there):
>>> >> >
>>> >> > bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
>>> >> > l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
>>> >> > l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
>>> >> >
>>> >> > Stack trace:
>>> >> >
>>> >> > bash-4.2# cat /proc/87548/task/87561/stack
>>> >> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
>>> >> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
>>> >> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
>>> >> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
>>> >> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
>>> >> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> >
>>> >> > We hit a panic internally from the main goroutine:
>>> >> >
>>> >> > fatal error: concurrent map writes
>>> >> >
>>> >> > goroutine 666359 [running]:
>>> >> > runtime.throw(0x101d6ae, 0x15)
>>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/panic.go:608 +0x72 fp=0xc00374b6f0 sp=0xc00374b6c0 pc=0x42da62
>>> >> > runtime.mapassign_faststr(0xdb71c0, 0xc00023f5f0, 0xc000aca990, 0x83, 0xc0009d03c8)
>>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/map_faststr.go:275 +0x3bf fp=0xc00374b758 sp=0xc00374b6f0 pc=0x41527f
>>> >> > github.eng.nutanix.com/xyz/abc/metadata.UpdateRecvInProgressFlag(0xc000aca990, 0x83, 0x0)
>>> >> >
>>> >> > .......
>>> >> >
>>> >> > goroutine 665516 [chan receive, 2 minutes]:
>>> >> > zeus.(*Leadership).LeaderValue.func1(0xc003d5c120, 0x0, 0xc002e906c0, 0x52, 0xc00302ec60, 0x29)
>>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:244 +0x34
>>> >> > created by zeus.(*Leadership).LeaderValue
>>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:243 +0x277
>>> >> > 2020-08-03 00:35:04 rolled over log file
>>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.426906 196123 dataset.go:26] initialize zfs linking
>>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.433296 196123 dataset.go:34] completed zfs linking successfully
>>> >> > I0803 00:35:04.433447 196123 main.go:86] Gflags passed NodeUuid: c238e584-0eeb-48bd-b299-2a25b13602f1, External Ip: 10.15.96.163
>>> >> > I0803 00:35:04.433460 196123 main.go:99] Component name using for this process : abc-c238e584-0eeb-48bd-b299-2a25b13602f1
>>> >> > I0803 00:35:04.433467 196123 main.go:120] Trying to initialize DB
>>> >> >
>>> >> > If there is a panic() from the main thread, as I understand it, we
>>> >> > exit() and clean up all threads of the process.
>>> >> >
>>> >> > Are we hitting the following scenario? (I have not looked into the
>>> >> > M-P-G implementation in detail.)
>>> >> >
>>> >> > Example:
>>> >> >
>>> >> > #include <stdio.h>
>>> >> > #include <pthread.h>
>>> >> > #include <unistd.h>
>>> >> > #include <stdlib.h>
>>> >> >
>>> >> > void *thread_function(void *args)
>>> >> > {
>>> >> >     printf("This is the new thread! Sleep 100 seconds...\n");
>>> >> >     sleep(100);
>>> >> >     printf("Exit from thread\n");
>>> >> >     pthread_exit(0);
>>> >> > }
>>> >> >
>>> >> > int main(int argc, char **argv)
>>> >> > {
>>> >> >     pthread_t thrd;
>>> >> >     pthread_attr_t attr;
>>> >> >     int res = 0;
>>> >> >     res = pthread_attr_init(&attr);
>>> >> >     res = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
>>> >> >     res = pthread_create(&thrd, &attr, thread_function, NULL);
>>> >> >     res = pthread_attr_destroy(&attr);
>>> >> >     printf("Main thread. Sleep 5 seconds\n");
>>> >> >     sleep(5);
>>> >> >     printf("Exit from main process\n");
>>> >> >     pthread_exit(0);
>>> >> > }
>>> >> >
>>> >> > kkk@ ~/mycode/go () $ ./a.out &
>>> >> > [1] 108418
>>> >> > Main thread. Sleep 5 seconds
>>> >> > This is the new thread! Sleep 100 seconds...
>>> >> > kkk@ ~/mycode/go () $
>>> >> > Exit from main process
>>> >> > PID TTY TIME CMD
>>> >> > 49313 pts/26 00:00:01 bash
>>> >> > 108418 pts/26 00:00:00 [a.out] <defunct>
>>> >> > 108449 pts/26 00:00:00 ps
>>> >> >
>>> >> > See, the main process is <defunct> and the child thread is still hanging around:
>>> >> >
>>> >> > kkk@ ~/mycode/go () $ sudo cat /proc/108418/task/108420/stack
>>> >> > [<ffffffff810b4c1d>] hrtimer_nanosleep+0xbd/0x1d0
>>> >> > [<ffffffff810b4dae>] SyS_nanosleep+0x7e/0x90
>>> >> > [<ffffffff816a63c9>] system_call_fastpath+0x16/0x1b
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> > ujonnala@ ~/mycode/go () $ Exit from thread
>>> >> >
>>> >> > Any help in this regard is appreciated.
>>> >>
>>> >> I think you are misreading something somewhere. Zombie status is a
>>> >> feature of a process, not a thread. It means that the child process
>>> >> has exited but that the parent process, the one which started the
>>> >> child process via the fork system call (or, on GNU/Linux, the clone
>>> >> system call), has not called the wait (or waitpid or wait3 or wait4)
>>> >> system call to collect its status.
>>> >>
>>> >> So don't look at threads or P's.
>>> >> Look at the parent process that started the process that became a zombie.
>>> >>
>>> >> Ian
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups "golang-nuts" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
>>> > To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/f70e42f4-622d-4d91-b51d-ed00f2e11ac4n%40googlegroups.com.
>
> --
> Kurtis Rader
> Caretaker of the exceptional canines Junior and Hank