On Thu, Sep 10, 2020 at 5:09 PM Kurtis Rader <kra...@skepticism.us> wrote:
>
> A defunct process is a process that has terminated but whose parent
> process has not called wait() or one of its variants. I don't know why
> lsof still reports open files. It shouldn't, since a dead process should
> have its resources, such as its file descriptor table, freed by the
> kernel even if the parent hasn't called wait(). You didn't tell us the
> details of the OS you're using, so I would simply assume it's a quirk of
> your OS. It might be more productive to look into why your program is
> panicking at map_faststr.go:275. A likely explanation is that you have a
> race in your program that is causing it to attempt to mutate a map
> concurrently, or you're trying to insert into a nil map.
That's a good point. What OS are you using? I don't think you said.

Ian

> On Thu, Sep 10, 2020 at 4:43 PM Uday Kiran Jonnala <judayki...@gmail.com> wrote:
>>
>> Hi Ian,
>>
>> Again, thanks for the reply. The problem here is that we see the Go
>> process as a defunct process, and the parent process definitely did not
>> get SIGCHLD. Looking deeper, I see a thread in futex_wait_queue_me. If
>> we are just looking at a leftover stack trace and the Go process
>> actually got killed, why would I still see the associated fds in the
>> file table? The fd table is still intact (see the lsof output).
>>
>> The process that panicked and is now in the defunct state is 87548.
>> Checking for threads within it:
>>
>> bash-4.2# cat /proc/87548/status
>> Name: replicator
>> State: Z (zombie)
>>
>> bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
>> l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
>> l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
>>
>> Listing the threads:
>>
>> bash-4.2# ps -aefT | grep 87548
>> root 87548 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>> root 87548 87561 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>> root 112448 112448 42566 0 17:13 pts/0 00:00:00 grep 87548
>>
>> bash-4.2# lsof | grep 606649
>> replicato 87548 87561 root 1w FIFO 0,11 0t0 606649 pipe
>> replicato 87548 87561 root 2w FIFO 0,11 0t0 606649 pipe
>>
>> Why does lsof show an entry for the FIFO of this process?
>>
>> So I suspect a scenario where the thread sleeping in
>> futex_wait_queue_me was not cleaned up during panic(): the main thread
>> exited, leaving behind a detached thread that is still waiting in
>> futex_wait_queue_me.
>>
>> The main issue is that I am not able to reproduce this, since this Go
>> process is very large.
>>
>> Is there any way to verify this or take it further?
>>
>> Thanks & Regards,
>> Uday Kiran
>>
>> On Monday, September 7, 2020 at 12:05:05 PM UTC-7 Ian Lance Taylor wrote:
>>>
>>> On Mon, Sep 7, 2020 at 12:03 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
>>> >
>>> > Thanks for the reply. I get the point about zombies, but I do not
>>> > think the issue here is the parent not reaping the child. It seems
>>> > like the Go process has not finished executing some internal threads
>>> > (waiting on some futex), which is preventing SIGCHLD from being sent
>>> > to the parent.
>>> >
>>> > The Go process named <replicator> hit a panic, and I see it went into
>>> > the zombie state:
>>> >
>>> > $ ps -ef | grep replicator
>>> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>>> >
>>> > Looking at the tasks within the process, I see the stack traces of
>>> > its threads are still stuck as follows:
>>> >
>>> > bash-4.2# cat /proc/87548/task/87561/stack
>>> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
>>> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
>>> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
>>> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
>>> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
>>> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >
>>> > Going by the example from my earlier mail: if we create some internal
>>> > threads and the main thread exits due to a panic, leaving some
>>> > detached threads behind, the process will stay in the zombie state
>>> > until the threads within the process complete.
>>> >
>>> > It appears some runaway-thread hung-state scenario is causing this. I
>>> > am not able to reproduce it with an explicit panic in the main
>>> > goroutine while another goroutine is still executing.
>>> >
>>> > Does the above stack trace sound familiar wrt internal threads of the
>>> > Go runtime?
>>>
>>> If the process is defunct, then none of the thread stacks matter.
>>> They are just where the thread happened to be when the process exited.
>>>
>>> What is the real problem you are seeing?
>>>
>>> Ian
>>>
>>> > On Thursday, August 27, 2020 at 1:43:39 PM UTC-7 Ian Lance Taylor wrote:
>>> >>
>>> >> On Thu, Aug 27, 2020 at 10:01 AM Uday Kiran Jonnala <juday...@gmail.com> wrote:
>>> >> >
>>> >> > I have run into a zombie-process scenario with a Go program.
>>> >> >
>>> >> > A process (in this case, replicator) has many goroutines internally.
>>> >> >
>>> >> > We hit a panic(), and I see the replicator process is in the zombie state:
>>> >> >
>>> >> > <<>>>:~$ ps -ef | grep replicator
>>> >> >
>>> >> > root 87548 87507 0 Aug23 ? 00:00:00 [replicator] <defunct>
>>> >> >
>>> >> > The main goroutine (or its supporting P) exited, but the panic left
>>> >> > another thread behind in a blocked state (the main thread could be
>>> >> > 87548, and the supporting thread 87561 is still there):
>>> >> >
>>> >> > bash-4.2# ls -Fl /proc/87548/task/87561/fd | grep 606649
>>> >> > l-wx------. 1 root root 64 Aug 25 10:59 1 -> pipe:[606649]
>>> >> > l-wx------. 1 root root 64 Aug 25 10:59 2 -> pipe:[606649]
>>> >> >
>>> >> > Stack trace:
>>> >> >
>>> >> > bash-4.2# cat /proc/87548/task/87561/stack
>>> >> > [<ffffffffbb114714>] futex_wait_queue_me+0xc4/0x120
>>> >> > [<ffffffffbb11520a>] futex_wait+0x10a/0x250
>>> >> > [<ffffffffbb1182ce>] do_futex+0x35e/0x5b0
>>> >> > [<ffffffffbb11865b>] SyS_futex+0x13b/0x180
>>> >> > [<ffffffffbb003c09>] do_syscall_64+0x79/0x1b0
>>> >> > [<ffffffffbba00081>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> >
>>> >> > We hit a panic internally from the main goroutine:
>>> >> >
>>> >> > fatal error: concurrent map writes
>>> >> >
>>> >> > goroutine 666359 [running]:
>>> >> > runtime.throw(0x101d6ae, 0x15)
>>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/panic.go:608 +0x72 fp=0xc00374b6f0 sp=0xc00374b6c0 pc=0x42da62
>>> >> > runtime.mapassign_faststr(0xdb71c0, 0xc00023f5f0, 0xc000aca990, 0x83, 0xc0009d03c8)
>>> >> >     /home/ll/ntnx/toolchain-builds/78ae837ba07c8ef8f0ea782407d8d4626815552b.x86_64/go/src/runtime/map_faststr.go:275 +0x3bf fp=0xc00374b758 sp=0xc00374b6f0 pc=0x41527f
>>> >> > github.eng.nutanix.com/xyz/abc/metadata.UpdateRecvInProgressFlag(0xc000aca990, 0x83, 0x0)
>>> >> >
>>> >> > .......
>>> >> >
>>> >> > goroutine 665516 [chan receive, 2 minutes]:
>>> >> > zeus.(*Leadership).LeaderValue.func1(0xc003d5c120, 0x0, 0xc002e906c0, 0x52, 0xc00302ec60, 0x29)
>>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:244 +0x34
>>> >> > created by zeus.(*Leadership).LeaderValue
>>> >> >     /home/ll/ntnx/main/build/.go/src/zeus/leadership.go:243 +0x277
>>> >> > 2020-08-03 00:35:04 rolled over log file
>>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.426906 196123 dataset.go:26] initialize zfs linking
>>> >> > ERROR: logging before flag.Parse: I0803 00:35:04.433296 196123 dataset.go:34] completed zfs linking successfully
>>> >> > I0803 00:35:04.433447 196123 main.go:86] Gflags passed NodeUuid: c238e584-0eeb-48bd-b299-2a25b13602f1, External Ip: 10.15.96.163
>>> >> > I0803 00:35:04.433460 196123 main.go:99] Component name using for this process : abc-c238e584-0eeb-48bd-b299-2a25b13602f1
>>> >> > I0803 00:35:04.433467 196123 main.go:120] Trying to initialize DB
>>> >> >
>>> >> > If there is a panic() from the main thread, as I understand it, we
>>> >> > exit() and clean up all threads of the process.
>>> >> >
>>> >> > Are we hitting the following scenario? (I have not looked into the
>>> >> > M-P-G implementation in detail.)
>>> >> >
>>> >> > Example:
>>> >> >
>>> >> > #include <stdio.h>
>>> >> > #include <pthread.h>
>>> >> > #include <unistd.h>
>>> >> > #include <stdlib.h>
>>> >> >
>>> >> > void *thread_function(void *args)
>>> >> > {
>>> >> >     printf("This is the new thread! Sleep 100 seconds...\n");
>>> >> >     sleep(100);
>>> >> >     printf("Exit from thread\n");
>>> >> >     pthread_exit(0);
>>> >> > }
>>> >> >
>>> >> > int main(int argc, char **argv)
>>> >> > {
>>> >> >     pthread_t thrd;
>>> >> >     pthread_attr_t attr;
>>> >> >     int res = 0;
>>> >> >     res = pthread_attr_init(&attr);
>>> >> >     res = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
>>> >> >     res = pthread_create(&thrd, &attr, thread_function, NULL);
>>> >> >     res = pthread_attr_destroy(&attr);
>>> >> >     printf("Main thread. Sleep 5 seconds\n");
>>> >> >     sleep(5);
>>> >> >     printf("Exit from main process\n");
>>> >> >     pthread_exit(0);
>>> >> > }
>>> >> >
>>> >> > kkk@ ~/mycode/go () $ ./a.out &
>>> >> > [1] 108418
>>> >> > Main thread. Sleep 5 seconds
>>> >> > This is the new thread! Sleep 100 seconds...
>>> >> > kkk@ ~/mycode/go () $
>>> >> > Exit from main process
>>> >> > PID TTY TIME CMD
>>> >> > 49313 pts/26 00:00:01 bash
>>> >> > 108418 pts/26 00:00:00 [a.out] <defunct>
>>> >> > 108449 pts/26 00:00:00 ps
>>> >> >
>>> >> > See, the main process is <defunct> and the child thread is still hanging around:
>>> >> >
>>> >> > kkk@ ~/mycode/go () $ sudo cat /proc/108418/task/108420/stack
>>> >> > [<ffffffff810b4c1d>] hrtimer_nanosleep+0xbd/0x1d0
>>> >> > [<ffffffff810b4dae>] SyS_nanosleep+0x7e/0x90
>>> >> > [<ffffffff816a63c9>] system_call_fastpath+0x16/0x1b
>>> >> > [<ffffffffffffffff>] 0xffffffffffffffff
>>> >> > ujonnala@ ~/mycode/go () $ Exit from thread
>>> >> >
>>> >> > Any help in this regard is appreciated.
>>> >>
>>> >> I think you are misreading something somewhere. Zombie status is a
>>> >> feature of a process, not a thread. It means that the child process
>>> >> has exited but that the parent process, the one which started the
>>> >> child process via the fork system call (or, on GNU/Linux, the clone
>>> >> system call), has not called the wait (or waitpid or wait3 or wait4)
>>> >> system call to collect its status.
>>> >>
>>> >> So don't look at threads or P's.
>>> >> Look at the parent process that started the process that became a zombie.
>>> >>
>>> >> Ian
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups "golang-nuts" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
>>> > To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/f70e42f4-622d-4d91-b51d-ed00f2e11ac4n%40googlegroups.com.
>
> --
> Kurtis Rader
> Caretaker of the exceptional canines Junior and Hank