[ https://issues.apache.org/jira/browse/MESOS-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-873:
----------------------------------

    Summary: Crash in os::killtree on Mavericks (was: Crash in os::killtree on Mavericks )

> Crash in os::killtree on Mavericks
> ----------------------------------
>
>                 Key: MESOS-873
>                 URL: https://issues.apache.org/jira/browse/MESOS-873
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>         Environment: Mac OS X Mavericks
>            Reporter: Niklas Quarfot Nielsen
>            Assignee: Benjamin Hindman
>             Fix For: 0.19.0
>
>
> This is a crash we experienced on a Mavericks installation. We haven't been
> able to reproduce it on other machines since, but we managed to capture core
> files from the crashes.
> Here is the stack trace from the crashing thread:
>
>   thread #2: tid = 0x0001, 0x0000000106816de5 mesos-executor`os::process(int) + 4133, stop reason = signal SIGSTOP
>     frame #0: 0x0000000106816de5 mesos-executor`os::process(int) + 4133
>     frame #1: 0x000000010681734c mesos-executor`os::processes() + 316
>     frame #2: 0x0000000106817752 mesos-executor`os::killtree(int, int, bool, bool) + 66
>     frame #3: 0x0000000106819748 mesos-executor`mesos::internal::CommandExecutorProcess::shutdown(mesos::ExecutorDriver*) + 200
>     frame #4: 0x000000010798be70
>     frame #5: 0x000000010798be60
>     frame #6: 0x0000000106b21c20 libmesos-0.16.0.dylib`process::Event::~Event() + 32
>     frame #7: 0x90c307894810c083
>
> The reported stop reason is wrong: all threads in the core file are reported
> as stopped.
> Here is a snippet of the disassembly of the failing frame:
>
>      0x106817306: je     0x106817460   ; os::processes() + 592
>      0x10681730c: movq   16(%rsp), %rax
>      0x106817311: movq   296(%rsp), %rbx
>      0x106817319: leaq   16(%rax), %r14
>      0x10681731d: leaq   128(%rsp), %rax
>      0x106817325: addq   $8, %r14
>      0x106817329: movq   %rax, 24(%rsp)
>      0x10681732e: leaq   384(%rsp), %rbp
>      0x106817336: cmpq   %rbx, %r14
>      0x106817339: je     0x106817530   ; os::processes() + 800
>      0x10681733f: movl   32(%rbx), %esi
>      0x106817342: movq   24(%rsp), %rdi
>      0x106817347: callq  0x10681d5a0   ; symbol stub for: os::process(int)
>   -> 0x10681734c: movl   128(%rsp), %esi
>      0x106817353: testl  %esi, %esi
>      0x106817355: jne    0x1068173e0   ; os::processes() + 464
>      0x10681735b: movq   136(%rsp), %rsi
>      0x106817363: movq   %rbp, %rdi
>      0x106817366: callq  0x10681d58e   ; symbol stub for: os::Process::Process(os::Process const&)
>      0x10681736b: movl   $112, %edi
>      0x106817370: callq  0x10681d9e4   ; symbol stub for: operator new(unsigned long)
>
> While investigating the crash live in lldb, we concluded that using sysctl to
> get the argument count was probably the cause of the crash, but we still have
> no way to validate this.
> We can dig further into the core dump if you know of any suspected causes of
> the failure or where to look further.
> Also, since we haven't been able to reproduce the crash, we can probably mark
> this as won't fix if we don't hear of anyone else with the same problem.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
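The reporters suspected that reading a process's argument count through sysctl caused the crash. On macOS this is commonly done with the `{CTL_KERN, KERN_PROCARGS2, pid}` MIB, which is assumed here to return a buffer laid out as `argc` (an `int`), the executable path, NUL padding, then the argv strings. That layout comes from the XNU kernel rather than formal Apple documentation, so treat it as an assumption. One plausible failure mode, not confirmed for this crash, is parsing that buffer without checking the length sysctl actually returned (for example, when the target process exits between listing and inspection). The following sketch shows defensive, bounds-checked parsing. `parse_procargs2` is a hypothetical helper, not the code in Mesos's `os::process(int)`, and it operates on an already-fetched buffer so that it stays portable:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Mesos). Parses a buffer ASSUMED to have
 * the KERN_PROCARGS2 layout:
 *
 *   [int argc][exec path '\0'][padding '\0'...][argv[0] '\0']...[envp...]
 *
 * Every read is checked against `size`, so a short or truncated buffer
 * yields an error instead of an out-of-bounds read.
 * Returns 0 on success, -1 if the buffer is malformed.
 */
int parse_procargs2(const char *buf, size_t size,
                    int *argc_out, const char **argv0_out)
{
    int argc;
    size_t pos;

    /* The buffer must at least hold the leading argc field. */
    if (buf == NULL || size < sizeof(int)) {
        return -1;
    }
    memcpy(&argc, buf, sizeof(int)); /* avoids an unaligned int read */
    if (argc < 0) {
        return -1;
    }
    pos = sizeof(int);

    /* Skip the executable path; it must be NUL-terminated inside the buffer. */
    while (pos < size && buf[pos] != '\0') {
        pos++;
    }
    if (pos >= size) {
        return -1;
    }

    /* Skip the NUL padding that follows the executable path. */
    while (pos < size && buf[pos] == '\0') {
        pos++;
    }

    *argc_out = argc;
    if (argc == 0) {
        *argv0_out = NULL;
        return 0;
    }

    /* argv[0] must be a NUL-terminated string that lies inside the buffer. */
    if (pos >= size || memchr(buf + pos, '\0', size - pos) == NULL) {
        return -1;
    }
    *argv0_out = buf + pos;
    return 0;
}
```

The design choice is simply to treat the returned length as the only trustworthy bound: each field is validated before it is dereferenced, and a malformed buffer produces an error that the caller can handle (for example, by skipping that process) instead of crashing.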