I brought up a similar issue some time back (http://www.opensolaris.org/jive/thread.jspa?threadID=20355) which pertained to ligHTTPd. A fix for that issue has been integrated into snv_57; however, I'm seeing a similar situation with other applications, including Mongrel and Apache, and I'm uncertain whether it's the same issue.
The symptoms are always the same: some process stops responding or, less frequently, goes nuts. The user typically attempts to restart the daemon and can't. They then try to 'kill -9' the process but can't. As a last-ditch effort they reboot their zone, which then ends up in a "hung state" (i.e., perpetual "shutting down"). The result for us is that we end up needing to reboot the entire system to clear just that one zone.

I'm still confused by the fact that any process can become so wedged that the only course of action is to reboot the entire system. Even if a process isn't responding or answering signals, why can't the scheduler just dump the process and its mapped memory?

In general this issue is really hurting Solaris's image. I'm getting piles of FUD each time we hit one of these issues.

When this happened today I tried to dig up as much information as possible to demonstrate the issue:

[globalzone:/] root# zoneadm list -vc
ID NAME STATUS PATH
0 global running /
64 zone123 shutting_down /zones/zone123
...

[globalzone:/] root# ps -efZ | grep zone123
global root 13404 1 0 Feb 22 ? 0:01 zoneadmd -z zone123
zone123 root 14220 1 0 Feb 22 ? 0:00 zsched
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932 1 0 19:51:29 ? 0:00 ruby ./script/backgroundrb start

[globalzone:/] root# truss -pf 15389
truss: unanticipated system error: 15389

[globalzone:/] root# truss -pf 13404
13404/4: door_return(0x00000000, 0, 0x00000000, 0xFE65DE00, 1007360) (sleeping...)
13404/3: door_return(0x00000000, 0, 0x00000000, 0xFE779E00, 1007360) (sleeping...)
13404/2: door_unref() (sleeping...)
13404/1: pollsys(0x08046BD0, 4, 0x00000000, 0x00000000) (sleeping...)

[globalzone:/] root# kill 15389
[globalzone:/] root# ps -efZ | grep 15389
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start

[globalzone:/] root# kill -9 15389
[globalzone:/] root# ps -efZ | grep 15389
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start

[globalzone:/] root# pstack 15389
pstack: cannot examine 15389: unanticipated system error

[globalzone:/] root# pflags 15389
15389: /opt/csw/apache2/sbin/httpd -k start
data model = _ILP32 flags = ORPHAN|MSACCT|MSFORK
sigpend = 0x0040c101,0x00000000
/1: flags = 0

[globalzone:/] root# pfiles 15389
pfiles: unanticipated system error: 15389

[globalzone:/] root# pldd 15389
pldd: cannot examine 15389: unanticipated system error

[globalzone:/] root# kill -9 13404
[globalzone:/] root# zoneadm list -vc
ID NAME STATUS PATH
0 global running /
64 zone123 shutting_down /zones/zone123
...

[globalzone:/] root# ps -efZ | grep zone123
zone123 root 14220 1 0 Feb 22 ? 0:00 zsched
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932 1 0 19:51:29 ? 0:00 ruby ./script/backgroundrb start

[globalzone:/] root# kill -9 14220
[globalzone:/] root# ps -efZ | grep zone123
zone123 root 14220 1 0 Feb 22 ? 0:00 zsched
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932 1 0 19:51:29 ? 0:00 ruby ./script/backgroundrb start

[globalzone:/] root# ps -efZ | grep zone123
zone123 root 14220 1 0 Feb 22 ? 0:00 zsched
zone123 nobody 15389 1 0 19:46:26 ? 0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932 1 0 19:51:29 ? 0:00 ruby ./script/backgroundrb start

[globalzone:/] root# dtrace -n profile-1234hz'/pid == 15932/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# dtrace -n profile-1234hz'/pid == 14220/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# dtrace -n profile-1234hz'/pid == 15389/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs ip sctp usba lofs zfs random crypto ptm fcip fctl md fcp logindmux nfs ipc ]
> ::zone
ADDR ID NAME PATH
fffffffffbcb9420 0 global /
ffffff691158d1c0 64 zone123 /zones/zone123/root/
...
> ::ps -z
S PID PPID PGID SID ZONE UID FLAGS ADDR NAME
R 15932 1 24851 24851 64 113 0x42030d00 ffffff692a3b6138 ruby
R 15389 1 11756 11756 64 60001 0x52020d00 fffffe81b21dc0c0 httpd
> ffffff692a3b6138::stack
> fffffe81b21dc0c0::stack
0xfffffe81015d2378()
> ffffff692a3b6138::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR
ffffff692a3b6138 run ffbf ffff 0 0 0 0 8060938
> fffffe81b21dc0c0::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR
fffffe81b21dc0c0 run ff69 ffff 0 0 0 0 80accf2
> ::status
debugging live kernel (64-bit) on globalzone
operating system: 5.11 snv_43 (i86pc)
> ffffff692a3b6138::kill
mdb: command is not supported by current target
> fffffe81b21dc0c0::kill
mdb: command is not supported by current target
> ffffff692a3b6138::pte
PTE=ffffff692a3b6138: noexec page=0xffff692a3b6 global uncached !VALID
> fffffe81b21dc0c0::pte
PTE=fffffe81b21dc0c0: noexec page=0xfffe81b21dc wrback !VALID
> fffffe81b21dc0c0::whatis
fffffe81b21dc0c0 is fffffe81b21dc0c0+0, allocated from process_cache
> ffffff692a3b6138::whatis
ffffff692a3b6138 is ffffff692a3b6138+0, allocated from process_cache

Ideas are appreciated.

benr.
This message posted from opensolaris.org