I brought up a similar issue some time back 
(http://www.opensolaris.org/jive/thread.jspa?threadID=20355) which pertained to 
lighttpd.  A fix for that issue has been integrated into snv_57; however, I'm 
seeing a similar situation with other applications, including Mongrel and 
Apache, and I'm uncertain whether it's the same issue.

The symptoms are always the same: some process stops responding or, less 
frequently, goes nuts.  The user typically attempts to restart the daemon and 
can't.  They then try to 'kill -9' the process but can't.  As a last-ditch 
effort they reboot their zone, which then ends up in a "hung state" (i.e., 
perpetual "shutting down").  The result for us is that we end up needing to 
reboot the entire system to clear just that one zone.

I'm still confused by the fact that any process can become so wedged that the 
only course of action is to reboot the entire system.  Even if a process isn't 
responding to or answering signals, why can't the scheduler just dump the 
process and its mapped memory?

In general this issue is really hurting Solaris's image.  I'm getting piles of 
FUD each time we hit one of these issues.

When this happened today I tried to dig up as much information as possible to 
demonstrate the issue:

[globalzone:/] root# zoneadm list -vc
  ID NAME             STATUS         PATH                          
   0 global           running        /                             
  64 zone123         shutting_down  /zones/zone123               
...
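
A stuck zone like the one above can be picked out of the listing mechanically. The following is only a minimal sketch, assuming the four-column `zoneadm list -vc` layout shown here (ID, NAME, STATUS, PATH); the function name `stuck_zones` and the sample text are illustrative, not part of any Solaris tool.

```python
def stuck_zones(listing, state="shutting_down"):
    """Return the names of zones whose STATUS column equals `state`.

    Assumes whitespace-separated columns in the order ID NAME STATUS PATH,
    as in the `zoneadm list -vc` output quoted in this post.
    """
    stuck = []
    for line in listing.splitlines():
        fields = line.split()
        # Header rows and elided lines ("...") never have `state` in column 3.
        if len(fields) >= 4 and fields[2] == state:
            stuck.append(fields[1])
    return stuck


# Sample text copied from the listing in this post.
sample = """\
  ID NAME             STATUS         PATH
   0 global           running        /
  64 zone123         shutting_down  /zones/zone123
"""

print(stuck_zones(sample))  # -> ['zone123']
```

In practice the text would come from running `zoneadm list -vc` in the global zone; that step is left out here so the sketch runs anywhere.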

[globalzone:/] root# ps -efZ | grep zone123
  global     root 13404     1   0   Feb 22 ?           0:01 zoneadmd -z zone123
zone123     root 14220     1   0   Feb 22 ?           0:00 zsched
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932     1   0 19:51:29 ?           0:00 ruby ./script/backgroundrb start

[globalzone:/] root# truss -pf 15389
truss: unanticipated system error: 15389

[globalzone:/] root# truss -pf 13404
13404/4:        door_return(0x00000000, 0, 0x00000000, 0xFE65DE00, 1007360) (sleeping...)
13404/3:        door_return(0x00000000, 0, 0x00000000, 0xFE779E00, 1007360) (sleeping...)
13404/2:        door_unref()                    (sleeping...)
13404/1:        pollsys(0x08046BD0, 4, 0x00000000, 0x00000000) (sleeping...)

[globalzone:/] root# kill 15389
[globalzone:/] root# ps -efZ | grep 15389
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start

[globalzone:/] root# kill -9 15389
[globalzone:/] root# ps -efZ | grep 15389
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start

[globalzone:/] root# pstack 15389
pstack: cannot examine 15389: unanticipated system error

[globalzone:/] root# pflags 15389
15389:  /opt/csw/apache2/sbin/httpd -k start
        data model = _ILP32  flags = ORPHAN|MSACCT|MSFORK
        sigpend = 0x0040c101,0x00000000
 /1:    flags = 0
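
The sigpend value in the pflags output above can be decoded by hand. This is a minimal sketch, assuming the value is printed as 32-bit words in which bit n-1 stands for signal n, and using the Solaris signal numbering from <signal.h>; the function names are illustrative.

```python
# Solaris signal numbers 1..25 (assumed from <signal.h> on Solaris).
SOLARIS_SIGNALS = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 4: "SIGILL", 5: "SIGTRAP",
    6: "SIGABRT", 7: "SIGEMT", 8: "SIGFPE", 9: "SIGKILL", 10: "SIGBUS",
    11: "SIGSEGV", 12: "SIGSYS", 13: "SIGPIPE", 14: "SIGALRM", 15: "SIGTERM",
    16: "SIGUSR1", 17: "SIGUSR2", 18: "SIGCHLD", 19: "SIGPWR", 20: "SIGWINCH",
    21: "SIGURG", 22: "SIGPOLL", 23: "SIGSTOP", 24: "SIGTSTP", 25: "SIGCONT",
}


def decode_sigpend(text):
    """Turn a pflags-style 'sigpend' string into a list of signal numbers.

    Assumption: each comma-separated word is 32 bits wide, and bit b of
    word i represents signal number 32*i + b + 1.
    """
    signals = []
    for index, word in enumerate(text.split(",")):
        value = int(word, 16)
        for bit in range(32):
            if value & (1 << bit):
                signals.append(32 * index + bit + 1)
    return signals


pending = decode_sigpend("0x0040c101,0x00000000")
names = [SOLARIS_SIGNALS.get(sig, "signal %d" % sig) for sig in pending]
print(pending)  # -> [1, 9, 15, 16, 23]
print(names)    # -> ['SIGHUP', 'SIGKILL', 'SIGTERM', 'SIGUSR1', 'SIGSTOP']
```

If that mapping holds, both SIGTERM and SIGKILL are sitting in the pending set, which would match the two kill attempts above: the signals were posted but the process never got far enough to act on them.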

[globalzone:/] root# pfiles 15389
pfiles: unanticipated system error: 15389

[globalzone:/] root# pldd 15389
pldd: cannot examine 15389: unanticipated system error


[globalzone:/] root# kill -9 13404
[globalzone:/] root# zoneadm list -vc
  ID NAME             STATUS         PATH                          
   0 global           running        /                             
  64 zone123         shutting_down  /zones/zone123  
...

[globalzone:/] root# ps -efZ | grep zone123
zone123     root 14220     1   0   Feb 22 ?           0:00 zsched
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932     1   0 19:51:29 ?           0:00 ruby ./script/backgroundrb start

[globalzone:/] root# kill -9 14220
[globalzone:/] root# ps -efZ | grep zone123
zone123     root 14220     1   0   Feb 22 ?           0:00 zsched
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932     1   0 19:51:29 ?           0:00 ruby ./script/backgroundrb start

[globalzone:/] root# ps -efZ | grep zone123
zone123     root 14220     1   0   Feb 22 ?           0:00 zsched
zone123   nobody 15389     1   0 19:46:26 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
zone123 0000113 15932     1   0 19:51:29 ?           0:00 ruby ./script/backgroundrb start


[globalzone:/] root# dtrace -n profile-1234hz'/pid == 15932/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# dtrace -n profile-1234hz'/pid == 14220/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# dtrace -n profile-1234hz'/pid == 15389/[EMAIL PROTECTED]()] = count()}'
dtrace: description 'profile-1234hz' matched 1 probe
^C

[globalzone:/] root# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc 
pcplusmp ufs ip sctp usba lofs zfs random crypto ptm fcip fctl md fcp logindmux 
nfs ipc ]
> ::zone
            ADDR     ID NAME                 PATH
fffffffffbcb9420      0 global               /
ffffff691158d1c0     64 zone123             /zones/zone123/root/
...
> ::ps -z 
S    PID   PPID   PGID    SID  ZONE    UID      FLAGS             ADDR NAME

R  15932      1  24851  24851    64    113 0x42030d00 ffffff692a3b6138 ruby
R  15389      1  11756  11756    64  60001 0x52020d00 fffffe81b21dc0c0 httpd

> ffffff692a3b6138::stack
> fffffe81b21dc0c0::stack
0xfffffe81015d2378()

> ffffff692a3b6138::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
ffffff692a3b6138 run      ffbf ffff    0     0     0   0          8060938

> fffffe81b21dc0c0::thread 
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
fffffe81b21dc0c0 run      ff69 ffff    0     0     0   0          80accf2

> ::status
debugging live kernel (64-bit) on globalzone
operating system: 5.11 snv_43 (i86pc)

> ffffff692a3b6138::kill
mdb: command is not supported by current target
> fffffe81b21dc0c0::kill
mdb: command is not supported by current target

> ffffff692a3b6138::pte 
PTE=ffffff692a3b6138: noexec page=0xffff692a3b6 global uncached !VALID 
> fffffe81b21dc0c0::pte 
PTE=fffffe81b21dc0c0: noexec page=0xfffe81b21dc wrback !VALID 

> fffffe81b21dc0c0::whatis
fffffe81b21dc0c0 is fffffe81b21dc0c0+0, allocated from process_cache
> ffffff692a3b6138::whatis
ffffff692a3b6138 is ffffff692a3b6138+0, allocated from process_cache



Ideas are appreciated.

benr.
 
 
This message posted from opensolaris.org