I am creating an Immutable Service Container[1] demonstration for an
event in January, and as part of a Self-Cleansing[2] demo I need to
start and stop a zone quite a few times. During the course of my
testing, I have been able to repeatedly get zoneadm to hang.
Since I am working with a highly customized configuration, I started
over with a default zone on OpenSolaris (b127) and was able to
reproduce the issue there. To reproduce this problem, run the
following loop after creating a zone using the normal/default steps:
isc...@osol-isc:~$ while : ; do
> echo "`date`: ZONE BOOT"
> pfexec zoneadm -z test boot
> sleep 30
> pfexec zoneadm -z test halt
> echo "`date`: ZONE HALT"
> sleep 10
> done
This loop works just fine for a while, but eventually zoneadm hangs
(it was at pass #90 in my last test). When this happens, prstat shows
zoneadm consuming quite a bit of CPU:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
16598 root 11M 3140K run 1 0 0:54:49 74% zoneadm/1
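To record the failing pass without watching the loop, a variant along
the following lines could be used. This is only a sketch under stated
assumptions: run_with_deadline and zone_cycle_loop are hypothetical
helper names, and the 300-second deadline is an arbitrary value rather
than anything measured. On a timeout the command is deliberately left
running so that pstack and truss can still be attached to it.

```shell
#!/bin/sh
# run_with_deadline SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND in a background subshell that writes the exit status to a
# temporary file when COMMAND finishes, and polls for that file once per
# second. Returns COMMAND's exit status, or 124 if SECONDS elapse first.
# On timeout the command is NOT killed and the status file is left behind.
run_with_deadline() {
    rwd_limit=$1
    shift
    rwd_status=$(mktemp) || return 125
    ( rwd_rc=0; "$@" || rwd_rc=$?; echo "$rwd_rc" > "$rwd_status" ) &
    rwd_waited=0
    while [ ! -s "$rwd_status" ]; do
        if [ "$rwd_waited" -ge "$rwd_limit" ]; then
            return 124
        fi
        sleep 1
        rwd_waited=$((rwd_waited + 1))
    done
    rwd_rc=$(cat "$rwd_status")
    rm -f "$rwd_status"
    return "$rwd_rc"
}

# zone_cycle_loop ZONE   (hypothetical helper)
# Same boot/halt cycle as the original loop, but it counts passes and
# stops at the first zoneadm call that misses the assumed 300s deadline.
zone_cycle_loop() {
    zcl_zone=$1
    zcl_pass=0
    while : ; do
        zcl_pass=$((zcl_pass + 1))
        for zcl_cmd in boot halt; do
            echo "$(date): pass $zcl_pass: ZONE $zcl_cmd"
            zcl_rc=0
            run_with_deadline 300 pfexec zoneadm -z "$zcl_zone" "$zcl_cmd" || zcl_rc=$?
            if [ "$zcl_rc" -eq 124 ]; then
                echo "$(date): pass $zcl_pass: $zcl_cmd still running after 300s" >&2
                return 1
            fi
            # Pauses taken from the original loop: 30s after boot, 10s after halt.
            if [ "$zcl_cmd" = boot ]; then sleep 30; else sleep 10; fi
        done
    done
}
```

After sourcing the file, `zone_cycle_loop test` would run the cycle
against the zone named test and return once a boot or halt fails to
complete within the deadline.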
A stack trace of zoneadm shows:
isc...@osol-isc:~$ pfexec pstack `pgrep zoneadm`
16082: zoneadmd -z test
----------------- lwp# 1 --------------------------------
----------------- lwp# 2 --------------------------------
feef41c6 door (0, 0, 0, 0, 0, 8)
feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
feeee3f3 _thrp_setup (fe5b0a00) + 9b
feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
----------------- lwp# 3 --------------------------------
feef420f __door_return () + 2f
----------------- lwp# 4 --------------------------------
feef420f door (0, 0, 0, fe140e00, f5f00, a)
feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
feeee3f3 _thrp_setup (fe5b1a00) + 9b
feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
16598: zoneadm -z test boot
feef3fc8 door (6, 80476d0, 0, 0, 0, 3)
feede653 door_call (6, 80476d0, 400, fe3d43f7) + 7b
fe3d44f0 zonecfg_call_zoneadmd (8047e33, 8047730, 8078448, 1) + 124
0805792d boot_func (0, 8047d74, 100, 805ff0b) + 1cd
08060125 main (4, 8047d64, 8047d78, 805570f) + 2b9
0805576d _start (4, 8047e28, 8047e30, 8047e33, 8047e38, 0) + 7d
A stack trace of zoneadmd shows:
isc...@osol-isc:~$ pfexec pstack `pgrep zoneadmd`
16082: zoneadmd -z test
----------------- lwp# 1 --------------------------------
----------------- lwp# 2 --------------------------------
feef41c6 door (0, 0, 0, 0, 0, 8)
feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
feeee3f3 _thrp_setup (fe5b0a00) + 9b
feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
----------------- lwp# 3 --------------------------------
feef4147 __door_ucred (80a37c8, fef81000, fe23e838, feed9cfe) + 27
feed9d0d door_ucred (fe23f870, 1000, 0, 0) + 32
08058a88 server (0, fe23f8f0, 510, 0, 0, 8058a04) + 84
feef4240 __door_return () + 60
----------------- lwp# 4 --------------------------------
feef420f door (0, 0, 0, fe140e00, f5f00, a)
feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
feeee3f3 _thrp_setup (fe5b1a00) + 9b
feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
A truss of zoneadm (-f -vall -wall -tall) shows the following sequence
looping:
16598: door_call(6, 0x080476D0) = 0
16598: data_ptr=8047730 data_size=0
16598: desc_ptr=0x0 desc_num=0
16598: rbuf=0x807F2D8 rsize=4096
16598: close(6) = 0
16598: mkdir("/var/run/zones", 0700) Err#17 EEXIST
16598: chmod("/var/run/zones", 0700) = 0
16598: open("/var/run/zones/test.zoneadm.lock", O_RDWR|O_CREAT, 0600) = 6
16598: fcntl(6, F_SETLKW, 0x08046DC0) = 0
16598: typ=F_WRLCK whence=SEEK_SET start=0 len=0 sys=4277003009 pid=6
16598: open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 7
16598: door_info(7, 0x08047230) = 0
16598: target=16082 proc=0x8058A04 data=0x0
16598: attributes=DOOR_UNREF|DOOR_REFUSE_DESC|DOOR_NO_CANCEL
16598: uniquifier=26426
16598: close(7) = 0
16598: close(6) = 0
16598: open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 6
16082/3: door_return(0x00000000, 0, 0x00000000, 0xFE23FE00, 1007360) = 0
16082/3: door_ucred(0x080A37C8) = 0
16082/3: euid=0 egid=0
16082/3: ruid=0 rgid=0
16082/3: pid=16598 zoneid=0
16082/3: E: all
16082/3: I: basic
16082/3: P: all
16082/3: L: all
PID 16598 is zoneadm and PID 16082 is zoneadmd.
Is this a known issue? Are there any other things that I can do to
help debug this situation? Once things get into this state, I have
only been able to recover by rebooting the zone.
Please advise.
g
[1] http://kenai.com/projects/isc/pages/OpenSolaris
[2] http://kenai.com/attachments/wiki_images/isc/isc-autonomic-cleansing-time-v1.3.png
_______________________________________________
zones-discuss mailing list
zones-discuss@opensolaris.org