Do you have the panic message or crash dump?

-Steve L.
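[Editorial note: the following session is not part of the thread; it is a sketch of how one might check for a saved crash dump on OpenSolaris and recover the panic message with mdb. The dump file instance number `0` and the savecore path are assumptions; check `dumpadm` output for the actual location.]

```
# Where are dumps configured to be saved? (typically /var/crash/<hostname>)
pfexec dumpadm

# If unix.0/vmcore.0 were saved, open them and inspect the panic state.
cd /var/crash/`hostname`
pfexec mdb -k unix.0 vmcore.0
> ::status        # prints the panic message, if the system panicked
> ::msgbuf        # last console messages before the dump
> ::stack         # kernel stack at the time of panic
> $q
```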
On Wed, Dec 23, 2009 at 09:26:17AM -0500, Glenn Brunette wrote:
> Frank,
>
> Just verified that something is still wrong in b129, but the problem is
> _not_ with a vanilla configuration. This time, around boot/halt #102,
> the system apparently shut down or panicked. I was running it overnight
> and came in to a system that had been rebooted. I did not see any
> problem in the audit log or in /var/adm/messages. Any pointers?
>
> I am running an Immutable Service Container configuration, based upon
> the installation steps at:
>
>   http://kenai.com/projects/isc/pages/OpenSolaris
>
> Specifically:
>
>   pfexec pkg install SUNWmercurial
>   hg clone https://kenai.com/hg/isc~source isc
>   pfexec isc/bin/iscadm.ksh -N 0
>   pfexec bootadm update-archive
>   pfexec shutdown -g 0 -i 0 -y
>   [after reboot]
>   zlogin -C isc1
>   [wait for zone isc1 to fully complete its boot process]
>
> and then run the script that I provided that stops and starts the zone.
>
> Apparently, there must be something wrong with the interaction of
> components. In this configuration, we have resource controls, auditing,
> IP Filter/IP NAT, and zones all enabled.
>
> Would it be possible for you to try the steps above on a fresh install
> of 2009.06 or later (b129 is where I am right now)? Also, if you have
> other debugging methods, please let me know.
>
> I am going to kick this off again to see if I can catch any error
> messages.
> g
>
> On 12/16/09 3:49 AM, Frank Batschulat (Home) wrote:
>> Glenn, I've not been able to reproduce this on onnv build 126 (it's
>> been running for a day now).
>>
>> If that script reproduced 6894901 straight away, it should be doing so
>> on 126 as well (similar to what you've seen in 127).
>>
>> This poses the question whether there are some other details in your
>> environment that I don't have, or whether that script really reliably
>> reproduces 6894901.
>>
>> cheers
>> frankB
>>
>> On Tue, 15 Dec 2009 15:23:06 +0100, Frank Batschulat
>> (Home) <frank.batschu...@sun.com> wrote:
>>
>>> Glenn, I've been running this test case now for nearly a day on build
>>> 129 and couldn't reproduce it at all. There is a good chance this was
>>> indeed fixed by 6894901 in build 128.
>>>
>>> I'll also try to reproduce this now on build 126.
>>>
>>> cheers
>>> frankB
>>>
>>> On Fri, 11 Dec 2009 21:48:52 +0100, Glenn
>>> Brunette <glenn.brune...@sun.com> wrote:
>>>>
>>>> As part of an Immutable Service Container [1] demonstration that I am
>>>> creating for an event in January, I need to start/stop a zone quite a
>>>> few times (as part of a Self-Cleansing [2] demo). During the course
>>>> of my testing, I have been able to repeatedly get zoneadm to hang.
>>>>
>>>> Since I am working with a highly customized configuration, I started
>>>> over with a default zone on OpenSolaris (b127) and was able to repeat
>>>> this issue. To reproduce this problem, use the following script after
>>>> creating a zone using the normal/default steps:
>>>>
>>>> isc...@osol-isc:~$ while : ; do
>>>> > echo "`date`: ZONE BOOT"
>>>> > pfexec zoneadm -z test boot
>>>> > sleep 30
>>>> > pfexec zoneadm -z test halt
>>>> > echo "`date`: ZONE HALT"
>>>> > sleep 10
>>>> > done
>>>>
>>>> This script works just fine for a while, but eventually zoneadm hangs
>>>> (it was at pass #90 in my last test).
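[Editorial note: the fixed `sleep 30` in the loop above races against the zone's actual boot progress. A sketch (not from the thread) of polling the zone's state instead; `get_zone_state` is a stub standing in for `zoneadm -z test list -p | cut -d: -f3`, so the sketch can run anywhere.]

```shell
#!/bin/sh
# Poll for a desired zone state instead of sleeping a fixed interval.
# On a real system, get_zone_state would be:
#   zoneadm -z test list -p | cut -d: -f3
# It is stubbed here so this sketch is self-contained.
get_zone_state() { echo running; }

# wait_for_state STATE TRIES: succeed once the zone reports STATE,
# checking once per second, or fail after TRIES attempts.
wait_for_state() {
    want=$1 tries=$2
    while [ "$tries" -gt 0 ]; do
        [ "`get_zone_state`" = "$want" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

if wait_for_state running 30; then
    echo "zone reached state: running"
else
    echo "timed out waiting for zone"
fi
```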
>>>> When this happens, zoneadm is shown to be consuming quite a bit of CPU:
>>>>
>>>>   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>>>> 16598 root       11M 3140K run      1    0   0:54:49  74% zoneadm/1
>>>>
>>>> A stack trace of zoneadm shows:
>>>>
>>>> isc...@osol-isc:~$ pfexec pstack `pgrep zoneadm`
>>>> 16082:  zoneadmd -z test
>>>> ----------------- lwp# 1 --------------------------------
>>>> ----------------- lwp# 2 --------------------------------
>>>>  feef41c6 door     (0, 0, 0, 0, 0, 8)
>>>>  feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
>>>>  feeee3f3 _thrp_setup (fe5b0a00) + 9b
>>>>  feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
>>>> ----------------- lwp# 3 --------------------------------
>>>>  feef420f __door_return () + 2f
>>>> ----------------- lwp# 4 --------------------------------
>>>>  feef420f door     (0, 0, 0, fe140e00, f5f00, a)
>>>>  feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
>>>>  feeee3f3 _thrp_setup (fe5b1a00) + 9b
>>>>  feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
>>>> 16598:  zoneadm -z test boot
>>>>  feef3fc8 door     (6, 80476d0, 0, 0, 0, 3)
>>>>  feede653 door_call (6, 80476d0, 400, fe3d43f7) + 7b
>>>>  fe3d44f0 zonecfg_call_zoneadmd (8047e33, 8047730, 8078448, 1) + 124
>>>>  0805792d boot_func (0, 8047d74, 100, 805ff0b) + 1cd
>>>>  08060125 main     (4, 8047d64, 8047d78, 805570f) + 2b9
>>>>  0805576d _start   (4, 8047e28, 8047e30, 8047e33, 8047e38, 0) + 7d
>>>>
>>>> A stack trace of zoneadmd shows:
>>>>
>>>> isc...@osol-isc:~$ pfexec pstack `pgrep zoneadmd`
>>>> 16082:  zoneadmd -z test
>>>> ----------------- lwp# 1 --------------------------------
>>>> ----------------- lwp# 2 --------------------------------
>>>>  feef41c6 door     (0, 0, 0, 0, 0, 8)
>>>>  feed99f7 door_unref_func (3ed2, fef81000, fe33efe8, feeee39e) + 67
>>>>  feeee3f3 _thrp_setup (fe5b0a00) + 9b
>>>>  feeee680 _lwp_start (fe5b0a00, 0, 0, 0, 0, 0)
>>>> ----------------- lwp# 3 --------------------------------
>>>>  feef4147 __door_ucred (80a37c8, fef81000, fe23e838, feed9cfe) + 27
>>>>  feed9d0d door_ucred (fe23f870, 1000, 0, 0) + 32
>>>>  08058a88 server   (0, fe23f8f0, 510, 0, 0, 8058a04) + 84
>>>>  feef4240 __door_return () + 60
>>>> ----------------- lwp# 4 --------------------------------
>>>>  feef420f door     (0, 0, 0, fe140e00, f5f00, a)
>>>>  feed9f57 door_create_func (0, fef81000, fe140fe8, feeee39e) + 2f
>>>>  feeee3f3 _thrp_setup (fe5b1a00) + 9b
>>>>  feeee680 _lwp_start (fe5b1a00, 0, 0, 0, 0, 0)
>>>>
>>>> A truss of zoneadm (-f -vall -wall -tall) shows this looping:
>>>>
>>>> 16598:  door_call(6, 0x080476D0)                        = 0
>>>> 16598:          data_ptr=8047730 data_size=0
>>>> 16598:          desc_ptr=0x0 desc_num=0
>>>> 16598:          rbuf=0x807F2D8 rsize=4096
>>>> 16598:  close(6)                                        = 0
>>>> 16598:  mkdir("/var/run/zones", 0700)                   Err#17 EEXIST
>>>> 16598:  chmod("/var/run/zones", 0700)                   = 0
>>>> 16598:  open("/var/run/zones/test.zoneadm.lock", O_RDWR|O_CREAT, 0600) = 6
>>>> 16598:  fcntl(6, F_SETLKW, 0x08046DC0)                  = 0
>>>> 16598:          typ=F_WRLCK  whence=SEEK_SET start=0 len=0
>>>> 16598:          sys=4277003009 pid=6
>>>> 16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 7
>>>> 16598:  door_info(7, 0x08047230)                        = 0
>>>> 16598:          target=16082 proc=0x8058A04 data=0x0
>>>> 16598:          attributes=DOOR_UNREF|DOOR_REFUSE_DESC|DOOR_NO_CANCEL
>>>> 16598:          uniquifier=26426
>>>> 16598:  close(7)                                        = 0
>>>> 16598:  close(6)                                        = 0
>>>> 16598:  open("/var/run/zones/test.zoneadmd_door", O_RDONLY) = 6
>>>> 16082/3: door_return(0x00000000, 0, 0x00000000, 0xFE23FE00, 1007360) = 0
>>>> 16082/3: door_ucred(0x080A37C8)                         = 0
>>>> 16082/3:         euid=0 egid=0
>>>> 16082/3:         ruid=0 rgid=0
>>>> 16082/3:         pid=16598 zoneid=0
>>>> 16082/3:         E: all
>>>> 16082/3:         I: basic
>>>> 16082/3:         P: all
>>>> 16082/3:         L: all
>>>>
>>>> PID 16598 is zoneadm and PID 16082 is zoneadmd.
>>>>
>>>> Is this a known issue? Are there any other things that I can do to
>>>> help debug this situation? Once things get into this state, I have
>>>> only been able to recover by rebooting the zone.
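[Editorial note: once zoneadm wedges like this, each pass of the loop blocks in door_call() indefinitely. A sketch (not from the thread; assumes a `timeout` utility such as GNU coreutils') of wrapping each invocation so the loop detects the hang and can capture diagnostics while the process is still stuck. `sleep 10` stands in for the hanging `pfexec zoneadm -z test boot`.]

```shell
#!/bin/sh
# Detect a hung zoneadm instead of blocking forever.
# In the real loop, the guarded command would be:
#   pfexec zoneadm -z test boot
# `sleep 10` simulates a zoneadm stuck in door_call().
if timeout 2 sleep 10; then
    echo "boot completed"
else
    echo "boot hung after 2s; capturing diagnostics"
    # On the real system, one might grab state before recovering, e.g.:
    #   pfexec pstack `pgrep -x zoneadm`
    #   pfexec pstack `pgrep -x zoneadmd`
fi
```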
>>>>
>>>> Please advise.
>>>>
>>>> g
>>>>
>>>> [1] http://kenai.com/projects/isc/pages/OpenSolaris
>>>> [2] http://kenai.com/attachments/wiki_images/isc/isc-autonomic-cleansing-time-v1.3.png
>>>>
>>>> _______________________________________________
>>>> zones-discuss mailing list
>>>> zones-discuss@opensolaris.org