zVM crash update

Burton, Randy Wed, 15 Jun 2011 06:53:24 -0700

The LPAR had been up and running for weeks, minding its own business
and chugging right along.  Then, at 3:30 PM Monday, kaboom: disabled wait.



IBM has determined that a tight loop condition caused the processor to
be taken offline.  We did get an MCW002 abend and a dump.  Dump analysis
should lead IBM and us to a fix for the loop, so that it doesn't happen
again.

Best theory so far is that zVM tried to restart following the MCW002
abend and couldn't find a console, hence the 1010 disabled wait.
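
For anyone who wants to check how the "1010" falls out of the PSW quoted
in the original message below, here is a rough Python sketch.  It assumes
the standard 128-bit z/Architecture PSW layout (bit 0 is the leftmost bit;
bits 6, 7 and 13 are the I/O, external and machine-check masks; bit 14 is
the wait bit; bits 64-127 are the instruction address) and the usual
convention that the low-order digits of the address carry the wait-state
code.  The function and field names are made up for illustration and are
not part of any IBM tool.

```python
def decode_psw(psw_hex: str) -> dict:
    """Pull a few fields out of a 128-bit z/Architecture PSW given as hex."""
    digits = psw_hex.replace(" ", "")
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits (128 bits)")
    value = int(digits, 16)

    def bit(n: int) -> int:
        # PSW bits are numbered from the left, so bit 0 is the most significant.
        return (value >> (127 - n)) & 1

    fields = {
        "io_mask": bit(6),
        "external_mask": bit(7),
        "machine_check_mask": bit(13),
        "wait": bit(14),
        # Bits 64-127: instruction address; by convention the wait-state code.
        "instruction_address": value & ((1 << 64) - 1),
    }
    # A disabled wait: wait bit on, with no interrupt class enabled to end it.
    fields["disabled_wait"] = bool(
        fields["wait"]
        and not (fields["io_mask"]
                 or fields["external_mask"]
                 or fields["machine_check_mask"])
    )
    return fields


if __name__ == "__main__":
    psw = decode_psw("00020000000000000000000000001010")
    print("disabled wait:", psw["disabled_wait"])
    print("wait-state code:", format(psw["instruction_address"], "04X"))
```

Run against our PSW, this reports a disabled wait with code 1010, which
matches what the HMC showed.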

Thanks for all the suggestions!



-----Original Message-----
From: Burton, Randy 
Sent: Tuesday, June 14, 2011 9:47 AM
To: 'IBMVM@LISTSERV.UARK.EDU'
Subject: zVM crash

I'm curious if this error rings a bell with any of you.  We of course
have an ETR open and are working with IBM.  No hardware errors on the
HMC, so we believe this was software and not hardware.  Here's the last
operator log message before the LPAR went into a disabled wait:

HCPMPG9152E PROCESSOR 01 IS BEING VARIED OFFLINE BECAUSE IT IS NOT RESPONSIVE.

Disabled wait PSW was:
00020000000000000000000000001010

HMC message was:
Central processor (CP) 0 in partition VMD1, entered disabled wait state.

Fortunately this was our development (test) zVM system, running a bunch
of test zLinux guests.  We're running zVM 6.1 on a z10.  Of course we
are nervous because what happens in test can happen in production.  We
IPLed and so far so good.

Thanks in advance for any help/suggestions!

Randy Burton
BB&T Bank
