>>> On Tue, Oct 30, 2007 at 7:15 PM, [EMAIL PROTECTED] wrote:
> Thanks, Robert, but we already went down this route.  As I explained, we know
> it was a hardware hit to a Brocade device.  That really wasn't my question.
> We were told the CHPIDs came back within a few minutes and VM stayed up, but
> the zLinux guests on that VM system crashed; those, of course, were the ones
> involved with the CHPIDs taking the errors.  I'm just trying to find out: is
> this normal?  Are there no

As just about every response on mailing lists starts with: it depends...  On
which OS was using a particular device, and for what.  If it had been one of
z/VM's paging packs involved, things could have gotten ugly.  (Not necessarily
terminal, but certainly a little scary.)  If it was one of Linux's application
data volumes, more than likely Linux would have stayed up while the
application died.  There's no hard and fast rule here.

> timeout values for zLinux to wait before going down hard if it can't get to
> root, let's say, once it dies?  It sounds that way, but I want to see if
> there is anything we can do to prevent this other than the obvious: make
> sure we don't take hardware hits ;)

The Linux DASD device drivers have a fair amount of their own error recovery
code in them.  Probably not as good as z/VM's (I'm in no position to judge
that), but Linux doesn't just fall over at the first I/O error, either, since
it also has to be able to run in an LPAR, where it can't count on z/VM doing
error recovery for it.
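(If you want to see how Linux's side of that is doing, /proc/dasd/devices is
worth a look; nothing fancy, just:

    cat /proc/dasd/devices

It lists each DASD device the driver knows about along with its current
status, which can be handy after an incident like yours when you're trying to
work out whether Linux ever considered the devices to have come back.)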

It's usually going to be something fairly serious that causes a Linux system to 
crash.  For example, I've been following an internal mailing list thread that 
was talking about a customer's midrange SLES system having the root file system 
get re-mounted as read-only.  Various people confirmed that if Linux 
experiences "non temporary" errors writing to a file system, even "/", it will 
re-mount the file system as read-only in an effort to prevent any (further) 
data corruption on that file system.  If Linux is no longer able to even _read_ 
things from a file system that it needs to keep running, then yeah, your system 
is likely to throw a kernel panic and die.
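For what it's worth, on ext2/ext3 that behavior is controlled by the
"errors=" mount option (continue, remount-ro, or panic), either per mount in
/etc/fstab or as a default stored in the superblock with tune2fs.  A minimal
sketch, with the device name being purely illustrative:

    # /etc/fstab -- remount "/" read-only on errors instead of carrying on
    /dev/dasda1   /   ext3   errors=remount-ro   1 1

    # or set the same default in the superblock itself
    tune2fs -e remount-ro /dev/dasda1

If you'd rather the system panic immediately (say, so an HA cluster can take
over), "errors=panic" does that instead.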

In your case, it sounds like you had something important to Linux go away for a 
long while (a "few minutes" is an eternity when you're talking about computers 
and I/O), and z/VM wasn't depending on any of those devices for its own 
continued functioning.  Just be glad it wasn't the other way around.  :)

The thing you'll want to look at is redundancy everywhere: in your paths to
the switches (plural!), from the switches to the storage arrays (plural!), and
so on.  If an application is important enough, you also need to be looking at
High Availability clustering techniques.  With mainframe hardware, simply
eliminating single points of failure gets you most of the way there.
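If those Brocade-attached disks are FCP/SCSI LUNs rather than ECKD DASD, the
Linux-side half of that redundancy is usually device-mapper multipath, so that
each LUN stays reachable through more than one CHPID and more than one switch.
A minimal /etc/multipath.conf sketch, purely illustrative and not tuned for
any particular storage array:

    defaults {
        user_friendly_names  yes
        path_grouping_policy multibus
        failback             immediate
        no_path_retry        5
    }

That way losing one path degrades performance for a while instead of taking
the file systems away from Linux.  (For ECKD DASD, the channel subsystem
handles path failover for you, so there it's mostly a matter of making sure
the devices really are defined with multiple CHPIDs through more than one
switch.)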


Mark Post
