On Wed, 5 Mar 2025 18:01:01 -0500, Phil Smith III <[email protected]> wrote:
>Rupert Reynolds wrote about taking down a system by compressing a PDS. What
>stories can y'all share about times you or someone you worked with took down a
>system in a way that made you SMH afterward?
Stupid outage I *fixed*, twice:
In the early days of VMLINK, it had a bug that would send it into a tight loop
if nicknames contained a circular reference in their :list tags. Our
OfficeVision support team managed to put out an untested update with something
like:
:nick.OFFICE
:list.OV
:nick.OV
:userid.SYSADMIN
:addr.399
:list.OFFICE
As soon as people started logging on, every single system froze up with 100%
CPU usage. I owned the server that managed the disk they put it on, and I was
the one to think of a way to corrupt the file by overwriting it in place, so I
got to spend the whole day waiting for logons to process and wrecking the file
to free up each system, then restoring the old version.
I added the NAMES files, on a special fixed weekly schedule, to our automated
test process, which staged updates on a test disk for a week before putting
them in production. A few months later, they *removed* a nickname from a file,
exposing the same problem that it had masked in *another* file. Sadly, the
test didn't cover that case, because:
* VMLINK uses *all* matching NAMES files when filemode * is specified
in the CONTROL file, not just the first in the search, and
* the test disk was accessed ahead of the production disk, not in place of it.
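
To illustrate the two points above, here is a minimal sketch of a check that
only catches this kind of loop when every NAMES file in the search is merged
before testing. It is not VMLINK's actual logic. It assumes a simplified
one-tag-per-line layout like the example earlier in this post, and it assumes
that entries for the same nickname in different files simply combine. The
function names (parse_names, merge, find_cycle) and the split of the example
across PROD_DISK and TEST_DISK are hypothetical.

```python
from collections import defaultdict


def parse_names(text):
    """Map each nickname to the names in its :list tags.

    Assumes a simplified layout with one ':tag.value' per line.
    """
    lists = defaultdict(list)
    nick = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":") or "." not in line:
            continue
        tag, _, value = line[1:].partition(".")
        tag = tag.lower()
        if tag == "nick":
            nick = value.strip().upper()
            lists[nick]  # register the nickname even if it has no :list
        elif tag == "list" and nick is not None:
            lists[nick].extend(name.upper() for name in value.split())
    return lists


def merge(*parsed):
    """Combine entries from several files (assumed behaviour: simple union)."""
    merged = defaultdict(list)
    for lists in parsed:
        for nick, members in lists.items():
            merged[nick].extend(members)
    return merged


def find_cycle(lists):
    """Return one circular chain of nicknames as a list, or None."""
    in_progress, done = set(), set()
    path = []

    def visit(nick):
        in_progress.add(nick)
        path.append(nick)
        for member in lists.get(nick, ()):
            if member not in lists:
                continue  # not a nickname, e.g. a plain userid
            if member in in_progress:
                return path[path.index(member):] + [member]
            if member not in done:
                found = visit(member)
                if found:
                    return found
        path.pop()
        in_progress.discard(nick)
        done.add(nick)
        return None

    for nick in list(lists):
        if nick not in done:
            found = visit(nick)
            if found:
                return found
    return None


# Hypothetical split of the example: each file is harmless on its own.
PROD_DISK = ":nick.OFFICE\n:list.OV\n"
TEST_DISK = ":nick.OV\n:userid.SYSADMIN\n:addr.399\n:list.OFFICE\n"
```

Checking TEST_DISK alone finds nothing, because OFFICE is not a nickname in
that file. Checking merge(parse_names(PROD_DISK), parse_names(TEST_DISK))
reports the OFFICE, OV, OFFICE chain, which is the case a test that looks at
one file at a time would miss.
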
I think there was some unrelated update to the file that became urgent, and I
promoted it for them manually in the afternoon. IIRC the bad file was actually
on the Y-disk. Luckily, somebody recognized the symptom as soon as one system
froze up *and* could get me onto an ID with write access to the Y-disk, and I
was able to jump in and corrupt them all within a few minutes and not spend all
night at work.
After that, I set up a convoluted process using an altered VMLINK CONTROL and
renamed NAMES files in test, which has been the bane of my successors for the
past two decades. I keep offering to help them undo it, since there are hardly
any updates to worry about anymore and that bug is long gone, but they're
scared to touch it.
¬R