Re: TSM DB Corruption and DB Recovery - what happened ? (a saga, not a short story)

2002-04-19 Thread Prather, Wanda

I'm curious if Tivoli support gave you any idea of what TYPE of corruption
this error message indicates, and what they said about the DB mirror.  With
MIRRORWRITEDB=SEQUENTIAL, you could (at least in theory) have had corruption
in the primary copy, but not the mirror copy.  (Or maybe TSM doesn't issue
the message until it has tried both copies already?)

I would have tried to remove the primary copy of the DB from the
configuration, and come up with just the mirror, just to see if the result
was any different...
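
To make the mirror point concrete, here is a toy model of why sequential mirror writes could, in theory, leave the mirror copy clean when the primary is damaged. This is only a sketch and not how TSM implements mirroring internally: the class, the method names and the "torn write" behaviour are all hypothetical, chosen just to show the ordering argument.

```python
# Toy model only: NOT TSM internals. MirroredVolume, write_page and
# crash_midway are hypothetical names used purely for illustration.

class MirroredVolume:
    """Two copies of a page store, written sequentially or in parallel."""

    def __init__(self, mode):
        assert mode in ("sequential", "parallel")
        self.mode = mode
        self.primary = {}
        self.mirror = {}

    def write_page(self, page_no, data, crash_midway=False):
        """Write one page to both copies.

        crash_midway simulates a failure that tears any in-flight
        write, so only the first half of the data lands on disk.
        """
        torn = data[: len(data) // 2]
        if self.mode == "sequential":
            # The mirror is touched only after the primary write is done,
            # so a crash during the primary write leaves the mirror alone.
            if crash_midway:
                self.primary[page_no] = torn
                return
            self.primary[page_no] = data
            self.mirror[page_no] = data
        else:
            # Both writes are in flight together, so both can be torn.
            if crash_midway:
                self.primary[page_no] = torn
                self.mirror[page_no] = torn
                return
            self.primary[page_no] = data
            self.mirror[page_no] = data


def demo(mode):
    """Return (primary, mirror) contents of page 7 after a torn write."""
    vol = MirroredVolume(mode)
    vol.write_page(7, "OLD-PAGE")                      # clean write
    vol.write_page(7, "NEW-PAGE", crash_midway=True)   # interrupted write
    return vol.primary[7], vol.mirror[7]


if __name__ == "__main__":
    print("sequential:", demo("sequential"))
    print("parallel:  ", demo("parallel"))
```

In this model the sequential case ends with a torn primary page and an intact (older) mirror page, while the parallel case tears both. Whether the real server would notice and fall back to the mirror is exactly the open question above.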




TSM DB Corruption and DB Recovery - what happened ? (a saga, not a short story)

2002-04-19 Thread Kent Monthei

Our TSM DB was corrupted last week.  Worse yet, it appears* that an
earlier DB Backup operation backed up a corrupted DB, but reported
"completed successfully".  Efforts to restore from that DB Backup failed
twice, in the end costing us 24 hours restore time.  We ended up having to
restore (rollback to) the prior-day DB Backup.  Between the 1-Day rollback
and 24 hours lost time performing 2 failed DB Restores, we lost 2 nights'
backups for 83 Unix Servers, including 1 night of otherwise-successful
backup processing.

(please review the "Summary of Events" appended to this email before
reading on)

We run TSM Server 4.1.2.0 on Solaris 2.6.  Library is IBM 3494.  DB >50GB
& under 90% Util.  TSM DB & Log volumes are TSM-mirrored.  TSM Log =4GB &
rarely over 10% Util.  TSM LogMode=Normal.  MirrorWrite-Log=Parallel.
MirrorWrite-DB=sequential.  We do not perform automatic expiration or tape
reclamation, and no tape reclamation was performed on this or the prior
day.

We normally don't run client backups during our TSM DB Backups - but
that's just local practice & is a consequence of the sequence of our Daily
Task processing, not a policy.  I know (and Tivoli Support confirmed) that
TSM is designed to support concurrent client backup sessions and DB Backup
processes (how else could you run 24x7 backups?), and we have previously
allowed certain long-running backups to run concurrently with DB Backups
without consequence.  Still, for us, a concurrently-running client backup
and DB Backup is an atypical event.

I would like experienced feedback on where things went awry in the
following series of events (appended below), and opinions as to what step may
have caused the DB corruption and subsequent apparently-corrupted DB
Backup (despite reporting successful completion).

* Tivoli Support asserted that:
 - in TSM 4.1 the DB Backup operation does not perform robust consistency
checking, so could backup a corrupt DB and still report success -
apparently an APAR exists on this.  Can anyone confirm this?
 - consistency checking has been improved & is more robust in TSM
4.1 and/or 5.1.  Can anyone confirm this?
 - it is impossible to incorporate thorough consistency-checking, as is
performed by Audit DB, in presumed-daily DB Backups because of the elapsed
time it requires.  On a >50GB DB such as ours, Tivoli asserted that Audit
DB would take over 50 hours (obviously not possible twice daily).   Can
anyone with similar configuration (>50GB TSM DB on Solaris) confirm this
time estimate for Audit DB?

I would also welcome any opinions/ideas on how to recover that
apparently-corrupted DB Backup (I'm still not 100% convinced it is).  We
want to do this on a test server to salvage the prior night's backup data
and also salvage the Activity Log for further review, if possible.  All
copypool and DB Backup tapes from that day were pulled/preserved.  I don't
think there's a way for us to re-incorporate the lost data back into our
production TSM backups easily - but brilliant ideas are welcome.  Even so,
we would at least regain the ability to access/restore the prior night's
backup data from those preserved copypool tapes, which would mitigate
potential service impact of this incident by 50% (just 1 day lost, not 2).

Are there any methods for restoring only selected parts of a TSM DB ?

- rsvp, thanks (experienced respondents only, please)

Kent Monthei
GlaxoSmithKline
_

Summary of Events:

1)  A rogue client backup started just before our morning DB Backup (1st
of 2 scheduled daily full DB Backups).  This was after a scripted check to
ensure that no client backups were running (none were) and just prior to
start of the DB Backup (we think).  The check for client sessions is not so
much to ensure nothing runs during DB Backup - it's to ensure that all
client data reaches the diskpool and then gets migrated or copied to tape
pools before the DB Backup starts.  Still, for us, the rogue client backup
was an atypical event.

2)  The DB Backup stalled immediately, sitting at 0 pages backed up for
over an hour, but TSM Services were not hung & did not fail.  The client
backup was progressing fine & pushing a lot of data.  To resolve the
stalled DB Backup, we cancelled the client backup session (no effect). We
then cancelled the DB Backup process (no effect - it entered & sat for an
hour in 'Cancel Pending' state).

3)  At that point, we decided to halt/restart the TSM Server process.
'dsmserv' came back up normally.

4)  We then repeated the 1st DB Backup, which then progressed normally in
the usual time and reported successful completion.  We continued with
Daily Task processing, which went smoothly up to the 2nd DB Backup.

5)  Almost immediately after startup (0 DB pages backed up), the 2nd DB
Backup process failed with a 'dballoc.c / SMP page mismatch /
initialization of DB page allocator failed' error (something to that
effect).

6)  We decided to halt/resta