I'm curious if Tivoli support gave you any idea of what TYPE of corruption this error message indicates, and what they said about the DB mirror. With MIRRORWRITEDB=SEQUENTIAL, you could (at least in theory) have had corruption in the primary copy, but not the mirror copy. (Or maybe TSM doesn't issue the message until it has tried both copies already?)
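For reference, the pieces I have in mind are the mirror-write options in dsmserv.opt and the DB volume copies themselves - roughly as below (the volume names are made up, and I'm going from memory of the 4.1 options and commands, so check the Administrator's Reference before acting on any of it):

   In dsmserv.opt - DB copies written one after the other, log copies in parallel:

      MIRRORWRITE DB SEQUENTIAL
      MIRRORWRITE LOG PARALLEL

   From dsmadmc, with the server up - list the DB volumes/copies and their sync
   status, then drop the suspect primary copy (the surviving copy should carry on):

      query dbvolume format=detailed
      delete dbvolume /tsm/db/dbvol01

With the server refusing to start you obviously can't issue the DELETE, so in practice it would probably mean taking the primary volume file out of the server's reach (rename or move it) and seeing whether dsmserv will come up on the surviving copy - something I'd only try against a scratch copy of the environment, not your only production DB.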
I would have tried to remove the primary copy of the DB from the configuration and come up with just the mirror, just to see if the result was any different...

-----Original Message-----
From: Kent Monthei [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 11:21 AM
To: [EMAIL PROTECTED]
Subject: TSM DB Corruption and DB Recovery - what happened? (a saga, not a short story)

Our TSM DB was corrupted last week. Worse yet, it appears* that an earlier DB Backup operation backed up a corrupted DB, yet reported "completed successfully". Efforts to restore from that DB Backup failed twice, in the end costing us 24 hours of restore time. We ended up having to restore (roll back to) the prior day's DB Backup. Between the one-day rollback and the 24 hours lost performing two failed DB Restores, we lost two nights' backups for 83 Unix servers, including one night of otherwise-successful backup processing. (Please review the "Summary of Events" appended to this email before reading on.)

We run TSM Server 4.1.2.0 on Solaris 2.6. The library is an IBM 3494. The DB is >50GB and under 90% utilized. TSM DB and Log volumes are TSM-mirrored. The TSM Log is 4GB and rarely over 10% utilized. TSM LogMode=Normal, MirrorWrite Log=Parallel, MirrorWrite DB=Sequential. We do not perform automatic expiration or tape reclamation, and no tape reclamation was performed on this or the prior day.

We normally don't run client backups during our TSM DB Backups - but that's just local practice, a consequence of the sequence of our Daily Task processing, not a policy. I know (and Tivoli Support confirmed) that TSM is designed to support concurrent client backup sessions and DB Backup processes (how else could you run 24x7 backups?), and we have previously allowed certain long-running backups to run concurrently with DB Backups without consequence. Still, for us, a client backup running concurrently with a DB Backup is an atypical event.

I would like experienced feedback on where things went awry in the series of events appended below, and opinions as to which step may have caused the DB corruption and the subsequent apparently-corrupted DB Backup (despite its reporting successful completion).

* Tivoli Support asserted that:
- in TSM 4.1, the DB Backup operation does not perform robust consistency checking, so it could back up a corrupt DB and still report success - apparently an APAR exists on this. Can anyone confirm this?
- consistency checking has been improved and is more robust in TSM 4.1 and/or 5.1. Can anyone confirm this?
- it is impossible to incorporate thorough consistency checking, as performed by Audit DB, into presumed-daily DB Backups because of the elapsed time it requires. On a >50GB DB such as ours, Tivoli asserted that Audit DB would take over 50 hours (obviously not possible twice daily). Can anyone with a similar configuration (>50GB TSM DB on Solaris) confirm this time estimate for Audit DB?

I would also welcome any opinions/ideas on how to recover that apparently-corrupted DB Backup (I'm still not 100% convinced it is corrupt). We want to do this on a test server to salvage the prior night's backup data and, if possible, salvage the Activity Log for further review (a rough sketch of what we have in mind follows below). All copypool and DB Backup tapes from that day were pulled and preserved. I don't think there's an easy way for us to re-incorporate the lost data back into our production TSM backups - but brilliant ideas are welcome.
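In case it helps anyone comment, the rough sequence we have in mind for the test-server experiment is something like the following (volume names, sizes and volsers are placeholders, and the syntax is from the 4.1 server utilities as best I recall, so corrections are welcome):

   # format fresh log and DB volumes on the test server, then initialize them
   ./dsmfmt -m -log /tsmtest/log/logvol1 4096
   ./dsmfmt -m -db  /tsmtest/db/dbvol1  60000
   ./dsmserv format 1 /tsmtest/log/logvol1 1 /tsmtest/db/dbvol1

   # load the suspect DB Backup directly from its tape volume(s), assuming the
   # test box can see a 3494 drive and has a copy of the production devconfig file
   ./dsmserv restore db devclass=3494CLASS volumenames=DB0001 commit=yes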
Even if we can't fold the lost data back into production, we would at least regain the ability to access/restore the prior night's backup data from those preserved copypool tapes, which would mitigate the potential service impact of this incident by 50% (just one day lost, not two). Are there any methods for restoring only selected parts of a TSM DB?

- rsvp, thanks (experienced respondents only, please)

Kent Monthei
GlaxoSmithKline

_________________________________________________________________

Summary of Events:

1) A rogue client backup started just before our morning DB Backup (the 1st of 2 scheduled daily full DB Backups). This was after a scripted check to ensure that no client backups were running (none were) and, we think, just prior to the start of the DB Backup. The check for client sessions is not so much to ensure nothing runs during the DB Backup - it's to ensure that all client data reaches the diskpool and then gets migrated or copied to tape pools before the DB Backup starts. Still, for us, the rogue client backup was an atypical event.

2) The DB Backup stalled immediately, sitting at 0 pages backed up for over an hour, but TSM services were not hung and did not fail. The client backup was progressing fine and pushing a lot of data. To resolve the stalled DB Backup, we cancelled the client backup session (no effect). We then cancelled the DB Backup process (no effect - it entered 'Cancel Pending' state and sat there for an hour).

3) At that point, we decided to halt/restart the TSM server process. 'dsmserv' came back up normally.

4) We then repeated the 1st DB Backup, which progressed normally in the usual time and reported successful completion. We continued with Daily Task processing, which went smoothly up to the 2nd DB Backup.

5) Almost immediately after startup (0 DB pages backed up), the 2nd DB Backup process failed with a 'dballoc.c / SMP page mismatch / initialization of DB page allocator failed' error (something to that effect).

6) We decided to halt/restart services a 2nd time. This time, services wouldn't restart. There were no errors in dsmserv.err, no OS/hardware errors in /var/adm/messages, and no core file. After working with Tivoli Support, we started dsmserv in the foreground and saw that it was now reporting the same 'dballoc.c' error as the attempted/failed DB Backup earlier in the day.

7) We elected to perform a TSM Restore DB from the 1st DB Backup that day (the repeat attempt that reported successful completion). The Restore DB successfully reformatted the TSM Log and restored 100% of DB pages in about 3 hours, but then failed during DB initialization with the same 'dballoc.c' error.

8) On the outside chance there was a log-pinned/log-full condition, we performed a TSM Extend Log, which ran to completion and added 400MB. We then attempted to restart TSM, but 'dsmserv' failed with the same 'dballoc.c' error.

9) With the extended log now in place, we repeated step 7. The Restore DB performed identically, failing after 3 hours.

10) We rolled back to the 2nd DB Backup from the prior day and performed another Restore DB, which succeeded after 3 hours. We immediately disabled client sessions and then performed Audit Volume on all diskpool volumes. Tape reclamation had not been performed the prior day. We pulled/preserved all copypool tapes created the day the problem occurred and also pulled/preserved the apparently-corrupt DB Backup tape. (Approximate commands for the Extend Log in step 8 and the Audit Volume here are sketched below.)
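For reference, the Extend Log in step 8 and the Audit Volume in step 10 correspond roughly to the commands below (volume names genericized, syntax as best I recall from the 4.1 utilities, so don't treat it as gospel):

   # step 8 - format a new 400MB log volume, then extend the recovery log (server down)
   ./dsmfmt -m -log /tsm/log/logvol2 400
   ./dsmserv extend log /tsm/log/logvol2 400

   # step 10 - with the restored server back up, block clients and audit each diskpool volume
   # (repeat the audit per volume; fix=yes/no as appropriate)
   dsmadmc -id=admin -password=xxxxx "disable sessions"
   dsmadmc -id=admin -password=xxxxx "audit volume /tsm/disk/diskvol01 fix=no"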