RE: dump(8) race conditions?
On 07-Feb-02 Markus Stumpf wrote:
> We use amanda and dump for backups.
> Some hosts have rather busy disks even during non-prime-time hours
> when the backup is run. From time to time amanda reports dump(8)
> errors like the following:
>
> sendbackup: info end
> | DUMP: Date of this level 5 dump: Wed Feb 6 01:53:12 2002
> | DUMP: Date of last level 4 dump: Mon Feb 4 02:31:40 2002
> | DUMP: Dumping /dev/rda4s1e (/share/turing/disk07) to standard output
> | DUMP: mapping (Pass I) [regular files]
> | DUMP: mapping (Pass II) [directories]
> | DUMP: estimated 2423080 tape blocks.
> | DUMP: dumping (Pass III) [directories]
> | DUMP: dumping (Pass IV) [regular files]
> | DUMP: 14.72% done, finished in 0:28
> | DUMP: 33.78% done, finished in 0:19
> | DUMP: 52.84% done, finished in 0:13
> | DUMP: 71.65% done, finished in 0:07
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921522]: count=3072
> ? DUMP: DUMP: read error from /dev/rda4s1e: Invalid argument: [sector -410921522]: count=512
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921532]: count=5120
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -1001057530]: count=1024
> [ ... ]
>
> The first time we saw this we took the machine down to single user,
> unmounted the disk and fsck'd it. No errors were found and the next
> backups (even level 0) made it without errors.
>
> As we were still suspicious as to what might be the reason for these
> really sporadic error messages from different machines and different
> disks, I looked through the source of dump.
>
> If I interpret the code correctly, dump caches directory inode lists.
> Now, if during a dump and after caching the inode info files get
> removed/shrunk, dump has a dirty cache and tries to access blocks
> that are not/no longer allocated, and the result is the above errors.
>
> Am I right with my interpretation or are these really hardware errors?

You are essentially correct, and your message is probably a good
reminder for those of us who routinely use dump on active filesystems.
Dump is a two-pass system, and any activity which modifies inodes
between the first pass and the second is likely to cause problems,
either for dump or for restore. It has always been thus, even as far
back as V7 (and probably V6).

Dumps which report errors such as the ones you mention are likely to
cause difficulties on restore. Sometimes they will be completely
unreadable; sometimes partial or interactive restores will succeed
(for some files). It is even possible that the dump may be completely
restorable, but with corrupted files.

On the other hand, dumps which *don't* report errors can still be
subtly corrupted. Elizabeth Zwicky, in a ten-year-old paper entitled
'Torture Testing Backup and Archive Programs', discusses a couple of
situations where this can occur.

It is operationally (and sometimes politically) difficult to dump
unmounted filesystems, so most of us (I think) bite the bullet and try
to dump at times when the subject filesystem is likely to be quiescent.
It may also be smart to dump more frequently than otherwise called for,
just to increase the odds. Your message reminds us of the risks we take.

It is worth noting that a filesystem can be active and dump will still
succeed, provided no activity alters inodes significantly between
passes.

--
Duane H. Hesser
[EMAIL PROTECTED]

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message
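The race described in this reply can be illustrated with a toy model.
The sketch below is not dump's real code: the names ToyFS, map_pass and
dump_pass are invented here, and the "filesystem" is just two Python
dictionaries. It only shows why block lists recorded in an early pass
can be stale by the time a later pass reads them.

```python
# Toy model of the inter-pass race. All names are hypothetical; this is
# an illustration of the idea, not how dump(8) is implemented.

class ToyFS:
    def __init__(self):
        self.inodes = {}   # inode number -> list of block numbers
        self.blocks = {}   # block number -> contents

    def create(self, ino, data_blocks):
        # Allocate fresh block numbers after the highest one in use.
        start = max(self.blocks, default=0) + 1
        nums = list(range(start, start + len(data_blocks)))
        for n, d in zip(nums, data_blocks):
            self.blocks[n] = d
        self.inodes[ino] = nums

    def truncate(self, ino, keep):
        # Free every block past the first `keep` blocks of the file.
        for n in self.inodes[ino][keep:]:
            del self.blocks[n]
        self.inodes[ino] = self.inodes[ino][:keep]

    def remove(self, ino):
        self.truncate(ino, 0)
        del self.inodes[ino]


def map_pass(fs):
    """Early pass: remember which blocks each inode owns right now."""
    return {ino: list(blks) for ino, blks in fs.inodes.items()}


def dump_pass(fs, cached):
    """Later pass: read the remembered blocks; report the stale ones."""
    dumped, errors = [], []
    for ino, blks in sorted(cached.items()):
        for n in blks:
            if n in fs.blocks:
                dumped.append((ino, n))
            else:
                errors.append((ino, n))   # block no longer allocated
    return dumped, errors


def demo():
    fs = ToyFS()
    fs.create(2, ["a", "b", "c"])
    fs.create(3, ["d", "e"])
    cached = map_pass(fs)
    fs.truncate(2, 1)   # a file shrinks between the passes
    fs.remove(3)        # another file is deleted between the passes
    return dump_pass(fs, cached)
```

In this model every stale block is detected. On a real filesystem a
freed block may be handed to another file before the later pass reads
it, in which case the read succeeds and returns the wrong data, which
is the "no errors reported, but subtly corrupted" case mentioned above.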
dump(8) race conditions?
We use amanda and dump for backups.
Some hosts have rather busy disks even during non-prime-time hours
when the backup is run. From time to time amanda reports dump(8)
errors like the following:

sendbackup: info end
| DUMP: Date of this level 5 dump: Wed Feb 6 01:53:12 2002
| DUMP: Date of last level 4 dump: Mon Feb 4 02:31:40 2002
| DUMP: Dumping /dev/rda4s1e (/share/turing/disk07) to standard output
| DUMP: mapping (Pass I) [regular files]
| DUMP: mapping (Pass II) [directories]
| DUMP: estimated 2423080 tape blocks.
| DUMP: dumping (Pass III) [directories]
| DUMP: dumping (Pass IV) [regular files]
| DUMP: 14.72% done, finished in 0:28
| DUMP: 33.78% done, finished in 0:19
| DUMP: 52.84% done, finished in 0:13
| DUMP: 71.65% done, finished in 0:07
? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921522]: count=3072
? DUMP: DUMP: read error from /dev/rda4s1e: Invalid argument: [sector -410921522]: count=512
? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921532]: count=5120
? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -1001057530]: count=1024
[ ... ]

The first time we saw this we took the machine down to single user,
unmounted the disk and fsck'd it. No errors were found and the next
backups (even level 0) made it without errors.

As we were still suspicious as to what might be the reason for these
really sporadic error messages from different machines and different
disks, I looked through the source of dump.

If I interpret the code correctly, dump caches directory inode lists.
Now, if during a dump and after caching the inode info files get
removed/shrunk, dump has a dirty cache and tries to access blocks
that are not/no longer allocated, and the result is the above errors.

Am I right with my interpretation or are these really hardware errors?

Thanks,

	\Maex

--
SpaceNet AG          | Joseph-Dollinger-Bogen 14 | Fon: +49 (89) 32356-0
Research Development | D-80807 Muenchen          | Fax: +49 (89) 32356-299
The security, stability and reliability of a computer system is
reciprocally proportional to the amount of vacuity between the ears
of the admin
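When triaging reports like the one above, it can help to pull the
block numbers out of the "read error" lines and look at them. The
sketch below is a heuristic under stated assumptions, not a diagnosis:
the regular expression is written against the exact line format quoted
in this message, and the function name parse_read_errors is invented
here. It flags negative block numbers, which cannot be valid disk
addresses, and shows the same value reinterpreted as an unsigned
32-bit number.

```python
import re

# Matches the "read error" lines as they appear in the quoted report.
# The pattern is an assumption based on that sample only.
READ_ERROR = re.compile(
    r"read error from (?P<dev>\S+): (?P<msg>.+?): "
    r"\[(?P<unit>block|sector) (?P<num>-?\d+)\]: count=(?P<count>\d+)"
)


def parse_read_errors(text):
    """Return one dict per 'read error' line found in `text`."""
    results = []
    for m in READ_ERROR.finditer(text):
        num = int(m.group("num"))
        results.append({
            "device": m.group("dev"),
            "message": m.group("msg"),
            "unit": m.group("unit"),
            "number": num,
            "count": int(m.group("count")),
            # A negative address cannot refer to a real block on disk.
            "negative": num < 0,
            # The same bit pattern read as an unsigned 32-bit value.
            "as_u32": num & 0xFFFFFFFF,
        })
    return results
```

Run over the four error lines above, every entry comes back with
"negative" set. That is at least consistent with dump following a
block pointer that was no longer meaningful, as suggested in the
question, though it does not by itself rule out other causes.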
Re: dump(8) race conditions?
In the last episode (Feb 07), Markus Stumpf said:
> We use amanda and dump for backups.
> Some hosts have rather busy disks even during non prime time hours
> when backup is run. From time to time amanda reports dump(8) errors
> like the following:
>
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921522]: count=3072
> ? DUMP: DUMP: read error from /dev/rda4s1e: Invalid argument: [sector -410921522]: count=512
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -410921532]: count=5120
> ? DUMP: read error from /dev/rda4s1e: Invalid argument: [block -1001057530]: count=1024
> [ ... ]

Dump should ideally be run on an unmounted filesystem. The next best
is to create a snapshot ( /usr/src/sys/ufs/ffs/README.snapshot ) and
dump that.

--
	Dan Nelson
	[EMAIL PROTECTED]
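For readers on later FreeBSD releases (5.0 and newer, not the 4.4
system discussed in this thread), dump(8) has an -L flag that tells it
the filesystem is live, so that it takes a snapshot itself and dumps
that. The sketch below only assembles such a command line; the helper
name build_live_dump_cmd and the example paths are invented here, and
nothing is executed. Check the dump(8) manual page on the target
system before relying on any of the flags.

```python
def build_live_dump_cmd(level, filesystem, outfile, update_dumpdates=False):
    """Build an argv list for a snapshot-based dump of a live filesystem.

    Assumes a FreeBSD release whose dump(8) supports -L. The command is
    returned, not run, so it can be inspected or logged first.
    """
    if not 0 <= level <= 9:
        raise ValueError("dump level must be between 0 and 9")
    cmd = ["dump", f"-{level}", "-L", "-a"]   # -a: skip tape size estimates
    if update_dumpdates:
        cmd.append("-u")                      # record the dump in dumpdates
    cmd += ["-f", outfile, filesystem]
    return cmd
```

A call such as build_live_dump_cmd(0, "/usr", "/backup/usr.dump")
yields a list that could be passed to subprocess.run on a suitable
host. Building the list separately keeps the example safe to try on a
machine that has no dump binary at all.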
Re: dump(8) race conditions?
On Thu, Feb 07, 2002 at 11:54:02AM -0600, Dan Nelson wrote:
> Dump should ideally be run on an unmounted filesystem. The next best
> is to create a snapshot ( /usr/src/sys/ufs/ffs/README.snapshot ) and
> dump that.

True. But on systems that host e.g. mailservers or webservers it's
unacceptable to disrupt services to umount and back up the system :/

$ uname -a
FreeBSD 4.4-RELEASE
$ more /usr/src/sys/ufs/ffs/README.snapshot
/usr/src/sys/ufs/ffs/README.snapshot: No such file or directory

:-)))

Located it in stable, but the README says:

    2) Run dump on the snapshot. You will get a dump that is
    consistent with the filesystem as of the timestamp of the
    snapshot. Note that I have not yet changed dump to set the
    dumpdates file correctly, so do not use this feature in
    production until that fix is made.

:-((

Anyway, I have no problem with the errors per se; I just wanted to
know whether they could result from the race conditions or whether I
had better replace the disks.

	\Maex