This is an attempt to summarize my investigation of corruption at the
highest levels of my filesystems.
Seriously now, several months ago we had several incidents of bizarre
volume behavior:
o 'vos backup' failed.
o 'vos move' failed.
o 'vos dump' failed.
o volume contents completely accessible.
o salvage makes lots of the contents go away.
I finally discovered that I could rescue the contents of these volumes
by 'tar | tar'ing the contents to a newly created volume; then
removing the old volume. Sometimes the old volume needed to be
zapped.
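
For anyone wanting to script the copy step, here is a minimal sketch in
Python of the 'tar | tar' pipeline. The function name and the directory
arguments are hypothetical; it assumes a 'tar' binary on the PATH and
that the new volume has already been created and mounted. Creating the
new volume and removing or zapping the old one are not covered here.

```python
import subprocess

def copy_tree_via_tar(src_dir, dst_dir):
    """Rough equivalent of: (cd src && tar cf - .) | (cd dst && tar xpf -)"""
    # Pack the source tree to stdout, relative to src_dir.
    packer = subprocess.Popen(
        ["tar", "cf", "-", "."], cwd=src_dir, stdout=subprocess.PIPE
    )
    # Unpack from stdin into dst_dir, preserving permissions.
    unpacker = subprocess.Popen(
        ["tar", "xpf", "-"], cwd=dst_dir, stdin=packer.stdout
    )
    # Close our copy of the pipe so the packer sees a broken pipe
    # if the unpacker exits early.
    packer.stdout.close()
    unpack_rc = unpacker.wait()
    pack_rc = packer.wait()
    if pack_rc != 0 or unpack_rc != 0:
        raise RuntimeError(
            f"tar copy failed (pack={pack_rc}, unpack={unpack_rc})"
        )
```

Checking both exit codes matters here: a damaged directory may make
the packing side fail partway while the unpacking side still exits
cleanly.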
This happened sufficiently often (twice with the same volume!) that I
began to wonder what was causing the corruption. Transarc suggested
that I 'fsck' the filesystems holding the AFS partitions that contain
the offending volumes.
[I believe that my server configuration is relevant here. I use three
very old IBM RS/6000 model 530s running AIX 3.2.5 with the AFS
filesystem overflow PTF and AFS 3.3a. Each server has a rich variety
of IBM and Seagate disks.]
I thought this suggestion somewhat odd, since one of the servers on
which the problem occurred has neither crashed nor had any disk errors
(I check for disk errors automatically daily). However, I tried
checking anyway and found errors of the type
** Phase 6 - Check Block Map
Bad Block Map; SALVAGE? y
** Phase 6b - Salvage Block Map
on _some_ (but not all) of the partitions which had misbehaving volumes.
I then salvaged the partitions repeatedly until the SalvageLog indicated
no more errors. (One of my burning questions is why one sometimes has
to 'salvage' more than once to eliminate errors.)
I decided to wait a month or so and try this again on one of the
servers (21GB total storage). The result is that all the fsck's were
OK, but there were a fair number of 'salvage' errors.
Specifically, a type of error which pops up quite frequently is
directory vnode 1: invalid entry .gopherrc~ [0-13634] ; deleted
This error showed up in nine different volumes in the first cleanup,
and in six volumes in the second. Hence the subject line.
Although I haven't encountered any more un-backupable volumes since
the first cleanup, it is somewhat unsettling that damage which
'salvage' thinks it needs to fix seems to accumulate in AFS volumes
during the regular course of events. There were 158 such errors in
the first cleanup and 32 in the second.
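
For anyone who wants to make the same kind of tally on their own
servers, here is a minimal sketch in Python that counts 'invalid entry'
messages per file name. The regular expression is modeled only on the
single log line quoted above, so the real SalvageLog format in other
AFS releases may need a different pattern; the function name is
hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this post (assumption):
#   directory vnode 1: invalid entry .gopherrc~ [0-13634] ; deleted
INVALID_ENTRY = re.compile(
    r"directory vnode (?P<vnode>\d+): invalid entry (?P<name>\S+) "
    r"\[(?P<ids>[^\]]*)\]"
)

def tally_invalid_entries(lines):
    """Count 'invalid entry' salvage messages per file name."""
    counts = Counter()
    for line in lines:
        match = INVALID_ENTRY.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

It can be fed an open log file, e.g.
tally_invalid_entries(open(path_to_salvagelog)), and the result's
most_common() method lists the most frequently damaged names first.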
Questions:
o What caused the original debilitating volume access problem? (probably
unanswerable at this point)
o What causes the 'directory vnode nnn: invalid entry' errors? Is this
a known bug? Is it fixed in 3.4?
o Is it prudent to fsck and salvage all partitions on an AFS fileserver
periodically? If so, how often?
o What is so special about '.gopherrc'? (It is possible that it is one
of the first files written to new home directory volumes.)
I am posting this to info-afs to determine if anyone else has encountered
such problems. The Transarc problem number is TR-13810.
-Rick
--
|Rick Cochran 607-255-7223|
|Cornell Materials Science Center [EMAIL PROTECTED]|
|E20 Clark Hall, Ithaca, N.Y. 14853 cornell!msc.cornell.edu!rick|
| "Workstations - I bet you can't eat just one!" |