Re: [Q] Why salvaging server occurs frequently??

Marcus Watts Wed, 25 Mar 1998 11:18:46 +0100 (MET)
You wrote:
> From: Kwon Oh-hoon <[EMAIL PROTECTED]>
> Message-Id: <[EMAIL PROTECTED]>
> Subject: [Q] Why salvaging server occurs frequently??
> To: [EMAIL PROTECTED]
> Date: Wed, 25 Mar 1998 13:58:15 +0000 (KST)
> Content-Type: text/plain; charset=EUC-KR
> Sender: [EMAIL PROTECTED]
> 
> 
>     We have three database servers on alpha_osf32 platforms.
>     AFS Product version of these servers is afs3.4 5.38.
>     Our three database servers are also file servers.
>     Because of salvaging file server frequently in DB Server,
>     all users in our cell must stop doing work on almost everyday.
> 
>     In FileLog.old file, I found an error message "file assertion failed".
>     To solve this problem, we upgraded our database servers from afs3.4 4.35
>     to afs3.4 5.38. But, this error occurred again.
> 
>     After using backup command "vos backupsys" for daily backup
>     of all volumes, I think this problem has occurred.
> 
>     Log files are in ftp.transarc.com:/pub/afsps/ftp/pohang-univ :
>     FileLog, SalvageLog, FileLog.old, SalvageLog.old, core.file.fs
> 
>     Question 1) Why salvaging server occurs frequently in this case?
>         How can this error "file assertion failed" be solved?
>     Question 2) /vicepx/V0xxxxxxx.vol file may be removed manually.
>         The volume is not in VLDB and not removed by the command "vos zap".

I assume you mean the files under:
        /afs/transarc.com/public/anon-ftp/pub/afsps/ftp/pohang-univ
As you noted, the important message (why it failed) is:
        Assertion failed! file afsfileprocs.c, line 6016.
To really be sure what this means, it's necessary to contact your
transarc customer support representative.  Assuming, however, that
the build for "afs 3.4 5.38" contains this ident line in "fileserver":
        $Header: /afs/transarc.com/project/fs/dev/afs/3.4/.stage13/rcs/viced/RCS/afsfileprocs.c,v 2.453 1997/09/26 19:08:18 chengjie Exp $
then the assertion on line 6016 fires in the routine CopyOnWrite upon
any read error, or upon any write error other than ENOSPC.  When this assertion
happens, you should also have a core file for the "fileserver" process.
The core dump will probably be named
        /usr/afs/logs/core.file.fs - or some such.
You should probably rename it to something else before studying it; otherwise,
it could be overwritten by another core dump.  You can look at it with
your favorite debugger (say, adb), with something like:
        # adb /usr/afs/bin/fileserver /usr/afs/logs/core.file.fs
        errno/D
        $c
If errno was set by the read or write, then it is likely to be useful
in terms of telling what the problem is.  The $c will tell you where
the assertion was that failed.  If you don't see CopyOnWrite, then
that may mean that some other assertion failed, and you will need to
ask transarc for more clues about what went wrong.  With some patience,
it is also possible to determine what disk, and what volume were being
updated, but you'll really want to have transarc do this for you.
You can facilitate this by saving a copy of your core dump & the
corresponding fileserver binary, somewhere where your transarc customer
service representative can look at it.
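The advice above about renaming the core dump and keeping the matching
binary can be sketched as a small shell helper.  This is only an
illustration: the function name save_core and the destination directory
are assumptions, not part of AFS, and the paths in the comments are the
ones mentioned earlier in this message.

```shell
# Hypothetical helper (not part of AFS): copy the binary and move the
# core dump into a timestamped directory so that a later crash cannot
# overwrite the evidence.
save_core() {
    core=$1      # e.g. /usr/afs/logs/core.file.fs
    binary=$2    # e.g. /usr/afs/bin/fileserver
    destroot=$3  # assumption: any directory with enough free space

    dest="$destroot/core-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # Keep the binary that produced the core; a later upgrade would
    # otherwise leave the debugger with mismatched symbols.
    cp -p "$binary" "$dest/" || return 1
    # Move the core out of the way of the next dump.
    mv "$core" "$dest/" || return 1
    # Print where everything ended up.
    echo "$dest"
}

# Possible use (paths as in the message above, destination is made up):
#   save_core /usr/afs/logs/core.file.fs /usr/afs/bin/fileserver /var/tmp/afs-cores
```

The debugger can then be pointed at the two saved copies instead of the
live files.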

A likely cause is a disk error.  In this case, you should find that errno
is set to EIO.  This will not be the only clue that there are problems.
You should also find that there are messages on the console about disk
read and write errors, and these messages should also be recorded in some
file on the system (often /var/adm/messages, but check to be sure.)
These messages should include the name of the disk that was failing,
and the block number.  If you do find these, it's well worth your while
to fix this as soon as possible, before you lose much data and time.
A simple way that will find many disk errors is to use "dd" to read the raw
or block device into /dev/null.  Any errors before the end of the disk
are cause for alarm (an error at the *end* of the disk is acceptable; some
Unix disk drivers return an error instead of EOF when this condition is
hit).  Sometimes, your system will also come with a disk diagnostic aid
that can format the disk; fancier versions may contain additional tests
such as a non-destructive sequential read, or a random seek read, or
some sort of write/read surface certification routine.  It is not a bad
idea to run a write/read surface certification routine for a day or so
before putting a new disk into service.   Be careful -- some of those
tests may erase data on the disk.
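
The "dd" read test described above might look roughly like the sketch
below.  The function name scan_disk, the 64k block size, and the log
file are assumptions chosen for illustration; the device name differs
from system to system and is deliberately left as a placeholder.  The
test only reads, so it should not alter data on the disk.

```shell
# Hypothetical helper: read every block of a device and discard the data.
# A non-zero exit status from dd means that a read failed somewhere.
scan_disk() {
    device=$1   # raw or block device to read (system-specific name)
    log=$2      # file that receives dd's record counts and error text

    if dd if="$device" of=/dev/null bs=64k 2>"$log"; then
        echo "read to end without error: $device"
    else
        status=$?
        echo "dd reported an error on $device; see $log" >&2
        return "$status"
    fi
}

# Possible use (placeholder device name, not a real path):
#   scan_disk /dev/rXXX /var/tmp/scan-rXXX.log
```

The helper cannot tell an error in the middle of the disk from the
harmless end-of-disk error mentioned above; the record counts that dd
writes to the log show how far the read got, and that figure can be
compared against the size of the disk.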

                                -Marcus Watts
                                UM ITD PD&D Umich Systems Group