I guess these problems have nothing to do with your hardware.
Have you checked that the file it tries to read exists in the /vicepX partition
and has the correct size?
If so, I would try running the non-threaded version of the fileserver, which
is built in src/viced. Also, the problem of processes not going away looks
to me like a problem in the pthread environment.
Hartmut
On Mon, 4 Mar 2002 [EMAIL PROTECTED] wrote:
Marco Foglia ([EMAIL PROTECTED]) wrote:
We are having continual trouble with CopyOnWrite failures on a Linux
file server, and orphaned files as a result. We have tried several setups
and even changed hardware, but the problem is still there. The current setup is
redhat-6.2 with 2.2.19-6.2.12smp and OpenAFS 1.2.2.
It only happens after a full backup (vos dump volume.backup -time 0) of
a backup volume. You can find the following message in the file server
log file:
CopyOnWrite failed: volume 536872685 in partition /vicepa (tried
reading 8192, read 0, wrote 0, errno 4) volume needs salvage
Does anybody have the same problems? Solutions?
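
For reference, errno 4 in the message above is EINTR ("Interrupted system call") on Linux, meaning the read was interrupted by a signal before any data was transferred. The sketch below only illustrates how to decode the number and how a read loop can retry after such an interruption. It is not the fileserver's code, and the helper names describe_errno and read_full are made up for this example.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def read_full(fd, nbytes):
    """Read up to nbytes from fd, retrying if a read is interrupted.

    Note: Python 3.5+ already retries most interrupted system calls
    itself, so the except branch is shown only to make the idea explicit.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        try:
            chunk = os.read(fd, remaining)
        except InterruptedError:
            continue  # EINTR before any data arrived: try again
        if not chunk:
            break  # genuine end of file
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    name, message = describe_errno(4)
    print(name, "-", message)
```

A read that returns 0 bytes with EINTR is normally a transient condition rather than evidence of a damaged file, which fits the suspicion that the threading or signal environment is involved.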
I have been fighting this problem for almost a year with no solution.
I didn't post anything to this list before because I wasn't convinced
that the problem wasn't in my hardware. By now, I've tried enough
different combinations that I'm convinced that it's a software problem.
My hardware: Dual 800 MHz Pentium IIIs on two different motherboards,
1 GB SDRAM
DPT Century VI RAID controller
or
Adaptec 3200S RAID controller
or
Mylex AcceleRAID 352 RAID controller
or
non-RAID Symbios SCSI controller
My software: RedHat 6.2, 2.2.16-3
Transarc AFS Base configuration afs3.6 2.0
or
RedHat 6.2, 2.2.16-3
Transarc AFS Base configuration afs3.6 2.3
or
RedHat 6.2, 2.2.16-3
OpenAFS 1.0.3
or
[ many different versions in between ]
or
RedHat 7.2, kernel 2.4.9-21
OpenAFS 1.2.3
As you can see, I've tried many different hardware and software
combinations and I still get corrupted volumes. It can happen
on any backup, not just full backups. I have tried running the
servers in uniprocessor mode and while that seemed to help, it
did not eliminate the problem.
Here are some excerpts from the log files on the most recent
corruption, using the last software configuration above:
FileLog:
Thu Feb 28 15:47:19 2002 CopyOnWrite failed: volume 536877621 in partition /vicepc
(tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage
Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for
/vicepc//V0536877621.vol; volume needs salvage
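
To see how often this happens and on which volumes, the CopyOnWrite lines can be pulled out of FileLog with a short script. This is a rough sketch that assumes the message format shown in the excerpts in this thread. The function name find_cow_failures is made up for the example, and the log path in the usage note is an assumption about the installation.

```python
import re

# Matches the message as it appears in the excerpts, after line wraps
# have been collapsed into single spaces.
COW_RE = re.compile(
    r"CopyOnWrite failed: volume (?P<volume>\d+) "
    r"in partition (?P<partition>/vicep\w+) "
    r"\(tried reading (?P<tried>\d+), read (?P<read>\d+), "
    r"wrote (?P<wrote>\d+), errno (?P<errno>\d+)\)"
)


def find_cow_failures(text):
    """Return one dict per CopyOnWrite failure found in the log text."""
    # Collapse newlines and runs of spaces so wrapped messages match.
    flat = " ".join(text.split())
    failures = []
    for m in COW_RE.finditer(flat):
        entry = {"partition": m.group("partition")}
        for key in ("volume", "tried", "read", "wrote", "errno"):
            entry[key] = int(m.group(key))
        failures.append(entry)
    return failures


if __name__ == "__main__":
    sample = (
        "Thu Feb 28 15:47:19 2002 CopyOnWrite failed: volume 536877621 "
        "in partition /vicepc\n"
        "(tried reading 8192, read 0, wrote 0, errno 4) volume needs salvage\n"
    )
    for failure in find_cow_failures(sample):
        print(failure)
```

To run it against a real log, read the file (for example /usr/afs/logs/FileLog, if that is where the server writes it) and pass its contents to find_cow_failures.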
VolserLog:
Thu Feb 28 15:35:09 2002 1 Volser: Clone: Recloning volume 536876261 to volume
536876263
[...]
Thu Feb 28 15:58:46 2002 VAttachVolume: volume salvage flag is ON for
/vicepc/V0536877621.vol; volume needs salvage
Thu Feb 28 15:58:46 2002 1 Volser: ListVolumes: Could not attach volume 536877621
(V0536877621.vol) error=101
SalvageLog.old:
@(#) OpenAFS 1.2.3 built 2002-02-01
02/28/2002 16:00:48 STARTING AFS SALVAGER 2.4 (/usr/afs/bin/salvager /vicepc
536877621)
02/28/2002 16:00:48 CHECKING CLONED VOLUME 536877622.
02/28/2002 16:00:48 cs.usr0.naveen.backup (536877622) updated 02/28/2002 15:30
02/28/2002 16:00:48 Vnode 1: length incorrect; (is 8192 should be 0)
02/28/2002 16:00:48 SALVAGING VOLUME 536877621.
02/28/2002 16:00:48 cs.usr0.naveen (536877621) updated 02/28/2002 15:46
02/28/2002 16:00:48 Vnode 1198: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 1514: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 2754: version < inode version; fixed (old status)
[ similar lines deleted ]
02/28/2002 16:00:48 Vnode 10078: version < inode version; fixed (old status)
02/28/2002 16:00:48 Vnode 1: length incorrect; changed from 8192 to 0
02/28/2002 16:02:19 First page in directory does not exist.
02/28/2002 16:02:19 Directory bad, vnode 1; salvaging...
02/28/2002 16:02:19 Salvaging directory 1...
02/28/2002 16:02:19 Failed to read first page of fromDir!
02/28/2002 16:02:19 Checking the results of the directory salvage...
02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file
.__afsD8D is deleted (vnode 1120)
02/28/2002 16:02:20 dir vnode 601: special old unlink-while-referenced file
.__afs8036 is deleted (vnode 780)
02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file
.__afsDED1 is deleted (vnode 7988)
[ similar lines deleted ]
02/28/2002 16:02:20 dir vnode 623: special old unlink-while-referenced file
.__afs4975 is deleted (vnode 5862)
02/28/2002 16:02:20 Vnode 1: link count incorrect (was 45, now 2)
02/28/2002 16:02:30 Found 4784