More information. Here is the format of the write error messages:

afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com (all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com (all multi-homed ip addresses down for the server)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
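
To make a burst like the one above easier to compare across runs, here is a minimal sketch that tallies the three message types from saved dmesg text. The regular expressions are copied from the wording quoted above, so other OpenAFS versions may need different patterns, and the function name is only illustrative. On Linux, errno 110 is ETIMEDOUT, which would fit a server that stopped answering, but this sketch only assumes the number in parentheses is an errno.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted above (assumption: wording is stable).
LOST = re.compile(r"afs: Lost contact with file server (\S+) in cell (\S+)")
BACK = re.compile(r"afs: file server (\S+) in cell (\S+) is back up")
STORE = re.compile(r"afs: failed to store file \((\d+)\)")


def summarize_afs_log(text):
    """Count lost-contact, back-up, and store-failure messages in dmesg text."""
    counts = Counter()
    for line in text.splitlines():
        if m := LOST.search(line):
            counts[("lost", m.group(1))] += 1
        elif m := BACK.search(line):
            counts[("back", m.group(1))] += 1
        elif m := STORE.search(line):
            # Key on the numeric code so different failure codes stay separate.
            counts[("store_failed", int(m.group(1)))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com\n"
        "afs: failed to store file (110)\n"
        "afs: file server 9.41.253.103 in cell austin.ibm.com is back up\n"
    )
    for key, n in sorted(summarize_afs_log(sample).items(), key=str):
        print(key, n)
```

Feeding it the output of `dmesg` saved to a file gives one count per server and per error code.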

At that point, here's a partial ps -ef list:

root     12749     1  0 10:18 ?        00:00:00 [afsd]
root     12751     1  0 10:18 ?        00:00:00 [afs_checkserver]
root     12753     1  0 10:18 ?        00:00:00 [afs_background]
root     12755     1  0 10:18 ?        00:00:00 [afs_background]
root     12757     1  0 10:18 ?        00:00:00 [afs_background]
root     12759     1  0 10:18 ?        00:00:00 [afs_background]
root     12761     1  0 10:18 ?        00:00:00 [afs_background]
root     12763     1  0 10:18 ?        00:00:00 [afs_background]
root     12765     1  0 10:18 ?        00:00:00 [afs_background]
root     12767     1  0 10:18 ?        00:00:00 [afs_background]
root     12769     1  0 10:18 ?        00:00:00 [afs_background]
root     12771     1  0 10:18 ?        00:00:00 [afs_background]
root     12773     1  0 10:18 ?        00:00:00 [afs_background]
root     12775     1  0 10:18 ?        00:00:00 [afs_background]
root     12777     1  0 10:18 ?        00:00:00 [afs_background]
root     12779     1  0 10:18 ?        00:00:00 [afs_background]
root     12781     1  0 10:18 ?        00:00:00 [afs_background]
root     12783     1  0 10:18 ?        00:00:00 [afs_background]
root     12821     1  0 10:18 ?        00:00:05 [afs_cachetrim]

Note that there are 16 'zombie' afs_background processes, the same number of 'daemon' processes I specified in OPTIONS.
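
The count can be checked mechanically rather than by eye. This is a rough sketch, assuming the `ps -ef` output has been captured as text and that the command name appears in square brackets at the end of each line, as in the listing above. The function name is hypothetical.

```python
import re
from collections import Counter


def count_bracketed_commands(ps_output):
    """Tally commands that ps shows in square brackets, e.g. [afs_background]."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Match a trailing "[name]" such as "[afs_background]".
        m = re.search(r"\[([^\]]+)\]\s*$", line)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "root     12749     1  0 10:18 ?        00:00:00 [afsd]\n"
        "root     12753     1  0 10:18 ?        00:00:00 [afs_background]\n"
        "root     12755     1  0 10:18 ?        00:00:00 [afs_background]\n"
    )
    print(count_bracketed_commands(sample))
```

The `afs_background` entry in the result can then be compared with the `-daemons` value passed to afsd.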

=======================
Kirby Bakken
ESW Build Architect
Rochester, MN
email: [EMAIL PROTECTED]
ezpage:kirbyb
507-253-4549 / Tie:  553-4549
Fax:  507-253-3495

......one more straw can't possibly matter....



Kirby Bakken/Rochester/[EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
11/13/2006 12:54 PM

To: [email protected]
cc:
Subject: [OpenAFS-devel] write errors, servers going 'down'






Help!


I'm running RHEL4 U4 (uname -r => 2.6.9-42.0.3.ELsmp) x86_64 on one of many dual Opteron 'linux' servers.  Our servers are running 'some' level of afs...  that may be important, but for now I'm trying to figure out where to start debugging....


I get these messages in 'dmesg':


afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com (all multi-homed ip addresses down for the server)

afs: Lost contact with file server 9.10.228.186 in cell rchland.ibm.com (all multi-homed ip addresses down for the server)

afs: file server 9.10.228.186 in cell rchland.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)

afs: file server 9.10.228.186 in cell rchland.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)


I'm also seeing 'write' errors in the dmesg log, but don't currently have an exact 'paste' of that info....


These errors only occur under 'high' load, with multiple processes reading from and writing to the same afs volume.  I'm running these options and cache settings:


LARGE="-stat 2800 -dcache 2400 -daemons 5 -volumes 128"
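
When comparing settings between machines it can help to turn an option string like this into a dictionary. The following is only a sketch: it assumes each flag is either followed by a single value or stands alone, and the function name is made up for illustration.

```python
import shlex


def parse_afsd_options(option_string):
    """Turn '-stat 2800 -dcache 2400' into {'stat': 2800, 'dcache': 2400}."""
    tokens = shlex.split(option_string)
    opts = {}
    i = 0
    while i < len(tokens):
        name = tokens[i].lstrip("-")
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        if has_value:
            value = tokens[i + 1]
            # Keep numeric values as integers so they can be compared directly.
            opts[name] = int(value) if value.isdigit() else value
            i += 2
        else:
            # A flag with no value is recorded as True.
            opts[name] = True
            i += 1
    return opts


if __name__ == "__main__":
    print(parse_afsd_options("-stat 2800 -dcache 2400 -daemons 5 -volumes 128"))
```

The `daemons` entry is the number to compare with the count of afs_background processes in `ps -ef`.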


I had been running with 'medium' settings, and that's when I saw the write errors...  now I just see the 'Lost contact' errors, and 'failed to store file' in the program writing files. (we're compiling/linking at the time these errors occur).


I've got the cache size 'set':


CACHESIZE=600000


although when I cat out the cacheinfo file I get this:


cat /usr/vice/etc/cacheinfo

/afs:/usr/vice/cache:3628512
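
To make the mismatch explicit, here is a small sketch that splits the cacheinfo line into its three colon-separated fields (mount point, cache directory, cache size in 1 KB blocks) and compares the size with the CACHESIZE value. It assumes CACHESIZE is expressed in the same 1 KB units, which is not confirmed here, and the function name is illustrative.

```python
def parse_cacheinfo(line):
    """Split a cacheinfo line into (mount point, cache directory, size in 1 KB blocks)."""
    mount, cache_dir, blocks = line.strip().split(":")
    return mount, cache_dir, int(blocks)


if __name__ == "__main__":
    mount, cache_dir, blocks = parse_cacheinfo("/afs:/usr/vice/cache:3628512")
    configured = 600000  # CACHESIZE from the settings above (assumed to be 1 KB blocks)
    print(f"mount={mount} dir={cache_dir}")
    print(f"cacheinfo size: {blocks} blocks ({blocks / 1024 / 1024:.2f} GiB)")
    print(f"configured size: {configured} blocks")
    if blocks != configured:
        print("cacheinfo and CACHESIZE disagree")
```

If the two values disagree, as they do here, it is worth checking which one afsd actually used at startup.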


I'm seeing these problems both with 'kernel-smp-module-openafs-1.4.0-2.6.9_42.0.3.EL_6_rhel4' and with 'openafs-kernel-smp-1.4.2-2.6.9_42.ELsmp_1.x86_64'


We had been seeing similar problems last March on RHEL4 U3, but an 'intermediate' AFS build of 'openafs-1.4.1rc2-rhel4.0.x86_64' seemed to work 'most of the time'...  (we get hangs about once every two weeks on each of 6 of the dual Opteron servers, and can't even log in to a local console to gather info, so we're not sure whether this is afs related or not).


What do I do to figure this out, or make it go away?  Is there a 'Having problems with openafs?  Here's what to try...' set of instructions somewhere that I've missed?


Thank you very much in advance for any help.


