I don't know if our fileserver crashes are related to what Matthew Cocker
and others are seeing, but we are indeed seeing problems here at umich.

Background information:
The machines in question are dual Pentium 4 machines with
hyperthreading enabled, running Linux 2.4.26 (SMP) and glibc 2.3.2.  The
actual file storage is on "cheap" RAID devices that use multiple IDE
drives but talk SCSI to the rest of the world.  These RAIDs have their
own set of problems, so I would not count them as super-reliable file
storage.  We're running the "pthreads" version of the fileserver.

I think we're seeing at least 3 distinct problems with OpenAFS 1.2.11.

The first may actually be networking. We get these with varying
frequency in VolserLog:
Sun Oct 10 22:05:09 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 13:39:23 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Tue Oct 12 15:06:46 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
Helpful message, eh?  Can't tell what volume was being dumped,
or where it was going.

Well, -1 basically means the rx connection timed out.  There should be
a corresponding error on whatever client was doing the dump, unless
that client decided to abort the call.  We see that all the time,
because there are cases where our backup system will parse the start of
a volume dump, decide it doesn't want it after all, and abort.
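For what it's worth, the timestamps are about the only useful thing in
those lines, but they're enough to line the failures up against
whatever your backup system logs.  Here's a minimal Python sketch that
tallies them per day; it assumes the stock log format quoted above and
the usual /usr/afs/logs/VolserLog location:

import re
from collections import Counter

# Matches entries like:
# Tue Oct 12 11:38:07 2004 1 Volser: DumpVolume: Rx call failed during dump, error -1
FAIL_RE = re.compile(
    r'\w{3} (\w{3}) +(\d+) \d\d:\d\d:\d\d (\d{4})'
    r' .*DumpVolume: Rx call failed during dump, error -1'
)

def tally_failures(path):
    """Count 'error -1' dump failures per calendar day, so they can
    be matched by timestamp against the backup system's own logs."""
    per_day = Counter()
    with open(path) as log:
        for line in log:
            m = FAIL_RE.match(line)
            if m:
                month, day, year = m.groups()
                per_day['%s %s %s' % (year, month, day)] += 1
    return per_day

if __name__ == '__main__':
    # Keys sort lexically, which is good enough for eyeballing.
    for day, count in sorted(tally_failures('/usr/afs/logs/VolserLog').items()):
        print(day, count)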


At various times we have also seen read-only replicas that are oddly
truncated.  This might or might not be a consequence of the previous
problem.

Hm.  That sounds familiar, but I thought that bug was fixed some time
ago.  In fact, Derrick confirms that the fix is in 1.2.11.
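If you want to sweep for replicas that are still short, one crude
check is to compare each RO clone's reported size against its RW
parent.  A sketch, with the caveat that the 'vos examine' parsing
below assumes the usual one-line summary layout (name, ID, type,
size in K, status), and that modest differences are normal whenever
the RW has changed since the last release:

import subprocess
import sys

def volume_size_kb(name):
    """Parse the size from the first line of 'vos examine <name>',
    normally something like:
        volname   536870915 RW   123456 K  On-line
    Adjust the field index if your vos prints differently."""
    out = subprocess.run(['vos', 'examine', name],
                         capture_output=True, text=True, check=True).stdout
    return int(out.splitlines()[0].split()[3])

def check_replica(name):
    """Flag an RO clone much smaller than its RW parent; a large
    shortfall may indicate a truncated replica."""
    rw = volume_size_kb(name)
    ro = volume_size_kb(name + '.readonly')
    if ro < rw:
        print('%s: RW=%d K, RO=%d K (short by %d K)' % (name, rw, ro, rw - ro))

if __name__ == '__main__':
    for vol in sys.argv[1:]:
        check_replica(vol)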

Another, probably completely different, problem concerns volumes with
really small volume IDs.  Modern AFS software creates large 10-digit
volume IDs, but we have volumes that were created long before AFS 3.1,
with small 3-digit volume IDs.  Those volumes are rapidly disappearing:
one by one, during various restarts, the fileserver and salvager
discard all the data, then the volume header.

That's... bizarre. I've never heard of such a thing, but then, we don't have any Linux fileservers in our cell. I understand the Andrew cell was seeing this for a while, but it went away without anyone successfully debugging it.
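In the meantime it might be worth inventorying which volumes still
carry those ancient IDs, so the data can be copied somewhere safe
before a restart eats them.  A hypothetical sketch that scrapes
'vos listvldb' output; the 10000 cutoff is arbitrary, purely for
illustration:

import re
import subprocess

SMALL_ID = 10000  # arbitrary threshold, not an official boundary

def small_id_volumes():
    """Report RW/RO/BK volume IDs below SMALL_ID from 'vos listvldb',
    pairing each with the (unindented) volume name line above it."""
    out = subprocess.run(['vos', 'listvldb'],
                         capture_output=True, text=True, check=True).stdout
    current, hits = None, []
    for line in out.splitlines():
        if line and not line[0].isspace():
            current = line.split()[0]  # a volume name line
        for kind, vid in re.findall(r'(RWrite|ROnly|Backup):\s+(\d+)', line):
            if int(vid) < SMALL_ID:
                hits.append((current, kind, int(vid)))
    return hits

if __name__ == '__main__':
    for name, kind, vid in small_id_volumes():
        print('%s %s %d' % (name, kind, vid))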


The last problem you describe sounds suspiciously like something
Derrick has been trying to track down for the last 2 or 3 weeks.  I'll
leave that to him, since he has a better idea than I do of its current
status.
