> On Tue, Sep 01, 2020 at 03:43:37PM +0100, Jose M Calhariz wrote:
> > An example of the errors in the logs:
> > 
> > afs: disk cache read error in CacheItems slot 350195 off 28015620/35000020 code -4/80
> > afs: Error while alloc'ing cache slot for file 204:536874423.964.4794; failing with an i/o error

Hi, I'm the person who mentioned this briefly during the AFS workshop
this week. These messages are not a problem in themselves; they just
report that we got an error code from the Linux kernel while trying to
read from the disk cache.

On Tue, 1 Sep 2020 16:07:55 -0700
Benjamin Kaduk <ka...@mit.edu> wrote:

> This error message is supposed to indicate that a read from the cache
> filesystem got EIO, which in turn is supposed to indicate a physical
> problem with the drive.  That said, I'm not going to jump to conclusions
> and try to blame your drive, as there are several other things that could
> be coming into play.

The code logged is -4, which is EINTR (EIO would be -5). The most likely
trigger is a process receiving SIGKILL (or another fatal signal) while
we were reading from the disk cache. Traditionally we wouldn't get
errors in that case, but at some point Linux started returning errors in
that situation (possibly depending on the local filesystem in use, but I
don't recall exactly).
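
To check which errno a given message is reporting, something like the
following rough Python sketch may help. It assumes the message format
matches the line quoted above; the regex and the group names (slot, off,
size, code, extra) are my own labels rather than anything defined by
OpenAFS, and the number after the slash is captured but not interpreted.

```python
import errno
import re

# Pattern for the "disk cache read error" line quoted in this thread.
# Group names are illustrative labels only, not OpenAFS terminology.
CACHE_ERR = re.compile(
    r"afs: disk cache read error in CacheItems slot (?P<slot>\d+) "
    r"off (?P<off>\d+)/(?P<size>\d+)\s+code (?P<code>-?\d+)/(?P<extra>\d+)"
)


def describe(line):
    """Return (slot, errno name) for a matching log line, else None."""
    m = CACHE_ERR.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    # The logged code is a negated errno, so -4 corresponds to errno 4.
    name = errno.errorcode.get(abs(code), "unknown")
    return int(m.group("slot")), name


line = ("afs: disk cache read error in CacheItems slot 350195 "
        "off 28015620/35000020 code -4/80")
print(describe(line))
```

On Linux this reports EINTR for the quoted message; a genuine disk I/O
failure would show up as EIO instead.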

If you think these messages coincide with some other bug or problem,
that's possible, but the messages themselves are not a problem. If you
want to avoid the situation that causes these messages, you can try to
avoid SIGKILL'ing the relevant processes, if you know what's causing
that. The message you've shown doesn't log the pid, but there is already
a change in 1.8.8pre1 to log the pid and some other information in that
log message.

If you want the specific patch to add some more info to that log
message, it's here (gerrit 14437):

https://git.openafs.org/?p=openafs.git;a=patch;h=5d863b4f6e817b1cc2615265c7747e17a2037ae6

I know of at least one bug that can be triggered by the log message
you've mentioned, which is fixed by gerrit 14451 here:

https://git.openafs.org/?p=openafs.git;a=patch;h=c55607d732a65f8acb1dfc6bf93aee0f4409cecf

That's also in 1.8.8pre1, so if it's feasible for you to just try
1.8.8pre1, that's probably easiest. The messages will still appear with
1.8.8pre1, but they may be more informative, and some other related bugs
may be fixed. If you are seeing some other problematic behavior with
1.8.8pre1, I can take a look if you provide some details.

-- 
Andrew Deason
adea...@sinenomine.net
