The following commit has been merged in the master branch:
commit addee4295946fc9af6027fc555b27945d2edead9
Author: Andrew Deason <adea...@sinenomine.net>
Date:   Mon Nov 6 13:49:59 2023 -0600

    LINUX: Block non-fatal signals when sleeping
    
    Currently, when sleeping on LINUX, we either block all signals
    (afs_osi_Sleep), or do not block signals at all (afs_osi_SleepSig).
    The only caller of afs_osi_SleepSig() is afs_read(), with the
    intention that a user can interrupt the process while it's waiting for
    data from an unresponsive server.
    
    This can cause problems when afs_read() is called during a paging
    request. In this case, if we are interrupted by a signal, we'll return
    an error (EINTR) from afs_linux_readpage(), which causes a SIGBUS to
    be sent to the process if it's not already dying.
    
    If we're interrupted by a fatal signal, the process is already dying
    so this is fine. But if we're interrupted by a non-fatal signal with
    an installed signal handler, the SIGBUS almost always causes the
    process to die immediately (and maybe dump core), where it otherwise
    would have been fine.
    
    This can be very confusing to a user when it happens, since it's not
    immediately obvious that AFS was involved at all; the dumped core
    often just shows the SIGBUS generated during a mundane memory load or
    store. This situation is most easily seen when running golang out of
    /afs, but has also been seen with git and other programs. Anything
    that makes heavy use of signals while data is being fetched from /afs
    is likely to trigger the behavior.
    
    This problem in general may not be specific to Linux, since the
    relevant code path is in cross-platform code (afs_read ->
    afs_osi_SleepSig). But notably, this does not happen on Solaris[1];
    other platforms may have their own quirks that prevent this from
    becoming a problem.
    
    To avoid this for Linux, block all signals except SIGKILL when we're
    sleeping via afs_osi_SleepSig(). This allows the process to be killed
    if needed, but prevents interruption from any non-fatal signal.
    
    Ideally we would put our process in TASK_KILLABLE state, using
    functions like wait_event_killable() or wait_event_state() instead of
    blocking signals. But these functions have evolved over time, making
    this approach complex or even impossible for various Linux versions in
    our current design. Future commits may improve this; for now, do the
    simpler fix and just block signals.
    
    We could theoretically still allow non-fatal signals to interrupt
    sleeps when called from a syscall like read(). But this is difficult
    with the current structure of our Linux integration (syscall i/o is
    implemented on top of the paging system, like most Linux filesystems),
    and other filesystems tend to not do this.
    
    Also, interrupting a stalled afs_read() in general doesn't currently
    work very well anyway. The only callers of afs_osi_SleepSig() look
    like this, to wait for a background fetch (BOP_FETCH) to complete:
    
        while (!code && tdc->mflags & DFFetchReq) {
        /* other locks, etc */
        ReleaseReadLock(&tdc->lock);
        code = afs_osi_SleepSig(&tdc->validPos);
        ObtainReadLock(&tdc->lock);
        }
    
    A signal will cause afs_osi_SleepSig() to return, but then we must
    wait to get tdc->lock. The background BOP_FETCH operation will keep
    tdc->lock write-locked for the entire fetch from the fileserver
    (afs_CacheFetchProc()), so we won't be able to continue until the
    fetch completes.
    
    Future commits may improve this, but for now just avoid the
    unnecessary SIGBUS errors.
    
    [1] On Solaris, the equivalent code path (afs_GetOnePage()) does not
    go through afs_read() with noLock==0, but instead calls
    afs_GetDCache() to handle fetching data. It does call afs_ustrategy()
    to fill a page, which does technically call afs_read(), but with
    noLock==1, and so avoids the case where we submit a BOP_FETCH and
    wait for it. Fetching data from the network happens in the
    afs_GetDCache() call, and so does not use afs_osi_SleepSig().
    
    Change-Id: Ic9c0c35bd1a2a8fd1901d91dae10cb7719194d25
    Reviewed-on: https://gerrit.openafs.org/15637
    Reviewed-by: Cheyenne Wills <cwi...@sinenomine.net>
    Reviewed-by: Mark Vitale <mvit...@sinenomine.net>
    Tested-by: BuildBot <build...@rampaginggeek.com>
    Reviewed-by: Benjamin Kaduk <ka...@mit.edu>

 src/afs/LINUX/osi_sleep.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

-- 
OpenAFS Master Repository
_______________________________________________
OpenAFS-cvs mailing list
OpenAFS-cvs@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-cvs

Reply via email to