(This came up in chat, and since there was no agreement at all there it seems it ought to be discussed here.)
It is possible for write() calls to fail partway through, after already having written some data. We do not currently document the behavior under these circumstances (though we should), and some experimentation suggests that at least some of the behavior violates the principle of least surprise.

Basically, it is not feasible to check for and report all possible errors ahead of time, nor in general is it possible or even desirable to unwind portions of a write that have already been completed. That means that if a failure occurs partway through a write there are two reasonable choices for proceeding:

   (a) return success with a short count reporting how much data has already been written;
   (b) return failure.

In case (a) the error gets lost unless additional steps are taken (which as far as I know we currently have no support for); in case (b) the fact that some data was written gets lost, potentially leading to corrupted output. Neither of these outcomes is optimal, but optimal (detecting all errors beforehand, or rolling back the data already written) isn't on the table.

It seems to me that for most errors (a) is preferable, since correctly written user software will detect the short count, retry with the rest of the data (a sketch of such a retry loop appears below), and hit the error case directly; but it seems not everyone agrees with me.

The cases that exist (going by the errors documented in write(2)) are:

   ENOSPC/EDQUOT  (disk fills during the I/O)
   EFBIG          (file size exceeds a limit)
   EFAULT         (invalid user memory)
   EIO            (hardware error)
   EPIPE          (pipe gets closed during the I/O)

In the first three cases it's notionally possible to check for the error condition beforehand, but that doesn't actually work because the activities of other processes or threads while the I/O is in progress can invalidate the results of any check. (Also, for EFAULT the check is expensive.)

Some of the same cases (particularly EFAULT and EIO) exist for read. (Note that while for ordinary files stopping a partial read, discarding the results, and returning failure is harmless, this is not the case for pipes, ttys, and sockets, so this also matters for read.)

We were experimenting with the EFAULT behavior by using mprotect() to deny access to part of a buffer and then writing the whole buffer out (a sketch of the test setup appears below). The results so far (with sufficiently large buffers):

 - for pipes, ttys, and probably everything that uses ordinary uiomove, the data in the accessible part of the buffer is written out and the call fails with EFAULT.
 - for regular files on ffs, and probably most things that use uiomove_ubc, the data in the accessible part of the buffer is written, the call fails with EFAULT, and the size of the file is reverted to what it was at the start.
 - nobody's tested sockets yet, I think.
 - in all cases the mtime is updated.

The size reversion does unwind the common case, but in other cases it produces bizarre behavior: e.g. if you have a 1M file and you write 2M to it and then fault, the existing 1M of the file is replaced with the first 1M of what you wrote and the rest is discarded. Plus, given that the call failed, most users' first instinct would be to assume that nothing was written.

The behavior is probably the same for the other errors, though I haven't looked, and it's definitely possible that ENOSPC/EDQUOT are handled more carefully.

Anyhow, if you've made it this far, the actual question is: is the current behavior really what we want? (Whether or not it's technically correct, or happens to be consistent with the exact wording in the man pages, various aspects of it seem undesirable.)
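For reference, the mprotect() test mentioned above is roughly the following shape (a sketch only, not the exact program we ran; the 2M buffer size and the "testfile" target are arbitrary):

/*
 * Sketch of the EFAULT test: map a 2M buffer, revoke access to the
 * second half, then write the whole buffer in one call and see what
 * comes back.
 */
#include <sys/mman.h>

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE (2*1024*1024)

int
main(void)
{
        char *buf;
        ssize_t r;
        int fd;

        buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE, -1, 0);
        if (buf == MAP_FAILED)
                err(1, "mmap");
        memset(buf, 'x', BUFSIZE);

        /* deny all access to the second half of the buffer */
        if (mprotect(buf + BUFSIZE/2, BUFSIZE/2, PROT_NONE) == -1)
                err(1, "mprotect");

        fd = open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd == -1)
                err(1, "open");

        /* write the whole buffer; the second half is inaccessible */
        r = write(fd, buf, BUFSIZE);
        if (r == -1)
                printf("write failed: errno %d (%s)\n", errno, strerror(errno));
        else
                printf("write returned %zd of %d\n", r, BUFSIZE);
        return 0;
}

For the regular-file size-reversion case, point it at an existing smaller file and drop O_TRUNC; for pipes and ttys, open the appropriate target instead.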
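To be concrete about what I mean above by "correctly written user software": the usual retry loop is something like the following (again just a sketch; the name write_all isn't from anywhere in particular):

/*
 * The usual userland retry loop: keep writing until everything has
 * gone out, and treat -1 as a hard failure.  With behavior (a) a
 * partial write is retried and the next call reports the error
 * directly; with behavior (b) the caller has no way to know how much
 * of the buffer actually made it out.
 */
#include <errno.h>
#include <unistd.h>

ssize_t
write_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;
        size_t resid = len;
        ssize_t r;

        while (resid > 0) {
                r = write(fd, p, resid);
                if (r == -1) {
                        if (errno == EINTR)
                                continue;
                        return -1;      /* len - resid bytes did go out */
                }
                p += r;
                resid -= r;
        }
        return len;
}

With (a), this loop behaves sensibly for all the cases above; with (b), whatever was already written is silently unaccounted for, which is the corruption problem mentioned earlier.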
ISTM that for all these cases except EIO it's sufficient to return success with a short count and let the user code retry with the rest of its data. For EIO I think it's best to do that and also retain the error somewhere for the next write attempt.

--
David A. Holland
dholl...@netbsd.org