(This came up in chat, and since there was no agreement at all there it seems it ought to be discussed here.)
It is possible for write() calls to fail partway through, after already having written some data. We do not currently document the behavior under these circumstances (though we should), and some experimentation suggests that at least some of the behavior violates the principle of least surprise.

Basically, it is not feasible to check for and report all possible errors ahead of time, nor in general is it possible or even desirable to unwind portions of a write that have already been completed. That means that if a failure occurs partway through a write there are two reasonable choices for proceeding:

   (a) return success with a short count reporting how much data has already been written;
   (b) return failure.

In case (a) the error gets lost unless additional steps are taken (which as far as I know we currently have no support for); in case (b) the fact that some data was written gets lost, potentially leading to corrupted output. Neither of these outcomes is optimal, but optimal (detecting all errors beforehand, or rolling back the data already written) isn't on the table.

It seems to me that for most errors (a) is preferable, since correctly written user software will detect the short count, retry with the rest of the data (a sketch of such a retry loop appears below), and hit the error case directly; but it seems not everyone agrees with me.

The cases that exist (going by the errors documented in write(2)) are:

   ENOSPC/EDQUOT  (disk fills during the I/O)
   EFBIG          (file size exceeds a limit)
   EFAULT         (invalid user memory)
   EIO            (hardware error)
   EPIPE          (pipe gets closed during the I/O)

In the first three cases it's notionally possible to check for the error condition beforehand, but that doesn't actually work because the activities of other processes or threads while the I/O is in progress can invalidate the results of any check. (Also, for EFAULT the check is expensive.)

Some of the same cases (particularly EFAULT and EIO) exist for read. (Note that while for ordinary files stopping a partial read, discarding the results, and returning failure is harmless, this is not the case for pipes, ttys, and sockets, so this also matters for read.)

We were experimenting with the EFAULT behavior by using mprotect() to deny access to part of a buffer and then writing the whole buffer out (a sketch of the test setup appears below). The results so far (with sufficiently large buffers):

 - for pipes, ttys, and probably everything that uses ordinary uiomove, the data in the accessible part of the buffer is written out and the call fails with EFAULT.
 - for regular files on ffs, and probably most things that use uiomove_ubc, the data in the accessible part of the buffer is written, the call fails with EFAULT, and the size of the file is reverted to what it was at the start.
 - nobody's tested sockets yet, I think.
 - in all cases the mtime is updated.

The size reversion does unwind the common case, but in other cases it produces bizarre behavior: e.g. if you have a 1M file and you write 2M to it and then fault, the existing 1M of the file is replaced with the first 1M of what you wrote and the rest is discarded. Plus, given that the call failed, most users' first instinct would be to assume that nothing was written.

The behavior is probably the same for the other errors, though I haven't looked, and it's definitely possible that ENOSPC/EDQUOT are handled more carefully.

Anyhow, if you've made it this far, the actual question is: is the current behavior really what we want? (Whether or not it's technically correct, or happens to be consistent with the exact wording in the man pages, various aspects of it seem undesirable.)
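For reference, the mprotect() test mentioned above is roughly the following shape (a sketch only, not the exact program we ran; the 2M buffer size and the "testfile" target are arbitrary):

/*
 * Sketch of the EFAULT test: map a 2M buffer, revoke access to the
 * second half, then write the whole buffer in one call and see what
 * comes back.
 */
#include <sys/mman.h>

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE (2*1024*1024)

int
main(void)
{
        char *buf;
        ssize_t r;
        int fd;

        buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE, -1, 0);
        if (buf == MAP_FAILED)
                err(1, "mmap");
        memset(buf, 'x', BUFSIZE);

        /* deny all access to the second half of the buffer */
        if (mprotect(buf + BUFSIZE/2, BUFSIZE/2, PROT_NONE) == -1)
                err(1, "mprotect");

        fd = open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd == -1)
                err(1, "open");

        /* write the whole buffer; the second half is inaccessible */
        r = write(fd, buf, BUFSIZE);
        if (r == -1)
                printf("write failed: errno %d (%s)\n", errno, strerror(errno));
        else
                printf("write returned %zd of %d\n", r, BUFSIZE);
        return 0;
}

For the regular-file size-reversion case, point it at an existing smaller file and drop O_TRUNC; for pipes and ttys, open the appropriate target instead.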
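To be concrete about what I mean above by "correctly written user software": the usual retry loop is something like the following (again just a sketch; the name write_all isn't from anywhere in particular):

/*
 * The usual userland retry loop: keep writing until everything has
 * gone out, and treat -1 as a hard failure.  With behavior (a) a
 * partial write is retried and the next call reports the error
 * directly; with behavior (b) the caller has no way to know how much
 * of the buffer actually made it out.
 */
#include <errno.h>
#include <unistd.h>

ssize_t
write_all(int fd, const void *buf, size_t len)
{
        const char *p = buf;
        size_t resid = len;
        ssize_t r;

        while (resid > 0) {
                r = write(fd, p, resid);
                if (r == -1) {
                        if (errno == EINTR)
                                continue;
                        return -1;      /* len - resid bytes did go out */
                }
                p += r;
                resid -= r;
        }
        return len;
}

With (a), this loop behaves sensibly for all the cases above; with (b), whatever was already written is silently unaccounted for, which is the corruption problem mentioned earlier.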
ISTM that for all these cases except EIO it's sufficient to return success with a short count and let the user code retry with the rest of its data. For EIO I think it's best to do that and also retain the error somewhere for the next write attempt.

--
David A. Holland
dholl...@netbsd.org