tl;dr I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which together are equivalent to successful uiomove(buf, n, uio).
This allows a driver to separately: 1. transfer data into a buffer (uiopeek) first, and then 2. advance the uio (uioskip) only after determining how much of the data the driver was able to actually process. Thoughts? DETAILS A common pattern when transferring data between a driver and userland is unlocked uiomove(9) into a temporary buffer, and then some I/O operation under a driver lock. The uiomove(9) itself normally can't be done under a driver lock because it may fault and sleep indefinitely for I/O. However, sometimes the I/O operation doesn't consume everything. This can happen in ttwrite (sys/kern/tty.c): in the non-blocking case (flags & IO_NDELAY), it may bite off a chunk of the uio bigger than the tty can chew without blocking. In that case, ttwrite hacks uio->uio_resid so that the write(2) system call reports the correct number of bytes writen so the user program can pick up where it left off (a hack that has existed since 1990 or earlier) -- but the uio itself becomes inconsistent, because it doesn't roll back uio->uio_iov or uio->uio_iov[0].iov_base or anything else. And ktrwrite (sys/kern/kern_ktrace.c) tries to reuse the uio in the event that fp->f_ops->fo_write returns EWOULDBLOCK, after pausing for a moment. This requires the uio to be consistent, which falls apart with ttwrite, as in a litany of syzbot reports like this one: https://syzkaller.appspot.com/bug?id=976210ed438d48ac275d77d7ebf4a086e43b5fcb Now, ktracing to a tty is pretty weird, but the salient point is: (a) Is ttwrite violating a contract by advancing the uio's iovs past the number of bytes it consumed of uio_resid? or (b) Is ktrwrite abusing struct fileops::fo_write by reusing uio even after an error? If (a), the attached patch breaks ttwrite's uiomove into two parts: uiopeek, to get the data into a temporary buffer, which may be then fed byte-by-byte into ttyoutput; and uioskip, to advance the uio by as many bytes as were actually consumed by the tty -- all of them in the success case, but not all if we stop with EWOULDBLOCK. If (b), we have to convince ktrwrite to reconstruct a uio that picks up where fo_write left off in the same buffer. Maybe we could reconstruct it but skip the first n bytes, or take a snapshot of the uio, but either way it would be annoying to deal with -- more fiddly than writing, using, and testing uiopeek/uioskip which I already did. The problem appears not to manifest in OpenBSD or FreeBSD because they have nothing that calls fo_write in a loop with the same uio -- their ktrace is limited to vnodes and doesn't handle EWOULDBLOCK by looping. But their ttwrite routines (or equivalent) do leave the uio inconsistent on error, with the same comment Mike Karels left back in 1990 in the CSRG tty.c 7.21.[*] P.S. I suspect this issue will also apply in the EFAULT case. If you try to uiomove (say) 2 bytes in separate iov entries, and the first iov doesn't fault but the second iov does, uiomove will advance the uio by one byte but return error. The caller probably won't process the first byte even though uio_resid has been decremented. So it will appear as though one byte has been transferred even though that byte was actually just copied to a buffer and then discarded. That might be a more common issue throughout the tree which could also be addressed by separating it into uiopeek and uioskip, but I haven't put further thought to whether it's a real issue or further effort into auditing drivers for it. [*] https://github.com/robohack/ucb-csrg-bsd/commit/b6660865c9f41cce98f8728f7f27b34b4a231e81#diff-37ae898972ce315c35bb81ff0e209efb141a9967efef821c7e31fda8d67d7ff2R1479-R1486
diff --git a/share/man/man9/uiomove.9 b/share/man/man9/uiomove.9 index 6417ea14b684..d2f579253386 100644 --- a/share/man/man9/uiomove.9 +++ b/share/man/man9/uiomove.9 @@ -24,7 +24,7 @@ .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE .\" POSSIBILITY OF SUCH DAMAGE. .\" -.Dd April 26, 2010 +.Dd May 9, 2023 .Dt UIOMOVE 9 .Os .Sh NAME @@ -34,6 +34,10 @@ .In sys/systm.h .Ft int .Fn uiomove "void *buf" "size_t n" "struct uio *uio" +.Ft int +.Fn uiopeek "void *buf" "size_t n" "struct uio *uio" +.Ft void +.Fn uioskip "void *buf" "size_t n" "struct uio *uio" .Sh DESCRIPTION The .Fn uiomove @@ -140,10 +144,35 @@ to point that much farther into the region described. This allows multiple calls to .Fn uiomove to easily be used to fill or drain the region of data. +.Pp +The +.Fn uiopeek +function copies up to +.Fa n +bytes of data without updating +.Fa uio ; +the +.Fn uioskip +function updates +.Fa uio +without copying any data, and is guaranteed never to sleep or fault +even if the buffers are in userspace and memory access via +.Fn uiomove +or +.Fn uiopeek +would trigger paging. +A successful +.Fn uiomove buf n uio +call is equivalent to a successful +.Fn uiopeek buf n uio +followed by +.Fn uioskip n uio . .Sh RETURN VALUES Upon successful completion, .Fn uiomove -returns 0. +and +.Fn uiopeek +return 0. If a bad address is encountered, .Er EFAULT is returned. diff --git a/sys/kern/subr_copy.c b/sys/kern/subr_copy.c index cd934032d33c..02834432d31d 100644 --- a/sys/kern/subr_copy.c +++ b/sys/kern/subr_copy.c @@ -166,6 +166,93 @@ uiomove_frombuf(void *buf, size_t buflen, struct uio *uio) return (uiomove((char *)buf + offset, buflen - offset, uio)); } +int +uiopeek(void *buf, size_t n, struct uio *uio) +{ + struct vmspace *vm = uio->uio_vmspace; + struct iovec *iov; + size_t cnt; + int error = 0; + char *cp = buf; + size_t resid = uio->uio_resid; + int iovcnt = uio->uio_iovcnt; + char *base; + size_t len; + + KASSERT(uio->uio_rw == UIO_READ || uio->uio_rw == UIO_WRITE); + + if (n == 0 || resid == 0) + return 0; + iov = uio->uio_iov; + base = iov->iov_base; + len = iov->iov_len; + + while (n > 0 && resid > 0) { + KASSERT(iovcnt > 0); + cnt = len; + if (cnt == 0) { + KASSERT(iovcnt > 1); + iov++; + iovcnt--; + base = iov->iov_base; + len = iov->iov_len; + continue; + } + if (cnt > n) + cnt = n; + if (!VMSPACE_IS_KERNEL_P(vm)) { + preempt_point(); + } + + if (uio->uio_rw == UIO_READ) { + error = copyout_vmspace(vm, cp, base, cnt); + } else { + error = copyin_vmspace(vm, base, cp, cnt); + } + if (error) { + break; + } + base += cnt; + len -= cnt; + resid -= cnt; + cp += cnt; + KDASSERT(cnt <= n); + n -= cnt; + } + + return error; +} + +void +uioskip(size_t n, struct uio *uio) +{ + struct iovec *iov; + size_t cnt; + + KASSERTMSG(n <= uio->uio_resid, "n=%zu resid=%zu", n, uio->uio_resid); + + KASSERT(uio->uio_rw == UIO_READ || uio->uio_rw == UIO_WRITE); + while (n > 0 && uio->uio_resid) { + KASSERT(uio->uio_iovcnt > 0); + iov = uio->uio_iov; + cnt = iov->iov_len; + if (cnt == 0) { + KASSERT(uio->uio_iovcnt > 1); + uio->uio_iov++; + uio->uio_iovcnt--; + continue; + } + if (cnt > n) + cnt = n; + iov->iov_base = (char *)iov->iov_base + cnt; + iov->iov_len -= cnt; + uio->uio_resid -= cnt; + uio->uio_offset += cnt; + KDASSERT(cnt <= n); + n -= cnt; + } +} + /* * Give next character to user as result of read. */ diff --git a/sys/kern/tty.c b/sys/kern/tty.c index c92de98ce3d7..4da3e496bfd1 100644 --- a/sys/kern/tty.c +++ b/sys/kern/tty.c @@ -2229,13 +2229,13 @@ ttwrite(struct tty *tp, struct uio *uio, int flag) { u_char *cp; struct proc *p; - int cc, ce, i, hiwat, error; + int cc, cc0, ce, i, hiwat, error; u_char obuf[OBUFSIZ]; cp = NULL; hiwat = tp->t_hiwat; error = 0; - cc = 0; + cc0 = cc = 0; loop: mutex_spin_enter(&tty_lock); if (!CONNECTED(tp)) { @@ -2300,9 +2300,10 @@ ttwrite(struct tty *tp, struct uio *uio, int flag) * leftover from last time. */ if (cc == 0) { - cc = uimin(uio->uio_resid, OBUFSIZ); + uioskip(cc0, uio); + cc0 = cc = uimin(uio->uio_resid, OBUFSIZ); cp = obuf; - error = uiomove(cp, cc, uio); + error = uiopeek(cp, cc, uio); if (error) { cc = 0; goto out; @@ -2373,13 +2374,9 @@ ttwrite(struct tty *tp, struct uio *uio, int flag) } out: - /* - * If cc is nonzero, we leave the uio structure inconsistent, as the - * offset and iov pointers have moved forward, but it doesn't matter - * (the call will either return short or restart with a new uio). - */ KASSERTMSG(error || cc == 0, "error=%d cc=%d", error, cc); - uio->uio_resid += cc; + KASSERTMSG(cc0 >= cc, "cc0=%d cc=%d", cc0, cc); + uioskip(cc0 - cc, uio); return (error); overfull: diff --git a/sys/sys/systm.h b/sys/sys/systm.h index 3585a0e9fbea..b59536f6c079 100644 --- a/sys/sys/systm.h +++ b/sys/sys/systm.h @@ -620,6 +620,8 @@ void trace_exit(register_t, const struct sysent *, const void *, int uiomove(void *, size_t, struct uio *); int uiomove_frombuf(void *, size_t, struct uio *); +int uiopeek(void *, size_t, struct uio *); +void uioskip(size_t, struct uio *); #ifdef _KERNEL int setjmp(label_t *) __returns_twice;