On Saturday, 23 April 2022 06:33:50 CEST Akihiko Odaki wrote:
> On 2022/04/22 23:06, Christian Schoenebeck wrote:
> > On Friday, 22 April 2022 04:43:40 CEST Akihiko Odaki wrote:
> >> On 2022/04/22 0:07, Christian Schoenebeck wrote:
> >>> mknod() on macOS does not support creating sockets, so divert to
> >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> >>> was passed with mode argument.
> >>>
> >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> >>> Signed-off-by: Christian Schoenebeck <qemu_...@crudebyte.com>
> >>> Reviewed-by: Will Cohen <wwco...@gmail.com>
> >>> ---
> >>>
> >>>  hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
> >>>  1 file changed, 26 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> >>> index e24d09763a..39308f2a45 100644
> >>> --- a/hw/9pfs/9p-util-darwin.c
> >>> +++ b/hw/9pfs/9p-util-darwin.c
> >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name,
> >>>   */
> >>>  #if defined CONFIG_PTHREAD_FCHDIR_NP
> >>>
> >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
> >>> +    int fd, err;
> >>> +    struct sockaddr_un addr = {
> >>> +        .sun_family = AF_UNIX
> >>> +    };
> >>> +
> >>> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> >>> +    if (fd == -1) {
> >>> +        return fd;
> >>> +    }
> >>> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> >>
> >> It would result in an incorrect path if the path does not fit in
> >> addr.sun_path. It should report an explicit error instead.
> >
> > Looking at its header file, 'sun_path' is indeed defined on macOS with an
> > oddly small size of only 104 bytes. So yes, I should explicitly handle
> > that error case.
> >
> > I'll post a v3.
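
For illustration, the explicit error handling mentioned above could look
roughly like the following. This is only a minimal sketch and not the v3
patch: the helper name build_socket_path() is hypothetical, and reporting
ENAMETOOLONG is an assumption about which errno would be appropriate. It
relies on snprintf() returning the length it would have written, so a return
value that does not fit into sun_path signals truncation.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Hypothetical helper (not part of the patch): fill addr->sun_path with
 * "./<filename>" and fail with ENAMETOOLONG (an assumed choice of errno)
 * instead of silently truncating the path.
 */
static int build_socket_path(struct sockaddr_un *addr, const char *filename)
{
    int len;

    assert(addr != NULL && filename != NULL);

    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;

    /* snprintf() returns the number of bytes it *would* have written */
    len = snprintf(addr->sun_path, sizeof(addr->sun_path), "./%s", filename);
    if (len < 0 || (size_t) len >= sizeof(addr->sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return 0;
}
```

A caller would run this check before socket() and bind(), and return -1 right
away if it fails, so no file descriptor has to be cleaned up on that path.
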
> >
> >>> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> >>> +    if (err == -1) {
> >>> +        goto out;
> >>
> >> You may close(fd) as soon as bind() returns (before checking the
> >> returned value) and eliminate goto.
> >
> > Yeah, I thought about that alternative, but found it a bit ugly, and
> > probably also counter-productive in case this function might get extended
> > with more error paths in future. Not that I would insist on the current
> > solution though.
>
> I'm happy with the explanation. Thanks.
>
> >>> +    }
> >>> +    err = chmod(addr.sun_path, mode);
> >>
> >> I'm not sure if it is fine to have a time window between bind() and
> >> chmod(). Do you have some rationale?
> >
> > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> > serialized and the 9p server controller portion (9p.c) is only running on
> > QEMU main thread, but the actual filesystem driver calls are then
> > dispatched to QEMU worker threads and therefore running concurrently at
> > this point:
> >
> > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> >
> > Similar situation on Linux 9p client side: it handles access to a mounted
> > 9p filesystem concurrently, requests are then serialized by 9p driver on
> > Linux and sent over wire to 9p server (host).
> >
> > So yes, there might be implications by that short time window. But could
> > that be exploited on macOS hosts in practice?
> >
> > The socket file would have mode srwxr-xr-x for a short moment.
> >
> > For security_model=mapped* this should not be a problem.
> >
> > For security_model=none|passthrough, in theory, maybe? But how likely is
> > that? If you are using a Linux client for instance, trying to brute-force
> > opening the socket file, the client would send several 9p commands
> > (Twalk, Tgetattr, Topen, probably more). The time window of the two
> > commands above should be much smaller than that and I would expect one of
> > the 9p commands to error out in between.
> >
> > What would be a viable approach to avoid this issue on macOS?
>
> It is unlikely that a naive brute-force approach will succeed to
> exploit. The more concerning scenario is that the attacker uses the
> knowledge of the underlying implementation of macOS to cause resource
> contention to widen the window. Whether an exploitation is viable
> depends on how much time you spend digging XNU.
>
> However, I'm also not sure if it really *has* a race condition. Looking
> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
> s->ops->lstat(). It also results in an entity called "path name based
> fid" in the code, which inherently cannot identify a file when it is
> renamed or recreated.
>
> If there is some rationale it is safe, it may also be applied to the
> sequence of bind() and chmod(). Can anyone explain the sequence of
> s->ops->mknod() and s->ops->lstat() or path name based fid in general?

You are talking about the 9p server's controller level. Unfortunately I don't
see anything that would prevent a concurrent open() during this
bind() ... chmod() time window.

Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in
which the new device file shall be created. So 'fidp' is not the device file
here, nor is 'fidp' modified during this function.

Function v9fs_co_mknod() is entered by the 9p server on the QEMU main thread.
At the beginning of the function it first acquires a read lock on a (per 9p
export) global coroutine mutex:

    v9fs_path_read_lock(s);

and holds this lock until returning from function v9fs_co_mknod(). But that's
just a read lock. Function v9fs_co_open() also just acquires a read lock. So
they can happen concurrently.

Then v9fs_co_run_in_worker({...}) is called to dispatch and execute the entire
code block (think of it as an Obj-C "block") inside this (actually a macro) on
a QEMU worker thread. So an arbitrary background thread would then call the fs
driver functions:

    s->ops->mknod()
    v9fs_name_to_path()
    s->ops->lstat()

and then at the end of the code block the background thread would dispatch
back to the QEMU main thread. So when we reach:

    v9fs_path_unlock(s);

we are already back on the QEMU main thread, hence unlocking on the main
thread now and finally leaving function v9fs_co_mknod().

The important thing to understand is that while the
v9fs_co_run_in_worker({...}) code block is executed on a QEMU worker thread,
the QEMU main thread (9p server controller portion, i.e. 9p.c) is *not*
sleeping. The QEMU main thread rather continues to process other (if any)
client requests in the meantime. In other words v9fs_co_run_in_worker()
behaves neither exactly like Apple's GCD dispatch_async(), nor like
dispatch_sync(), as GCD is not coroutine based.

So the 9p server might pull a pending 'Topen' client request from the input
FIFO in the meantime and likewise dispatch that to a worker thread, etc.

Hence a concurrent open() might be possible in theory, but I find it quite
unlikely to succeed in practice: the open() call on the guest is translated by
the Linux client into a series of synchronous 9p requests on the path passed
to open(), and a round trip for each 9p message takes something on the order
of ~0.3ms. That's quite large compared to the time window I would expect
between bind() ... chmod().

Does this answer your questions?

> Regards,
> Akihiko Odaki
>
> >>> +out:
> >>> +    close(fd);
> >>> +    return err;
> >>> +}
> >>> +
> >>>  int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
> >>>  {
> >>>      int preserved_errno, err;
> >>>
> >>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
> >>>      if (pthread_fchdir_np(dirfd) < 0) {
> >>>          return -1;
> >>>      }
> >>> -    err = mknod(filename, mode, dev);
> >>> +    if (S_ISSOCK(mode)) {
> >>> +        err = create_socket_file_at_cwd(filename, mode);
> >>> +    } else {
> >>> +        err = mknod(filename, mode, dev);
> >>> +    }
> >>>      preserved_errno = errno;
> >>>      /* Stop using the thread-local cwd */
> >>>      pthread_fchdir_np(-1);