Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Tue, Aug 04, 2020 at 08:44:42AM +, David Laight wrote: > From: James Bottomley > > Sent: 03 August 2020 16:43 > > > > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote: > > [...] > > > What is wrong with live migration between one qemu process and > > > another qemu process on the same machine not work for this use case? > > > > > > Just reusing live migration would seem to be the simplest path of > > > all, as the code is already implemented. Further if something goes > > > wrong with the live migration you can fallback to the existing > > > process. With exec there is no fallback if the new version does not > > > properly support the handoff protocol of the old version. > > > > Actually, could I ask this another way: the other patch set you sent to > > the KVM list was to snapshot the VM to a PKRAM capsule preserved across > > kexec using zero copy for extremely fast save/restore. The original > > idea was to use this as part of a CRIU based snapshot, kexec to new > > system, restore. However, why can't you do a local snapshot, restart > > qemu, restore using the PKRAM capsule to achieve exactly the same as > > MADV_DOEXEC does but using a system that's easy to reason about? It > > may be slightly slower, but I think we're still talking milliseconds. > > > I've had another idea (that is probably impossible...). > What about a 'reverse mmap' operation. > Something that creates an fd whose contents are a chunk of the > processes address space. http://www.wil.cx/~willy/linux/sileby.html
RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
From: James Bottomley > Sent: 03 August 2020 16:43 > > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote: > [...] > > What is wrong with live migration between one qemu process and > > another qemu process on the same machine not work for this use case? > > > > Just reusing live migration would seem to be the simplest path of > > all, as the code is already implemented. Further if something goes > > wrong with the live migration you can fallback to the existing > > process. With exec there is no fallback if the new version does not > > properly support the handoff protocol of the old version. > > Actually, could I ask this another way: the other patch set you sent to > the KVM list was to snapshot the VM to a PKRAM capsule preserved across > kexec using zero copy for extremely fast save/restore. The original > idea was to use this as part of a CRIU based snapshot, kexec to new > system, restore. However, why can't you do a local snapshot, restart > qemu, restore using the PKRAM capsule to achieve exactly the same as > MADV_DOEXEC does but using a system that's easy to reason about? It > may be slightly slower, but I think we're still talking milliseconds. I've had another idea (that is probably impossible...). What about a 'reverse mmap' operation. Something that creates an fd whose contents are a chunk of the processes address space. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 8/3/2020 11:42 AM, James Bottomley wrote: > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote: > [...] >> What is wrong with live migration between one qemu process and >> another qemu process on the same machine not work for this use case? >> >> Just reusing live migration would seem to be the simplest path of >> all, as the code is already implemented. Further if something goes >> wrong with the live migration you can fallback to the existing >> process. With exec there is no fallback if the new version does not >> properly support the handoff protocol of the old version. > > Actually, could I ask this another way: the other patch set you sent to > the KVM list was to snapshot the VM to a PKRAM capsule preserved across > kexec using zero copy for extremely fast save/restore. The original > idea was to use this as part of a CRIU based snapshot, kexec to new > system, restore. However, why can't you do a local snapshot, restart > qemu, restore using the PKRAM capsule to achieve exactly the same as > MADV_DOEXEC does but using a system that's easy to reason about? It > may be slightly slower, but I think we're still talking milliseconds. Hi James, good to hear from you. PKRAM or SysV shm could be used for a restart in that manner, but it would only support sriov guests if the guest exports an agent that supports suspend-to-ram, and if all guest drivers support the suspend-to-ram method. I have done this using a linux guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec. With MADV_DOEXEC, pause time is 100 - 200 msec. The pause time is a handful of seconds if the guest uses an nvme drive because CC.SHN takes so long to persist metadata to stable storage. We could instead pass vfio descriptors from the old process to a 3rd party escrow process and pass them back to the new qemu process, but the shm that vfio has already registered must be remapped at the same VA as the previous process, and there is no interface to guarantee that. MAP_FIXED blows away existing mappings and breaks the app. MAP_FIXED_NOREPLACE respects existing mappings but cannot map the shm and breaks the app. Adding a feature that reserves VAs would fix that, we have experimnted with one. Fixing the vfio kernel implementation to not use the original VA base would also work, but I don't know how doable/difficult that would be. Both solutions would require a qemu instance to be stopped and relaunched using shm as guest ram, and its guest rebooted, so they do not let us update legacy already-running instances that use anon memory. That problem solves itself if we get these rfe's into linux and qemu, and eventually users shut down the legacy instances, but that takes years and we need to do it sooner. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 8/3/2020 11:28 AM, ebied...@xmission.com wrote: > Steven Sistare writes: >> On 7/30/2020 5:58 PM, ebied...@xmission.com wrote: >>> Here is another suggestion. >>> >>> Have a very simple program that does: >>> >>> for (;;) { >>> handle = dlopen("/my/real/program"); >>> real_main = dlsym(handle, "main"); >>> real_main(argc, argv, envp); >>> dlclose(handle); >>> } >>> >>> With whatever obvious adjustments are needed to fit your usecase. >>> >>> That should give the same level of functionality, be portable to all >>> unices, and not require you to duplicate code. It belive it limits you >>> to not upgrading libc, or librt but that is a comparatively small >>> limitation. >>> >>> >>> Given that in general the interesting work is done in userspace and that >>> userspace has provided an interface for reusing that work already. >>> I don't see the justification for adding anything to exec at this point. >> >> Thanks for the suggestion. That is clever, and would make a fun project, >> but I would not trust it for production. These few lines are just >> the first of many that it would take to reset the environment to the >> well-defined post-exec initial conditions that all executables expect, >> and incrementally tearing down state will be prone to bugs. > > Agreed. > >> Getting a clean slate from a kernel exec is a much more reliable >> design. > > Except you are explicitly throwing that out the window, by preserving > VMAs. You very much need to have a clean bug free shutdown to pass VMAs > reliably. Sure. The whole community relies on you and others to provide a bug free exec. >> The use case is creating long-lived apps that never go down, and the >> simplest implementation will have the fewest bugs and is the best. >> MADV_DOEXEC is simple, and does not even require a new system call, >> and the kernel already knows how to exec without bugs. > > *ROFL* I wish the kernel knew how to exec things without bugs. Essentially you are saying you would argue against any enhancement to exec. Surely that is too high a bar. We must continue to evolve an innovate and balance risk against reward. This use case matters to our business a lot, and to others as well, see below. That is the reward. I feel you are overstating the risk. Surely there is some early point in the development cycle of some release where this can be integrated and get enough test time and soak time to be proven reliable. > The bugs are hard to hit but the ones I am aware of are not straight > forward to fix. > > MADV_DOEXEC is not conceptually simple. It completely violates the > guarantees that exec is known to make about the contents of the memory > of the new process. This makes it very difficult to reason about. I have having trouble see the difficulty. Perhaps I am too familar with it, but the semantics are few and easy to explain, and it does not introduce new concepts: the post-exec process is born with a few more mappings than previously, and non-fixed further mmaps choose addresses in the holes. > Nor > will MADV_DOEXEC be tested very much as it has only one or two users. > Which means in the fullness of time it is likely someone will change > something that will break the implementation subtlely and the bug report > probably won't come in for 3 years, or maybe a decade. At which point > it won't be clear if the bug even can be fixed as something else might > rely on it. That's on us; we need to provide kernel tests and be diligent about testing new releases. This matters to our business and we will do so. > What is wrong with live migration between one qemu process and another > qemu process on the same machine not work for this use case? > > Just reusing live migration would seem to be the simplest path of all, > as the code is already implemented. Further if something goes wrong > with the live migration you can fallback to the existing process. With > exec there is no fallback if the new version does not properly support > the handoff protocol of the old version. This is less resource intensive than live migration. The latter ties up two hosts, consumes lots of memory and network bandwidth, may take a long time to converge on a busy system, and is unfeasible for guests with a huge amount of local storeage (which we call dense I/O shapes). Live update takes less than 1 second total, and the guest pause time is 100 - 200 msecs. It is a very attractive solution that other cloud vendors have implemented as well, with their own private modifications to exec and and fork. We have been independently working in this area, and we are offering our implementation to the community. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote: [...] > What is wrong with live migration between one qemu process and > another qemu process on the same machine not work for this use case? > > Just reusing live migration would seem to be the simplest path of > all, as the code is already implemented. Further if something goes > wrong with the live migration you can fallback to the existing > process. With exec there is no fallback if the new version does not > properly support the handoff protocol of the old version. Actually, could I ask this another way: the other patch set you sent to the KVM list was to snapshot the VM to a PKRAM capsule preserved across kexec using zero copy for extremely fast save/restore. The original idea was to use this as part of a CRIU based snapshot, kexec to new system, restore. However, why can't you do a local snapshot, restart qemu, restore using the PKRAM capsule to achieve exactly the same as MADV_DOEXEC does but using a system that's easy to reason about? It may be slightly slower, but I think we're still talking milliseconds. James
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Steven Sistare writes: > On 7/30/2020 5:58 PM, ebied...@xmission.com wrote: >> Here is another suggestion. >> >> Have a very simple program that does: >> >> for (;;) { >> handle = dlopen("/my/real/program"); >> real_main = dlsym(handle, "main"); >> real_main(argc, argv, envp); >> dlclose(handle); >> } >> >> With whatever obvious adjustments are needed to fit your usecase. >> >> That should give the same level of functionality, be portable to all >> unices, and not require you to duplicate code. It belive it limits you >> to not upgrading libc, or librt but that is a comparatively small >> limitation. >> >> >> Given that in general the interesting work is done in userspace and that >> userspace has provided an interface for reusing that work already. >> I don't see the justification for adding anything to exec at this point. > > Thanks for the suggestion. That is clever, and would make a fun project, > but I would not trust it for production. These few lines are just > the first of many that it would take to reset the environment to the > well-defined post-exec initial conditions that all executables expect, > and incrementally tearing down state will be prone to bugs. Agreed. > Getting a clean slate from a kernel exec is a much more reliable > design. Except you are explicitly throwing that out the window, by preserving VMAs. You very much need to have a clean bug free shutdown to pass VMAs reliably. > The use case is creating long-lived apps that never go down, and the > simplest implementation will have the fewest bugs and is the best. > MADV_DOEXEC is simple, and does not even require a new system call, > and the kernel already knows how to exec without bugs. *ROFL* I wish the kernel knew how to exec things without bugs. The bugs are hard to hit but the ones I am aware of are not straight forward to fix. MADV_DOEXEC is not conceptually simple. It completely violates the guarantees that exec is known to make about the contents of the memory of the new process. This makes it very difficult to reason about. Nor will MADV_DOEXEC be tested very much as it has only one or two users. Which means in the fullness of time it is likely someone will change something that will break the implementation subtlely and the bug report probably won't come in for 3 years, or maybe a decade. At which point it won't be clear if the bug even can be fixed as something else might rely on it. What is wrong with live migration between one qemu process and another qemu process on the same machine not work for this use case? Just reusing live migration would seem to be the simplest path of all, as the code is already implemented. Further if something goes wrong with the live migration you can fallback to the existing process. With exec there is no fallback if the new version does not properly support the handoff protocol of the old version. Eric
RE: [RFC PATCH 0/5] madvise MADV_DOEXEC
> Maybe. We still need to preserve an anonymous segment, though. MADV_DOEXEC, > or mshare, > or something else. And I think the ability to preserve memory containing > pointers to itself > is an interesting use case, though not ours. Why does all this remind me of the old sendmail code. After parsing the config file it used to write its entire data area out to a file. On restart, if the config file hadn't changed it would read it back it. It sort of worked - until shared libraries came along. Then it had a habit of generating random crashes. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/27/2020 1:11 PM, Anthony Yznaga wrote: > This patchset adds support for preserving an anonymous memory range across > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > sharing memory in this manner, as opposed to re-attaching to a named shared > memory segment, is to ensure it is mapped at the same virtual address in > the new process as it was in the old one. An intended use for this is to > preserve guest memory for guests using vfio while qemu exec's an updated > version of itself. By ensuring the memory is preserved at a fixed address, > vfio mappings and their associated kernel data structures can remain valid. > In addition, for the qemu use case, qemu instances that back guest RAM with > anonymous memory can be updated. I forgot to mention, our use case is not just theoretical. It has been implemented and is pretty cool (but I am biased). The pause time for the guest is in the 100 - 200 msec range. We submitted qemu patches for review based on the MADV_DOEXEC proposal. In case you are curious: https://lore.kernel.org/qemu-devel/1596122076-341293-1-git-send-email-steven.sist...@oracle.com/ - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/31/2020 1:48 PM, Jason Gunthorpe wrote: > On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote: >> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote: >>> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote: > Your preservation-across-exec use-case might or might not need the > VMA to be mapped at the same address. It does. qemu registers memory with vfio which remembers the va's in kernel metadata for the device. >>> >>> Once the memory is registered with vfio the VA doesn't matter, vfio >>> will keep the iommu pointing at the same physical pages no matter >>> where they are mapped. >> >> Yes, but there are other code paths that compute and use offsets between va >> and the >> base va. Mapping at a different va in the new process breaks vfio; I have >> tried it. > > Maybe you could fix vfio instead of having this adventure, if vfio is > the only motivation. Maybe. We still need to preserve an anonymous segment, though. MADV_DOEXEC, or mshare, or something else. And I think the ability to preserve memory containing pointers to itself is an interesting use case, though not ours. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote: > On 7/31/2020 12:56 PM, Jason Gunthorpe wrote: > > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote: > >>> Your preservation-across-exec use-case might or might not need the > >>> VMA to be mapped at the same address. > >> > >> It does. qemu registers memory with vfio which remembers the va's in > >> kernel > >> metadata for the device. > > > > Once the memory is registered with vfio the VA doesn't matter, vfio > > will keep the iommu pointing at the same physical pages no matter > > where they are mapped. > > Yes, but there are other code paths that compute and use offsets between va > and the > base va. Mapping at a different va in the new process breaks vfio; I have > tried it. Maybe you could fix vfio instead of having this adventure, if vfio is the only motivation. Jason
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote: > On 7/31/2020 11:27 AM, Matthew Wilcox wrote: > > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote: > >> Matthews sileby/mshare proposal has the same issue. If a process opts-in > >> and mmap's an address in the shared region, then content becomes mapped at > >> a VA that was known to the pre-fork or pre-exec process. Trust must still > >> be established. > > > > It's up to the recipient whether they try to map it at the same address > > or at a fresh address. The intended use case is a "semi-shared" address > > space between two processes (ie partway between a threaded, fully-shared > > address space and a forked un-shared address space), in which case > > there's a certain amount of trust and cooperation between the processes. > > Understood, but if the recipient does map at any of the same, which is the > whole > point because you want to share the page table. The trust relationship is no > different than for the live update case. You don't have to map at the same address to share the page tables. For example, on x86 if you share an 8GB region, that must be aligned at 1GB in both the donor and the recipient, but they need not be mapped at the same address. > > It's a net increase of 200 lines of kernel code. If 4 lines of userspace > > code removes 200 lines of kernel code, I think I know which I prefer ... > > It will be *far* more than 4 lines. > Much of the 200 lines is mostly for the elf opt in, and much of the elf code > is from > anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment > setup. It doesn't really matter how much of it is for the opt-in and how much is for the exec path itself. The MAP_FIXED_NOREPLACE patch is only net +16 lines, so that's not the problem.
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/31/2020 12:56 PM, Jason Gunthorpe wrote: > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote: >>> Your preservation-across-exec use-case might or might not need the >>> VMA to be mapped at the same address. >> >> It does. qemu registers memory with vfio which remembers the va's in kernel >> metadata for the device. > > Once the memory is registered with vfio the VA doesn't matter, vfio > will keep the iommu pointing at the same physical pages no matter > where they are mapped. Yes, but there are other code paths that compute and use offsets between va and the base va. Mapping at a different va in the new process breaks vfio; I have tried it. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote: > > Your preservation-across-exec use-case might or might not need the > > VMA to be mapped at the same address. > > It does. qemu registers memory with vfio which remembers the va's in kernel > metadata for the device. Once the memory is registered with vfio the VA doesn't matter, vfio will keep the iommu pointing at the same physical pages no matter where they are mapped. Jason
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/31/2020 11:27 AM, Matthew Wilcox wrote: > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote: >> Matthews sileby/mshare proposal has the same issue. If a process opts-in >> and mmap's an address in the shared region, then content becomes mapped at >> a VA that was known to the pre-fork or pre-exec process. Trust must still >> be established. > > It's up to the recipient whether they try to map it at the same address > or at a fresh address. The intended use case is a "semi-shared" address > space between two processes (ie partway between a threaded, fully-shared > address space and a forked un-shared address space), in which case > there's a certain amount of trust and cooperation between the processes. Understood, but if the recipient does map at any of the same, which is the whole point because you want to share the page table. The trust relationship is no different than for the live update case. > Your preservation-across-exec use-case might or might not need the > VMA to be mapped at the same address. It does. qemu registers memory with vfio which remembers the va's in kernel metadata for the device. > I don't know whether qemu stores > pointers in this VMA which are absolute within the qemu address space. > If it's just the emulated process's address space, then everything will > be absolute within its own address space and everything will be opaque > to qemu. If qemu is storing its own pointers in it, then it has to be > mapped at the same address. qemu does not do the latter but that could be a nice way for apps to use preserved memory. >>> Here is another suggestion. >>> >>> Have a very simple program that does: >>> >>> for (;;) { >>> handle = dlopen("/my/real/program"); >>> real_main = dlsym(handle, "main"); >>> real_main(argc, argv, envp); >>> dlclose(handle); >>> } >>> >>> With whatever obvious adjustments are needed to fit your usecase. >>> >>> That should give the same level of functionality, be portable to all >>> unices, and not require you to duplicate code. It belive it limits you >>> to not upgrading libc, or librt but that is a comparatively small >>> limitation. >>> >>> >>> Given that in general the interesting work is done in userspace and that >>> userspace has provided an interface for reusing that work already. >>> I don't see the justification for adding anything to exec at this point. >> >> Thanks for the suggestion. That is clever, and would make a fun project, >> but I would not trust it for production. These few lines are just >> the first of many that it would take to reset the environment to the >> well-defined post-exec initial conditions that all executables expect, >> and incrementally tearing down state will be prone to bugs. Getting a >> clean slate from a kernel exec is a much more reliable design. The use >> case is creating long-lived apps that never go down, and the simplest >> implementation will have the fewest bugs and is the best. MADV_DOEXEC is >> simple, and does not even require a new system call, and the kernel already >> knows how to exec without bugs. > > It's a net increase of 200 lines of kernel code. If 4 lines of userspace > code removes 200 lines of kernel code, I think I know which I prefer ... It will be *far* more than 4 lines. Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote: > Matthews sileby/mshare proposal has the same issue. If a process opts-in > and mmap's an address in the shared region, then content becomes mapped at > a VA that was known to the pre-fork or pre-exec process. Trust must still > be established. It's up to the recipient whether they try to map it at the same address or at a fresh address. The intended use case is a "semi-shared" address space between two processes (ie partway between a threaded, fully-shared address space and a forked un-shared address space), in which case there's a certain amount of trust and cooperation between the processes. Your preservation-across-exec use-case might or might not need the VMA to be mapped at the same address. I don't know whether qemu stores pointers in this VMA which are absolute within the qemu address space. If it's just the emulated process's address space, then everything will be absolute within its own address space and everything will be opaque to qemu. If qemu is storing its own pointers in it, then it has to be mapped at the same address. > > Here is another suggestion. > > > > Have a very simple program that does: > > > > for (;;) { > > handle = dlopen("/my/real/program"); > > real_main = dlsym(handle, "main"); > > real_main(argc, argv, envp); > > dlclose(handle); > > } > > > > With whatever obvious adjustments are needed to fit your usecase. > > > > That should give the same level of functionality, be portable to all > > unices, and not require you to duplicate code. It belive it limits you > > to not upgrading libc, or librt but that is a comparatively small > > limitation. > > > > > > Given that in general the interesting work is done in userspace and that > > userspace has provided an interface for reusing that work already. > > I don't see the justification for adding anything to exec at this point. > > Thanks for the suggestion. That is clever, and would make a fun project, > but I would not trust it for production. These few lines are just > the first of many that it would take to reset the environment to the > well-defined post-exec initial conditions that all executables expect, > and incrementally tearing down state will be prone to bugs. Getting a > clean slate from a kernel exec is a much more reliable design. The use > case is creating long-lived apps that never go down, and the simplest > implementation will have the fewest bugs and is the best. MADV_DOEXEC is > simple, and does not even require a new system call, and the kernel already > knows how to exec without bugs. It's a net increase of 200 lines of kernel code. If 4 lines of userspace code removes 200 lines of kernel code, I think I know which I prefer ...
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/30/2020 5:58 PM, ebied...@xmission.com wrote: > Steven Sistare writes: >> On 7/30/2020 1:49 PM, Matthew Wilcox wrote: >>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote: mshare + VA reservation is another possible solution. Or MADV_DOEXEC alone, which is ready now. I hope we can get back to reviewing that. >>> >>> We are. This is the part of the review process where we explore other >>> solutions to the problem. >>> >> Also, we need to support updating legacy processes that already created >> anon segments. >> We inject code that calls MADV_DOEXEC for such segments. > > Yes, I was assuming you'd inject code that called mshare(). OK, mshare works on existing memory and builds a new vma. >>> >>> Actually, reparents an existing VMA, and reuses the existing page tables. >>> > Actually, since you're injecting code, why do you need the kernel to > be involved? You can mmap the new executable and any libraries it depends > upon, set up a new stack and jump to the main() entry point, all without > calling exec(). I appreciate it'd be a fair amount of code, but it'd all > be in userspace and you can probably steal / reuse code from ld.so (I'm > not familiar with the details of how setting up an executable is done). Duplicating all the work that the kernel and loader do to exec a process would be error prone, require ongoing maintenance, and be redundant. Better to define a small kernel extension and leave exec to the kernel. >>> >>> Either this is a one-off kind of thing, in which case it doesn't need >>> ongoing maintenance, or it's something with broad applicability, in >>> which case it can live as its own userspace project. It could even >>> start off life as part of qemu and then fork into its own project. >> >> exec will be enhanced over time in the kernel. A separate user space >> implementation >> would need to track that. >> >> Reimplementing exec in userland would be a big gross mess. Not a good >> solution when >> we have simple and concise ways of solving the problem. > > The interesting work is not done in exec. The interesting work of > loading an executable is done in ld.so, and ld.so lives in userspace. And yet a smart guy named Eric said: "There is just enough complexity in exec currently that our implementation of exec is already teetering." which suggests that a user-space re-implementation will also be complex. In complexity lies bugs. >>> The idea of tagging an ELF executable to say "I can cope with having >>> chunks of my address space provided to me by my executor" is ... odd. >> >> I don't disagree. But it is useful. We already pass a block of data >> containing >> environment variables and arguments from one process to the next. >> Preserving >> additional segments is not a big leap from there. > > But we randomize where that block lives. > > It has been found to be very useful to have randomization by default and > you are arguing against that kind of randomization. In the normal use case the addresses of the preserved anon segments were already randomized during allocation in the pre-exec process. Preservation across exec does not make the process less safe from an external attack. Now, if the attacker injects content at known addresses by preserving and exec'ing, then the post-exec process must decide whether to trust the pre-exec process. Let's prevent preservation when exec'ing setuid binaries. That is not part of this patch series, but is a good idea. So, uid does not change across exec. Do I trust myself? Yes I do. If the uid is trusted, as it must be in the qemu use case which accesses system resources, then that is sufficient. If the exec raises caps, and the uid is not trusted, then the post-exec process must do additional work to verify the pre-exec credentials. The method is left to the app. We defined the ELF opt-in so that processes that do not use preserved memory will never receive any and will never need to deal with such verification. Matthews sileby/mshare proposal has the same issue. If a process opts-in and mmap's an address in the shared region, then content becomes mapped at a VA that was known to the pre-fork or pre-exec process. Trust must still be established. > Here is another suggestion. > > Have a very simple program that does: > > for (;;) { > handle = dlopen("/my/real/program"); > real_main = dlsym(handle, "main"); > real_main(argc, argv, envp); > dlclose(handle); > } > > With whatever obvious adjustments are needed to fit your usecase. > > That should give the same level of functionality, be portable to all > unices, and not require you to duplicate code. It belive it limits you > to not upgrading libc, or librt but that is a comparatively small > limitation. > > > Given that in general the interesting work is
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Hi, Sileby looks interesting! I had just written up the following idea which seems similar but includes a mechanism for revoking mappings. Alexander Graf recently brought up an idea that solves the following problem: When process A passes shared memory file descriptors to process B there is no way for process A to revoke access or change page protection bits after passing the fd. I'll describe the idea (not sure if it's exactly what Alexander had in mind). Memory view driver -- The memory view driver allows process A to control the page table entries of an mmap in process B. It is a character device driver that process A opens like this: int fd = open("/dev/memory-view", O_RDWR); This returns a file descriptor to a new memory view. Next process A sets the size of the memory view: ftruncate(fd, 16 * GiB); The size determines how large the memory view will be. The size is a virtual memory concept and does not consume resources (there is no physical memory backing this). Process A populates the memory view with ranges from file descriptors it wishes to share. The file descriptor can be a shared memory file descriptor: int memfd = memfd_create("guest-ram, 0); ftruncate(memfd, 32 * GiB); /* Map [8GB, 10GB) at 8GB into the memory view */ struct memview_map_fd_info = { .fd = memfd, .fd_offset = 8 * GiB, .size = 2 * GiB, .mem_offset = 8 * GiB, .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE, }; ioctl(fd, MEMVIEW_MAP_FD, _fd_info); It is also possible to populate the memory view from the page cache: int filefd = open("big-file.iso", O_RDONLY); /* Map [4GB, 12GB) at 0B into the memory view */ struct memview_map_fd_info = { .fd = filefd, .fd_offset = 4 * GiB, .size = 8 * GiB, .mem_offset = 0, .flags = MEMVIEW_MAP_READ, }; ioctl(fd, MEMVIEW_MAP_FD, _fd_info); The memory view has now been populated like this: Range (GiB) Fd Permissions 0-8 big-file.iso read 8-10 guest-ramread+write 10-16 Now process A gets the "view" file descriptor for this memory view. The view file descriptor does not allow ioctls. It can be safely passed to process B in the knowledge that process B can only mmap or close it: int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD); ...pass viewfd to process B... Process B receives viewfd and mmaps it: void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED, viewfd, 0); When process B accesses a page in the mmap region the memory view driver resolves the page fault by checking if the page is mapped to an fd and what its permissions are. For example, accessing the page at 4GB from the start of the memory view is an access at 8GB into big-file.iso. That's because 8GB of big-file.iso was mapped at 0 with fd_offset 4GB. To summarize, there is one vma in process B and the memory view driver maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by process A. Page protection bits are the AND of the mmap PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the mapping in question. Does vmf_insert_mixed_prot() or a similar Linux API allow this? Can the memory view driver map pages from fds without pinning the pages? Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires zapping affected process B ptes. When process B accesses those pages again the fault handler will handle the page fault based on the latest memory view layout. If process B accesses a page with incorrect permissions or that has not been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is raised. When process B uses mprotect(2) and other virtual memory syscalls it is unable to increase page permissions. Instead it can only reduce them because the pte protection bits are the AND of the mmap flags and the memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags. Use cases - How to use the memory view driver for vhost-user vhost-user and other out-of-process device emulation interfaces need a way for the VMM to enforce the IOMMU mappings on the device emulation process. Today the VMM passes all guest RAM fds to the device emulation process and has no way of restricting access or revoking it later. With the memory view driver the VMM will pass one or more memory view fds instead of the actual guest RAM fds. This allows the VMM to invoke ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke access. How to use the memory view driver for virtio-fs ~~~ The virtiofsd vhost-user process creates a memory view for the device's DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory region so that the
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Steven Sistare writes: > On 7/30/2020 1:49 PM, Matthew Wilcox wrote: >> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote: >>> mshare + VA reservation is another possible solution. >>> >>> Or MADV_DOEXEC alone, which is ready now. I hope we can get back to >>> reviewing that. >> >> We are. This is the part of the review process where we explore other >> solutions to the problem. >> > Also, we need to support updating legacy processes that already created > anon segments. > We inject code that calls MADV_DOEXEC for such segments. Yes, I was assuming you'd inject code that called mshare(). >>> >>> OK, mshare works on existing memory and builds a new vma. >> >> Actually, reparents an existing VMA, and reuses the existing page tables. >> Actually, since you're injecting code, why do you need the kernel to be involved? You can mmap the new executable and any libraries it depends upon, set up a new stack and jump to the main() entry point, all without calling exec(). I appreciate it'd be a fair amount of code, but it'd all be in userspace and you can probably steal / reuse code from ld.so (I'm not familiar with the details of how setting up an executable is done). >>> >>> Duplicating all the work that the kernel and loader do to exec a process >>> would >>> be error prone, require ongoing maintenance, and be redundant. Better to >>> define >>> a small kernel extension and leave exec to the kernel. >> >> Either this is a one-off kind of thing, in which case it doesn't need >> ongoing maintenance, or it's something with broad applicability, in >> which case it can live as its own userspace project. It could even >> start off life as part of qemu and then fork into its own project. > > exec will be enhanced over time in the kernel. A separate user space > implementation > would need to track that. > > Reimplementing exec in userland would be a big gross mess. Not a good > solution when > we have simple and concise ways of solving the problem. The interesting work is not done in exec. The interesting work of loading an executable is done in ld.so, and ld.so lives in userspace. >> The idea of tagging an ELF executable to say "I can cope with having >> chunks of my address space provided to me by my executor" is ... odd. > > I don't disagree. But it is useful. We already pass a block of data > containing > environment variables and arguments from one process to the next. Preserving > additional segments is not a big leap from there. But we randomize where that block lives. It has been found to be very useful to have randomization by default and you are arguing against that kind of randomization. Here is another suggestion. Have a very simple program that does: for (;;) { handle = dlopen("/my/real/program"); real_main = dlsym(handle, "main"); real_main(argc, argv, envp); dlclose(handle); } With whatever obvious adjustments are needed to fit your usecase. That should give the same level of functionality, be portable to all unices, and not require you to duplicate code. It belive it limits you to not upgrading libc, or librt but that is a comparatively small limitation. Given that in general the interesting work is done in userspace and that userspace has provided an interface for reusing that work already. I don't see the justification for adding anything to exec at this point. Eric
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/30/2020 1:49 PM, Matthew Wilcox wrote: > On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote: >> mshare + VA reservation is another possible solution. >> >> Or MADV_DOEXEC alone, which is ready now. I hope we can get back to >> reviewing that. > > We are. This is the part of the review process where we explore other > solutions to the problem. > Also, we need to support updating legacy processes that already created anon segments. We inject code that calls MADV_DOEXEC for such segments. >>> >>> Yes, I was assuming you'd inject code that called mshare(). >> >> OK, mshare works on existing memory and builds a new vma. > > Actually, reparents an existing VMA, and reuses the existing page tables. > >>> Actually, since you're injecting code, why do you need the kernel to >>> be involved? You can mmap the new executable and any libraries it depends >>> upon, set up a new stack and jump to the main() entry point, all without >>> calling exec(). I appreciate it'd be a fair amount of code, but it'd all >>> be in userspace and you can probably steal / reuse code from ld.so (I'm >>> not familiar with the details of how setting up an executable is done). >> >> Duplicating all the work that the kernel and loader do to exec a process >> would >> be error prone, require ongoing maintenance, and be redundant. Better to >> define >> a small kernel extension and leave exec to the kernel. > > Either this is a one-off kind of thing, in which case it doesn't need > ongoing maintenance, or it's something with broad applicability, in > which case it can live as its own userspace project. It could even > start off life as part of qemu and then fork into its own project. exec will be enhanced over time in the kernel. A separate user space implementation would need to track that. Reimplementing exec in userland would be a big gross mess. Not a good solution when we have simple and concise ways of solving the problem. > The idea of tagging an ELF executable to say "I can cope with having > chunks of my address space provided to me by my executor" is ... odd. I don't disagree. But it is useful. We already pass a block of data containing environment variables and arguments from one process to the next. Preserving additional segments is not a big leap from there. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote: > mshare + VA reservation is another possible solution. > > Or MADV_DOEXEC alone, which is ready now. I hope we can get back to > reviewing that. We are. This is the part of the review process where we explore other solutions to the problem. > >> Also, we need to support updating legacy processes that already created > >> anon segments. > >> We inject code that calls MADV_DOEXEC for such segments. > > > > Yes, I was assuming you'd inject code that called mshare(). > > OK, mshare works on existing memory and builds a new vma. Actually, reparents an existing VMA, and reuses the existing page tables. > > Actually, since you're injecting code, why do you need the kernel to > > be involved? You can mmap the new executable and any libraries it depends > > upon, set up a new stack and jump to the main() entry point, all without > > calling exec(). I appreciate it'd be a fair amount of code, but it'd all > > be in userspace and you can probably steal / reuse code from ld.so (I'm > > not familiar with the details of how setting up an executable is done). > > Duplicating all the work that the kernel and loader do to exec a process would > be error prone, require ongoing maintenance, and be redundant. Better to > define > a small kernel extension and leave exec to the kernel. Either this is a one-off kind of thing, in which case it doesn't need ongoing maintenance, or it's something with broad applicability, in which case it can live as its own userspace project. It could even start off life as part of qemu and then fork into its own project. The idea of tagging an ELF executable to say "I can cope with having chunks of my address space provided to me by my executor" is ... odd.
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/30/2020 1:12 PM, Matthew Wilcox wrote: > On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote: >> On 7/30/2020 11:22 AM, Matthew Wilcox wrote: >>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: This patchset adds support for preserving an anonymous memory range across exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for sharing memory in this manner, as opposed to re-attaching to a named shared memory segment, is to ensure it is mapped at the same virtual address in the new process as it was in the old one. An intended use for this is to preserve guest memory for guests using vfio while qemu exec's an updated version of itself. By ensuring the memory is preserved at a fixed address, vfio mappings and their associated kernel data structures can remain valid. In addition, for the qemu use case, qemu instances that back guest RAM with anonymous memory can be updated. >>> >>> I just realised that something else I'm working on might be a suitable >>> alternative to this. Apologies for not realising it sooner. >>> >>> http://www.wil.cx/~willy/linux/sileby.html >>> >>> To use this, you'd mshare() the anonymous memory range, essentially >>> detaching the VMA from the current process's mm_struct and reparenting >>> it to this new mm_struct, which has an fd referencing it. >>> >>> Then you call exec(), and the exec'ed task gets to call mmap() on that >>> new fd to attach the memory range to its own address space. >>> >>> Presto! >> >> To be suitable for the qemu use case, we need a guarantee that the same VA >> range >> is available in the new process, with nothing else mapped there. From your >> spec, >> it sounds like the new process could do a series of unrelated mmap's which >> could >> overlap the desired va range before the silby mmap(fd) is performed?? > > That could happen. eg libc might get its text segment mapped there > randomly. I believe Khalid was working on a solution for reserving > memory ranges. mshare + VA reservation is another possible solution. Or MADV_DOEXEC alone, which is ready now. I hope we can get back to reviewing that. >> Also, we need to support updating legacy processes that already created anon >> segments. >> We inject code that calls MADV_DOEXEC for such segments. > > Yes, I was assuming you'd inject code that called mshare(). OK, mshare works on existing memory and builds a new vma. > Actually, since you're injecting code, why do you need the kernel to > be involved? You can mmap the new executable and any libraries it depends > upon, set up a new stack and jump to the main() entry point, all without > calling exec(). I appreciate it'd be a fair amount of code, but it'd all > be in userspace and you can probably steal / reuse code from ld.so (I'm > not familiar with the details of how setting up an executable is done). Duplicating all the work that the kernel and loader do to exec a process would be error prone, require ongoing maintenance, and be redundant. Better to define a small kernel extension and leave exec to the kernel. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote: > On 7/30/2020 11:22 AM, Matthew Wilcox wrote: > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: > >> This patchset adds support for preserving an anonymous memory range across > >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > >> sharing memory in this manner, as opposed to re-attaching to a named shared > >> memory segment, is to ensure it is mapped at the same virtual address in > >> the new process as it was in the old one. An intended use for this is to > >> preserve guest memory for guests using vfio while qemu exec's an updated > >> version of itself. By ensuring the memory is preserved at a fixed address, > >> vfio mappings and their associated kernel data structures can remain valid. > >> In addition, for the qemu use case, qemu instances that back guest RAM with > >> anonymous memory can be updated. > > > > I just realised that something else I'm working on might be a suitable > > alternative to this. Apologies for not realising it sooner. > > > > http://www.wil.cx/~willy/linux/sileby.html > > > > To use this, you'd mshare() the anonymous memory range, essentially > > detaching the VMA from the current process's mm_struct and reparenting > > it to this new mm_struct, which has an fd referencing it. > > > > Then you call exec(), and the exec'ed task gets to call mmap() on that > > new fd to attach the memory range to its own address space. > > > > Presto! > > To be suitable for the qemu use case, we need a guarantee that the same VA > range > is available in the new process, with nothing else mapped there. From your > spec, > it sounds like the new process could do a series of unrelated mmap's which > could > overlap the desired va range before the silby mmap(fd) is performed?? That could happen. eg libc might get its text segment mapped there randomly. I believe Khalid was working on a solution for reserving memory ranges. > Also, we need to support updating legacy processes that already created anon > segments. > We inject code that calls MADV_DOEXEC for such segments. Yes, I was assuming you'd inject code that called mshare(). Actually, since you're injecting code, why do you need the kernel to be involved? You can mmap the new executable and any libraries it depends upon, set up a new stack and jump to the main() entry point, all without calling exec(). I appreciate it'd be a fair amount of code, but it'd all be in userspace and you can probably steal / reuse code from ld.so (I'm not familiar with the details of how setting up an executable is done).
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/30/2020 11:22 AM, Matthew Wilcox wrote: > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: >> This patchset adds support for preserving an anonymous memory range across >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for >> sharing memory in this manner, as opposed to re-attaching to a named shared >> memory segment, is to ensure it is mapped at the same virtual address in >> the new process as it was in the old one. An intended use for this is to >> preserve guest memory for guests using vfio while qemu exec's an updated >> version of itself. By ensuring the memory is preserved at a fixed address, >> vfio mappings and their associated kernel data structures can remain valid. >> In addition, for the qemu use case, qemu instances that back guest RAM with >> anonymous memory can be updated. > > I just realised that something else I'm working on might be a suitable > alternative to this. Apologies for not realising it sooner. > > http://www.wil.cx/~willy/linux/sileby.html > > To use this, you'd mshare() the anonymous memory range, essentially > detaching the VMA from the current process's mm_struct and reparenting > it to this new mm_struct, which has an fd referencing it. > > Then you call exec(), and the exec'ed task gets to call mmap() on that > new fd to attach the memory range to its own address space. > > Presto! To be suitable for the qemu use case, we need a guarantee that the same VA range is available in the new process, with nothing else mapped there. From your spec, it sounds like the new process could do a series of unrelated mmap's which could overlap the desired va range before the silby mmap(fd) is performed?? Also, we need to support updating legacy processes that already created anon segments. We inject code that calls MADV_DOEXEC for such segments. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Thu, Jul 30, 2020 at 04:34:50PM +0100, Matthew Wilcox wrote: > On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote: > > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote: > > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: > > > > This patchset adds support for preserving an anonymous memory range > > > > across > > > > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit > > > > for > > > > sharing memory in this manner, as opposed to re-attaching to a named > > > > shared > > > > memory segment, is to ensure it is mapped at the same virtual address in > > > > the new process as it was in the old one. An intended use for this is > > > > to > > > > preserve guest memory for guests using vfio while qemu exec's an updated > > > > version of itself. By ensuring the memory is preserved at a fixed > > > > address, > > > > vfio mappings and their associated kernel data structures can remain > > > > valid. > > > > In addition, for the qemu use case, qemu instances that back guest RAM > > > > with > > > > anonymous memory can be updated. > > > > > > I just realised that something else I'm working on might be a suitable > > > alternative to this. Apologies for not realising it sooner. > > > > > > http://www.wil.cx/~willy/linux/sileby.html > > > > Just skimming: make it O_CLOEXEC by default. ;) > > I appreciate the suggestion, and it makes sense for many 'return an fd' > interfaces, but the point of mshare() is to, well, share. So sharing > the fd with a child is a common usecase, unlike say sharing a timerfd. Fair point, from reading I thought the main reason was share after fork() but having an fd over exec() may be a good use-case too. (Fwiw, this very much looks like what the memfd_create() api should have looked like, i.e. mshare() could've possibly encompassed both.) > The only other reason to use mshare() is to pass the fd over a unix > socket to a non-child, and I submit that is far less common than wanting > to share with a child. Well, we have use-cases for that too. E.g. where we need to attach to an existing pid namespace which requires a first fork() of an intermediate process so that the caller doesn't change the pid_for_children namespace. Then we setns() to the target pid namespace in the intermediate process which changes the pid_ns_children namespace such that the next process will be a proper member of the target pid namespace. Finally, we fork() the target process that is now a full member of the target pid namespace. We also set CLONE_PARENT so grandfather process == father process for the second process and then have the intermediate process exit. Some fds we only ever open or create after the intermediate process exited and some fds we can only open or create after the intermediate process has been created and then send the fds via scm (since we can't share the fdtable) to the final process. Christian
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote: > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote: > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: > > > This patchset adds support for preserving an anonymous memory range across > > > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > > > sharing memory in this manner, as opposed to re-attaching to a named > > > shared > > > memory segment, is to ensure it is mapped at the same virtual address in > > > the new process as it was in the old one. An intended use for this is to > > > preserve guest memory for guests using vfio while qemu exec's an updated > > > version of itself. By ensuring the memory is preserved at a fixed > > > address, > > > vfio mappings and their associated kernel data structures can remain > > > valid. > > > In addition, for the qemu use case, qemu instances that back guest RAM > > > with > > > anonymous memory can be updated. > > > > I just realised that something else I'm working on might be a suitable > > alternative to this. Apologies for not realising it sooner. > > > > http://www.wil.cx/~willy/linux/sileby.html > > Just skimming: make it O_CLOEXEC by default. ;) I appreciate the suggestion, and it makes sense for many 'return an fd' interfaces, but the point of mshare() is to, well, share. So sharing the fd with a child is a common usecase, unlike say sharing a timerfd. The only other reason to use mshare() is to pass the fd over a unix socket to a non-child, and I submit that is far less common than wanting to share with a child.
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote: > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: > > This patchset adds support for preserving an anonymous memory range across > > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > > sharing memory in this manner, as opposed to re-attaching to a named shared > > memory segment, is to ensure it is mapped at the same virtual address in > > the new process as it was in the old one. An intended use for this is to > > preserve guest memory for guests using vfio while qemu exec's an updated > > version of itself. By ensuring the memory is preserved at a fixed address, > > vfio mappings and their associated kernel data structures can remain valid. > > In addition, for the qemu use case, qemu instances that back guest RAM with > > anonymous memory can be updated. > > I just realised that something else I'm working on might be a suitable > alternative to this. Apologies for not realising it sooner. > > http://www.wil.cx/~willy/linux/sileby.html Just skimming: make it O_CLOEXEC by default. ;) > > To use this, you'd mshare() the anonymous memory range, essentially > detaching the VMA from the current process's mm_struct and reparenting > it to this new mm_struct, which has an fd referencing it. > > Then you call exec(), and the exec'ed task gets to call mmap() on that > new fd to attach the memory range to its own address space. > > Presto!
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote: > This patchset adds support for preserving an anonymous memory range across > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > sharing memory in this manner, as opposed to re-attaching to a named shared > memory segment, is to ensure it is mapped at the same virtual address in > the new process as it was in the old one. An intended use for this is to > preserve guest memory for guests using vfio while qemu exec's an updated > version of itself. By ensuring the memory is preserved at a fixed address, > vfio mappings and their associated kernel data structures can remain valid. > In addition, for the qemu use case, qemu instances that back guest RAM with > anonymous memory can be updated. I just realised that something else I'm working on might be a suitable alternative to this. Apologies for not realising it sooner. http://www.wil.cx/~willy/linux/sileby.html To use this, you'd mshare() the anonymous memory range, essentially detaching the VMA from the current process's mm_struct and reparenting it to this new mm_struct, which has an fd referencing it. Then you call exec(), and the exec'ed task gets to call mmap() on that new fd to attach the memory range to its own address space. Presto!
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/28/20 4:34 AM, Kirill Tkhai wrote: > On 27.07.2020 20:11, Anthony Yznaga wrote: >> This patchset adds support for preserving an anonymous memory range across >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for >> sharing memory in this manner, as opposed to re-attaching to a named shared >> memory segment, is to ensure it is mapped at the same virtual address in >> the new process as it was in the old one. An intended use for this is to >> preserve guest memory for guests using vfio while qemu exec's an updated >> version of itself. By ensuring the memory is preserved at a fixed address, > So, the goal is an update of QEMU binary without a stopping of virtual > machine? Essentially, yes. The VM is paused very briefly. Anthony > >> vfio mappings and their associated kernel data structures can remain valid. >> In addition, for the qemu use case, qemu instances that back guest RAM with >> anonymous memory can be updated. >> >> Patches 1 and 2 ensure that loading of ELF load segments does not silently >> clobber existing VMAS, and remove assumptions that the stack is the only >> VMA in the mm when the stack is set up. Patch 1 re-introduces the use of >> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues >> and could be considered on its own. >> >> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use >> using an ELF note. >> >> Anthony Yznaga (5): >> elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings >> mm: do not assume only the stack vma exists in setup_arg_pages() >> mm: introduce VM_EXEC_KEEP >> exec, elf: require opt-in for accepting preserved mem >> mm: introduce MADV_DOEXEC >> >> arch/x86/Kconfig | 1 + >> fs/binfmt_elf.c| 196 >> + >> fs/exec.c | 33 +- >> include/linux/binfmts.h| 7 +- >> include/linux/mm.h | 5 + >> include/uapi/asm-generic/mman-common.h | 3 + >> kernel/fork.c | 2 +- >> mm/madvise.c | 25 + >> mm/mmap.c | 47 >> 9 files changed, 266 insertions(+), 53 deletions(-) >>
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/28/2020 10:23 AM, Andy Lutomirski wrote: >> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga >> wrote: >> >> This patchset adds support for preserving an anonymous memory range across >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for >> sharing memory in this manner, as opposed to re-attaching to a named shared >> memory segment, is to ensure it is mapped at the same virtual address in >> the new process as it was in the old one. An intended use for this is to >> preserve guest memory for guests using vfio while qemu exec's an updated >> version of itself. By ensuring the memory is preserved at a fixed address, >> vfio mappings and their associated kernel data structures can remain valid. >> In addition, for the qemu use case, qemu instances that back guest RAM with >> anonymous memory can be updated. > > This will be an amazing attack surface. Perhaps use of this flag should > require no_new_privs? Arguably it should also require a special flag to > execve() to honor it. Otherwise library helpers that do vfork()+exec() or > posix_spawn() could be quite surprised. Preservation is disabled across fork, so fork/exec combo's are not affected. We forgot to document that. - Steve
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga > wrote: > > This patchset adds support for preserving an anonymous memory range across > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > sharing memory in this manner, as opposed to re-attaching to a named shared > memory segment, is to ensure it is mapped at the same virtual address in > the new process as it was in the old one. An intended use for this is to > preserve guest memory for guests using vfio while qemu exec's an updated > version of itself. By ensuring the memory is preserved at a fixed address, > vfio mappings and their associated kernel data structures can remain valid. > In addition, for the qemu use case, qemu instances that back guest RAM with > anonymous memory can be updated. This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs? Arguably it should also require a special flag to execve() to honor it. Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On Mon, Jul 27, 2020 at 02:00:17PM -0400, Steven Sistare wrote: > On 7/27/2020 1:07 PM, ebied...@xmission.com wrote: > > Anthony Yznaga writes: > > > >> This patchset adds support for preserving an anonymous memory range across > >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > >> sharing memory in this manner, as opposed to re-attaching to a named shared > >> memory segment, is to ensure it is mapped at the same virtual address in > >> the new process as it was in the old one. An intended use for this is to > >> preserve guest memory for guests using vfio while qemu exec's an updated > >> version of itself. By ensuring the memory is preserved at a fixed address, > >> vfio mappings and their associated kernel data structures can remain valid. > >> In addition, for the qemu use case, qemu instances that back guest RAM with > >> anonymous memory can be updated. > > > > What is wrong with using a file descriptor to a possibly deleted file > > and re-mmaping it? > > > > There is already MAP_FIXED that allows you to ensure you have the same > > address. > > MAP_FIXED blows away any existing mapping in that range, which is not the > desired behavior. We want to preserve the previously created mapping at There's also MAP_FIXED_NOREPLACE since v4.17 in case that helps. Note that this should really go to linux-api too. I won't argue to resend it since that would mean spamming everyone's inbox with the same thread again but in case you send a revised version, please ensure to Cc linux-api. The glibc folks are listening on there too. Thanks! Christian
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 27.07.2020 20:11, Anthony Yznaga wrote: > This patchset adds support for preserving an anonymous memory range across > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > sharing memory in this manner, as opposed to re-attaching to a named shared > memory segment, is to ensure it is mapped at the same virtual address in > the new process as it was in the old one. An intended use for this is to > preserve guest memory for guests using vfio while qemu exec's an updated > version of itself. By ensuring the memory is preserved at a fixed address, So, the goal is an update of QEMU binary without a stopping of virtual machine? > vfio mappings and their associated kernel data structures can remain valid. > In addition, for the qemu use case, qemu instances that back guest RAM with > anonymous memory can be updated. > > Patches 1 and 2 ensure that loading of ELF load segments does not silently > clobber existing VMAS, and remove assumptions that the stack is the only > VMA in the mm when the stack is set up. Patch 1 re-introduces the use of > MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues > and could be considered on its own. > > Patches 3, 4, and 5 introduce the feature and an opt-in method for its use > using an ELF note. > > Anthony Yznaga (5): > elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings > mm: do not assume only the stack vma exists in setup_arg_pages() > mm: introduce VM_EXEC_KEEP > exec, elf: require opt-in for accepting preserved mem > mm: introduce MADV_DOEXEC > > arch/x86/Kconfig | 1 + > fs/binfmt_elf.c| 196 > + > fs/exec.c | 33 +- > include/linux/binfmts.h| 7 +- > include/linux/mm.h | 5 + > include/uapi/asm-generic/mman-common.h | 3 + > kernel/fork.c | 2 +- > mm/madvise.c | 25 + > mm/mmap.c | 47 > 9 files changed, 266 insertions(+), 53 deletions(-) >
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
On 7/27/2020 1:07 PM, ebied...@xmission.com wrote: > Anthony Yznaga writes: > >> This patchset adds support for preserving an anonymous memory range across >> exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for >> sharing memory in this manner, as opposed to re-attaching to a named shared >> memory segment, is to ensure it is mapped at the same virtual address in >> the new process as it was in the old one. An intended use for this is to >> preserve guest memory for guests using vfio while qemu exec's an updated >> version of itself. By ensuring the memory is preserved at a fixed address, >> vfio mappings and their associated kernel data structures can remain valid. >> In addition, for the qemu use case, qemu instances that back guest RAM with >> anonymous memory can be updated. > > What is wrong with using a file descriptor to a possibly deleted file > and re-mmaping it? > > There is already MAP_FIXED that allows you to ensure you have the same > address. MAP_FIXED blows away any existing mapping in that range, which is not the desired behavior. We want to preserve the previously created mapping at the same VA and co-exist with other mappings created by the new process. There is no way to guarantee availability of a VA post-exec. > I think all it would take would be a small protocol from one version > to the next to say map file descriptor #N and address #A. Something > easily passed on the command line. > > There is just enough complexity in exec currently that our > implementation of exec is already teetering. So if we could use > existing mechanisms it would be good. > > And I don't see why this version of sharing a mmap area would be > particularly general. I do imagine that being able to force a > mmap area into a setuid executable would be a fun attack vector. Any mmap(MAP_ANON) segment can be preserved. That is very general, and is the case we need to support to upgrade legacy applications that are already running (such as qemu) -- we cannot recode them before they are updated. > Perhaps I missed this in my skim of these patches but I did not see > anything that guarded this feature against an exec that changes an > applications privileges. The ELF opt-in flag must be set, so only selected executables will accept incoming mappings. The exec() code verifies that the preserved mappings do not overlap or become adjacent to text or stack, so it is not possible for example for an attacker to cause stack underflow or overflow to access injected content. A gadget invoked by some other attack could access the preserved content. - Steve >> Patches 1 and 2 ensure that loading of ELF load segments does not silently >> clobber existing VMAS, and remove assumptions that the stack is the only >> VMA in the mm when the stack is set up. Patch 1 re-introduces the use of >> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues >> and could be considered on its own. >> >> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use >> using an ELF note. >> >> Anthony Yznaga (5): >> elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings >> mm: do not assume only the stack vma exists in setup_arg_pages() >> mm: introduce VM_EXEC_KEEP >> exec, elf: require opt-in for accepting preserved mem >> mm: introduce MADV_DOEXEC >> >> arch/x86/Kconfig | 1 + >> fs/binfmt_elf.c| 196 >> + >> fs/exec.c | 33 +- >> include/linux/binfmts.h| 7 +- >> include/linux/mm.h | 5 + >> include/uapi/asm-generic/mman-common.h | 3 + >> kernel/fork.c | 2 +- >> mm/madvise.c | 25 + >> mm/mmap.c | 47 >> 9 files changed, 266 insertions(+), 53 deletions(-)
Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
Anthony Yznaga writes: > This patchset adds support for preserving an anonymous memory range across > exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for > sharing memory in this manner, as opposed to re-attaching to a named shared > memory segment, is to ensure it is mapped at the same virtual address in > the new process as it was in the old one. An intended use for this is to > preserve guest memory for guests using vfio while qemu exec's an updated > version of itself. By ensuring the memory is preserved at a fixed address, > vfio mappings and their associated kernel data structures can remain valid. > In addition, for the qemu use case, qemu instances that back guest RAM with > anonymous memory can be updated. What is wrong with using a file descriptor to a possibly deleted file and re-mmaping it? There is already MAP_FIXED that allows you to ensure you have the same address. I think all it would take would be a small protocol from one version to the next to say map file descriptor #N and address #A. Something easily passed on the command line. There is just enough complexity in exec currently that our implementation of exec is already teetering. So if we could use existing mechanisms it would be good. And I don't see why this version of sharing a mmap area would be particularly general. I do imagine that being able to force a mmap area into a setuid executable would be a fun attack vector. Perhaps I missed this in my skim of these patches but I did not see anything that guarded this feature against an exec that changes an applications privileges. Eric > Patches 1 and 2 ensure that loading of ELF load segments does not silently > clobber existing VMAS, and remove assumptions that the stack is the only > VMA in the mm when the stack is set up. Patch 1 re-introduces the use of > MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues > and could be considered on its own. > > Patches 3, 4, and 5 introduce the feature and an opt-in method for its use > using an ELF note. > > Anthony Yznaga (5): > elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings > mm: do not assume only the stack vma exists in setup_arg_pages() > mm: introduce VM_EXEC_KEEP > exec, elf: require opt-in for accepting preserved mem > mm: introduce MADV_DOEXEC > > arch/x86/Kconfig | 1 + > fs/binfmt_elf.c| 196 > + > fs/exec.c | 33 +- > include/linux/binfmts.h| 7 +- > include/linux/mm.h | 5 + > include/uapi/asm-generic/mman-common.h | 3 + > kernel/fork.c | 2 +- > mm/madvise.c | 25 + > mm/mmap.c | 47 > 9 files changed, 266 insertions(+), 53 deletions(-)
[RFC PATCH 0/5] madvise MADV_DOEXEC
This patchset adds support for preserving an anonymous memory range across exec(3) using a new madvise MADV_DOEXEC argument. The primary benefit for sharing memory in this manner, as opposed to re-attaching to a named shared memory segment, is to ensure it is mapped at the same virtual address in the new process as it was in the old one. An intended use for this is to preserve guest memory for guests using vfio while qemu exec's an updated version of itself. By ensuring the memory is preserved at a fixed address, vfio mappings and their associated kernel data structures can remain valid. In addition, for the qemu use case, qemu instances that back guest RAM with anonymous memory can be updated. Patches 1 and 2 ensure that loading of ELF load segments does not silently clobber existing VMAS, and remove assumptions that the stack is the only VMA in the mm when the stack is set up. Patch 1 re-introduces the use of MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues and could be considered on its own. Patches 3, 4, and 5 introduce the feature and an opt-in method for its use using an ELF note. Anthony Yznaga (5): elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings mm: do not assume only the stack vma exists in setup_arg_pages() mm: introduce VM_EXEC_KEEP exec, elf: require opt-in for accepting preserved mem mm: introduce MADV_DOEXEC arch/x86/Kconfig | 1 + fs/binfmt_elf.c| 196 + fs/exec.c | 33 +- include/linux/binfmts.h| 7 +- include/linux/mm.h | 5 + include/uapi/asm-generic/mman-common.h | 3 + kernel/fork.c | 2 +- mm/madvise.c | 25 + mm/mmap.c | 47 9 files changed, 266 insertions(+), 53 deletions(-) -- 1.8.3.1