Re: Approaches for same-on-same linux-user execve?
> ARM's armie takes a different approach with the trap and emulate of
> SIGILL instructions. This works well for the occasional "new"
> instruction but will be less efficient overall if your instruction
> stream is entirely novel.

To clarify: earlier versions of armie did use the SIGILL
trap-and-emulate method, which was limited. Recent versions, including
the latest release, are based on the DynamoRIO platform
(https://dynamorio.org), which enables full emulation and
instrumentation. By default DynamoRIO, and by extension armie, follows
all child processes; see
https://dynamorio.org/page_deploy.html#op_children.

As new Arm architecture features are added to QEMU (e.g. SVE, SVE2,
SME), there is an expectation in the Arm community that QEMU can run
large Arm user-space applications on Arm hardware, making the lack of
same-on-same execve a not insignificant blocker.

AIUI, given the open-source licensing of QEMU and DynamoRIO, there would
be no legal reason for QEMU not to borrow from DynamoRIO.

On Fri, 8 Oct 2021 at 11:49, Alex Bennée wrote:
>
>
> Arnd Bergmann writes:
>
> > On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote:
> >>
> >> I came across a use-case this week for ARM although this may be also
> >> applicable to architectures where QEMU's emulation is ahead of the
> >> hardware currently widely available - for example if you want to
> >> exercise SVE code on AArch64. When the linux-user architecture is not
> >> the same as the host architecture then binfmt_misc works perfectly fine.
> >>
> >> However in the case you are running same-on-same you can't use
> >> binfmt_misc to redirect execution to using QEMU because any attempt to
> >> trap native binaries will cause your userspace to hang as binfmt_misc
> >> will be invoked to run the QEMU binary needed to run your application
> >> and a deadlock ensues.
> >
> > Can you clarify how the code would run in this case? Does qemu-user
> > still emulate every single instruction, both the compatible and the
> > incompatible ones, or is the idea here to run as much as possible
> > natively and only emulate the instructions that are not available
> > natively, using either SIGILL or searching through the object code
> > for those instructions?
>
> qemu-user only ever does a complete translation. The hope is of course
> our translator is "fairly efficient" so for example integer SVE
> operations should get unrolled into a series of AdvSIMD instructions on
> the backend.
>
> ARM's armie takes a different approach with the trap and emulate of
> SIGILL instructions. This works well for the occasional "new"
> instruction but will be less efficient overall if your instruction
> stream is entirely novel.
>
> >> Trap execve in QEMU linux-user
> >> --
> >>
> >> We could add a flag to QEMU so at the point of execve it manually
> >> invokes the new process with QEMU, passing on the flag to persist this
> >> behaviour.
> >
> > This sounds like the obvious approach if you already do a full
> > instruction emulation. You'd still need to run the parent process
> > by calling qemu-user manually, but I suppose you need to do
> > something like this in any case.
> >
> >> Add path mask to binfmt_misc
> >>
> >>
> >> The other option would be to extend binfmt_misc to have a path mask so
> >> it only applies its alternative execution scheme to binaries in a
> >> particular section of the file-system (or maybe some sort of pattern?).
> >
> > The main downside I see here is that it requires kernel modification, so
> > it would not work for old kernels.
> >
> >> Are there any other approaches you could take? Which do you think has
> >> the most merit?
> >
> > If we modify binfmt_misc in the kernel, it might be helpful to do it
> > by extending it with namespace support, so it could be constrained
> > to a single container without having to do the emulation outside.
> > Unfortunately that does not solve the problem of preventing the
> > qemu-user binary from triggering the binfmt_misc lookup.
>
> I wonder how that would interact with the persistent ("P") mode of
> binfmt_misc. The backend is identified at the start and gets re-used
> rather than looked up each time.
>
> >
> >Arnd
>
>
> --
> Alex Bennée
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote:
>
> Are there any other approaches you could take? Which do you think has
> the most merit?

Reading through the ELF loader code in the kernel, I had another idea:
if qemu-user could be turned into a replacement for /lib/ld.so and act
as an ELF interpreter rather than a binfmt_misc helper, this might
address a lot of the issues automatically.

It would need to be a statically linked binary so it doesn't itself
require an interpreter. It would have to do the job of ld.so in
addition to the emulation, but it could do that by finding the real
ld.so somewhere else and running that in emulation mode. It would also
not work at all for statically linked executables.

Not sure if that makes the tradeoffs better than your other suggestions,
but it seemed worth bringing up.

      Arnd
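
To make the ELF-interpreter idea above a little more concrete: the
kernel only hands control to an interpreter when the executable carries
a PT_INTERP program header, which is why statically linked executables
fall outside this scheme. The following is a minimal sketch, not
anything from QEMU or the kernel; the function name elf_interpreter is
made up for illustration, and it assumes a 64-bit little-endian ELF
image (as on AArch64 Linux).

```python
import struct

PT_INTERP = 3  # program header type that names the ELF interpreter


def elf_interpreter(image: bytes):
    """Return the PT_INTERP path of a 64-bit little-endian ELF, or None.

    None means the image requests no interpreter, as is the case for a
    statically linked executable.
    """
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:
        raise ValueError("sketch handles ELFCLASS64 little-endian only")
    # Elf64_Ehdr: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", image, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)
    for index in range(e_phnum):
        base = e_phoff + index * e_phentsize
        # Elf64_Phdr starts with:
        # p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz.
        p_type, _, p_offset, _, _, p_filesz = struct.unpack_from(
            "<IIQQQQ", image, base
        )
        if p_type == PT_INTERP:
            raw = image[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None
```

Calling elf_interpreter(open(path, "rb").read()) on a dynamically linked
binary should give the path of its ld.so. A binary for which it returns
None would never reach a qemu-user installed as the interpreter, which
is the limitation noted above.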
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 07, 2021 at 03:32:19PM +0100, Alex Bennée wrote:
> Hi,
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
>
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_mask appropriately. This works but is messy and a faff to
> set-up.
>
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
>
> Trap execve in QEMU linux-user
> --
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.
>
>
> Add path mask to binfmt_misc
>
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
>
> Are there any other approaches you could take? Which do you think has
> the most merit?

Could a new Linux personality flag be useful in combination with a new
flag in binfmt_misc? E.g. a flag "E" for binfmt_misc which indicates the
rule must only be applied if the process is execve()d with the
PER_USE_BINFMT personality set.

That would let you add a native match rule to binfmt_misc without it
affecting your system initially. To then run native binaries via
qemu-user you just need to set the personality() flag, and then only
that sub-process tree gets redirected.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
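
The proposed rule selection can be modelled in a few lines. This is
only a toy model of the suggestion above, not kernel code: neither the
"E" flag nor PER_USE_BINFMT exists today, the bit value below is an
arbitrary placeholder, and the Rule class and select_interpreter
function are names invented for the sketch.

```python
from dataclasses import dataclass

# Hypothetical personality bit from the proposal above; the value is an
# arbitrary placeholder, not a real Linux constant.
PER_USE_BINFMT = 0x10000000


@dataclass(frozen=True)
class Rule:
    magic: bytes       # leading bytes the binary must start with
    interpreter: str   # emulator to run when the rule applies
    flags: str = ""    # "E" is the hypothetical opt-in flag


def select_interpreter(rule, header, personality):
    """Return the interpreter for this exec, or None to run natively."""
    if not header.startswith(rule.magic):
        return None
    # With "E" set, the rule is dormant unless the calling process has
    # opted in through its personality.
    if "E" in rule.flags and not (personality & PER_USE_BINFMT):
        return None
    return rule.interpreter
```

With a rule registered as Rule(b"\x7fELF", "/usr/bin/qemu-aarch64", "E"),
an ordinary process with personality 0 keeps running native binaries
directly, while a process that has set the bit gets the emulator. Since
the personality is inherited by child processes, the redirection would
stay confined to that sub-process tree.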
Re: Approaches for same-on-same linux-user execve?
Arnd Bergmann writes:

> On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote:
>>
>> I came across a use-case this week for ARM although this may be also
>> applicable to architectures where QEMU's emulation is ahead of the
>> hardware currently widely available - for example if you want to
>> exercise SVE code on AArch64. When the linux-user architecture is not
>> the same as the host architecture then binfmt_misc works perfectly fine.
>>
>> However in the case you are running same-on-same you can't use
>> binfmt_misc to redirect execution to using QEMU because any attempt to
>> trap native binaries will cause your userspace to hang as binfmt_misc
>> will be invoked to run the QEMU binary needed to run your application
>> and a deadlock ensues.
>
> Can you clarify how the code would run in this case? Does qemu-user
> still emulate every single instruction, both the compatible and the
> incompatible ones, or is the idea here to run as much as possible
> natively and only emulate the instructions that are not available
> natively, using either SIGILL or searching through the object code
> for those instructions?

qemu-user only ever does a complete translation. The hope is of course
our translator is "fairly efficient" so for example integer SVE
operations should get unrolled into a series of AdvSIMD instructions on
the backend.

ARM's armie takes a different approach with the trap and emulate of
SIGILL instructions. This works well for the occasional "new"
instruction but will be less efficient overall if your instruction
stream is entirely novel.

>> Trap execve in QEMU linux-user
>> --
>>
>> We could add a flag to QEMU so at the point of execve it manually
>> invokes the new process with QEMU, passing on the flag to persist this
>> behaviour.
>
> This sounds like the obvious approach if you already do a full
> instruction emulation. You'd still need to run the parent process
> by calling qemu-user manually, but I suppose you need to do
> something like this in any case.
>
>> Add path mask to binfmt_misc
>>
>>
>> The other option would be to extend binfmt_misc to have a path mask so
>> it only applies its alternative execution scheme to binaries in a
>> particular section of the file-system (or maybe some sort of pattern?).
>
> The main downside I see here is that it requires kernel modification, so
> it would not work for old kernels.
>
>> Are there any other approaches you could take? Which do you think has
>> the most merit?
>
> If we modify binfmt_misc in the kernel, it might be helpful to do it
> by extending it with namespace support, so it could be constrained
> to a single container without having to do the emulation outside.
> Unfortunately that does not solve the problem of preventing the
> qemu-user binary from triggering the binfmt_misc lookup.

I wonder how that would interact with the persistent ("P") mode of
binfmt_misc. The backend is identified at the start and gets re-used
rather than looked up each time.

>
>Arnd

-- 
Alex Bennée
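
The "trap execve" option quoted above amounts to rewriting the exec
request before it reaches the host kernel. Below is a minimal sketch of
that rewriting as a pure function. It is not QEMU's implementation:
rewrite_execve is an invented name, and the persist flag is passed in
as a parameter because no such option has been defined. The sketch
assumes qemu-user's existing -0 option is used to keep the guest's
original argv[0].

```python
def rewrite_execve(qemu_path, persist_flag, path, argv):
    """Return the (path, argv) to exec so the child also runs under QEMU.

    persist_flag stands in for the hypothetical option that tells the
    child QEMU to keep trapping execve in its own descendants.
    """
    guest_argv0 = argv[0] if argv else path
    new_argv = [
        qemu_path,
        persist_flag,        # hypothetical: keep following execve
        "-0", guest_argv0,   # preserve the argv[0] the guest asked for
        path,                # binary to emulate
        *argv[1:],           # remaining guest arguments unchanged
    ]
    return qemu_path, new_argv
```

For a guest calling execve("/bin/ls", ["ls", "-l"]), this gives an
argument vector that starts the emulator on /bin/ls with the flag
carried forward, so the behaviour persists down the process tree
without any binfmt_misc involvement. As noted above, the first process
still has to be started under qemu-user by hand.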
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 7, 2021 at 8:56 AM Alex Bennée wrote:

> Hi,
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
>
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_mask appropriately. This works but is messy and a faff to
> set-up.
>
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
>

32-bit arm had an 'eabi' section in ELF binaries. There it would have
been possible to look at that and make a decision before the binary
starts executing to see whether it should just run, or fork the
linux-user binary. It would take kernel changes, though.

> Trap execve in QEMU linux-user
> --
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.
>

The bsd-user code differs a little from linux-user in that it looks at
the binary being exec'd to determine what to do. It works OK, but isn't
really for this situation (we use it to optimize our package builds with
additional path processing for our mixed binary situation where we have
native binaries execing emulated binaries that then exec native binaries
again). It is a bit of a hack, though, and I'm not completely happy with
it.

> Add path mask to binfmt_misc
>
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
>
> Are there any other approaches you could take? Which do you think has
> the most merit?
>

In by-gone times, brandelf was used for situations where you wanted to
run an ELF binary with one emulation that looks like another. But that's
also kernel hacks and also touching the local binary.

There's also the option of doing a VM86-like thing that allowed people
to run 16-bit x86 binaries on 32-bit processors. There the system calls
would SEGV and you'd decode them inline, execute the emulation and move
the IP to execute the next instruction after the INT XX system call. You
could create a loader that knows how to load the binaries and catch
SIGILL and then emulate the new instructions on the old processor, but
that's somewhat different than how qemu user-mode works today. But
knowing you'd need to do this is hard, potentially. One could expand the
kernel to load in SIGILL handlers on-demand for programs that do this,
but that wouldn't work with old kernels and just feels weird...

Warner
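
As a rough illustration of "look at the binary being exec'd to
determine what to do", the sketch below reads e_machine from the ELF
header and picks an emulator only when it differs from the host. This
is not the bsd-user code; elf_machine, choose_launcher and the
emulators table are all assumptions made for the example.

```python
import struct

EM_X86_64 = 62
EM_AARCH64 = 183


def elf_machine(header: bytes):
    """Return e_machine from an ELF header, or None if it is not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # EI_DATA (byte 5): 1 is little-endian, 2 is big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def choose_launcher(header, host_machine, emulators):
    """Return the emulator path to prepend, or None to exec natively."""
    machine = elf_machine(header)
    if machine is None or machine == host_machine:
        return None
    # emulators maps e_machine values to emulator paths (illustrative).
    return emulators.get(machine)
```

This also shows why the approach does not carry over directly to the
same-on-same case: an AArch64 binary that uses SVE has the same
e_machine as the host, so a check like this always chooses native
execution.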
Re: Approaches for same-on-same linux-user execve?
On 07/10/2021 at 16:32, Alex Bennée wrote:
> Hi,
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.
>
> There are some hacks you can apply at a local level like tweaking the
> elf header of the binaries you want to run under emulation and adjusting
> the binfmt_mask appropriately. This works but is messy and a faff to
> set-up.
>
> An ideal setup would be for the kernel to catch a SIGILL from a
> failing user space program and then to re-launch the process using QEMU
> with the old process's maps and execution state so it could continue.
> However I suspect there are enough moving parts to make this very
> fragile (e.g. what happens to the results of library feature probing
> code). So two approaches I can think of are:
>
> Trap execve in QEMU linux-user
> --
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.

Another approach could be to use ptrace(PTRACE_SYSEMU) to catch
syscalls. We need a wrapper that loads the first target binary and
forks; it then attaches a ptrace() process and intercepts the syscalls
to emulate them, as we do in User-Mode Linux.

I was thinking of this solution, for instance, to execute big-endian
programs (like ppc64) on a little-endian system (ppc64le).

But I'm not sure it fits what you need...

>
> Add path mask to binfmt_misc
>
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).
>
> Are there any other approaches you could take? Which do you think has
> the most merit?

I don't know if it can apply to what you want, but years ago I wrote a
binfmt namespace that applies the binfmt configuration only to a
container. I didn't finish the work (it seems there can be some
security issues in what I did):

https://lore.kernel.org/lkml/20191216091220.465626-2-laur...@vivier.eu/T/

Thanks,
Laurent
Re: Approaches for same-on-same linux-user execve?
On Thu, Oct 7, 2021 at 4:32 PM Alex Bennée wrote:
>
> I came across a use-case this week for ARM although this may be also
> applicable to architectures where QEMU's emulation is ahead of the
> hardware currently widely available - for example if you want to
> exercise SVE code on AArch64. When the linux-user architecture is not
> the same as the host architecture then binfmt_misc works perfectly fine.
>
> However in the case you are running same-on-same you can't use
> binfmt_misc to redirect execution to using QEMU because any attempt to
> trap native binaries will cause your userspace to hang as binfmt_misc
> will be invoked to run the QEMU binary needed to run your application
> and a deadlock ensues.

Can you clarify how the code would run in this case? Does qemu-user
still emulate every single instruction, both the compatible and the
incompatible ones, or is the idea here to run as much as possible
natively and only emulate the instructions that are not available
natively, using either SIGILL or searching through the object code for
those instructions?

> Trap execve in QEMU linux-user
> --
>
> We could add a flag to QEMU so at the point of execve it manually
> invokes the new process with QEMU, passing on the flag to persist this
> behaviour.

This sounds like the obvious approach if you already do a full
instruction emulation. You'd still need to run the parent process
by calling qemu-user manually, but I suppose you need to do
something like this in any case.

> Add path mask to binfmt_misc
>
>
> The other option would be to extend binfmt_misc to have a path mask so
> it only applies its alternative execution scheme to binaries in a
> particular section of the file-system (or maybe some sort of pattern?).

The main downside I see here is that it requires kernel modification, so
it would not work for old kernels.

> Are there any other approaches you could take? Which do you think has
> the most merit?

If we modify binfmt_misc in the kernel, it might be helpful to do it
by extending it with namespace support, so it could be constrained
to a single container without having to do the emulation outside.
Unfortunately that does not solve the problem of preventing the
qemu-user binary from triggering the binfmt_misc lookup.

      Arnd
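
The lookup problem mentioned above, and how a path mask would sidestep
it, can be shown with a toy model of magic/mask matching. Everything
here is a simplification for illustration: the magic and mask are cut
down from what a real AArch64 rule would use, path_prefix stands for
the proposed (non-existent) path mask, and the iteration limit merely
stands in for whatever the kernel actually does when a rule keeps
matching its own interpreter.

```python
# Simplified magic/mask: ELF ident bytes 0-6 plus e_type/e_machine of a
# little-endian AArch64 executable; bytes 7-15 are ignored by the mask.
MAGIC = bytes.fromhex("7f454c46020101") + bytes(9) + bytes([0x02, 0x00, 0xB7, 0x00])
MASK = bytes([0xFF] * 7) + bytes(9) + bytes([0xFF] * 4)


def rule_matches(header, path, path_prefix=None):
    """True if the rule applies to this binary."""
    if path_prefix is not None and not path.startswith(path_prefix):
        return False  # hypothetical path mask excludes this binary
    if len(header) < len(MAGIC):
        return False
    return all((h ^ m) & k == 0 for h, m, k in zip(header, MAGIC, MASK))


def launch_chain(path, headers, interpreter, path_prefix=None, limit=4):
    """Follow rule matches from path; headers maps path -> header bytes."""
    chain = [path]
    while rule_matches(headers[path], path, path_prefix):
        if len(chain) > limit:
            raise RecursionError("rule keeps matching its own interpreter")
        path = interpreter
        chain.append(path)
    return chain
```

In the same-on-same case the emulator and the application have the same
header, so without a path restriction the rule matches qemu-user itself
and the chain never terminates. Restricting the rule to, say, a
hypothetical /opt/sve/ prefix lets the application under that prefix be
redirected while the emulator in /usr/bin runs natively.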