Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Paul Gofman
On 11/19/20 23:54, Paul Gofman wrote:
> On 11/19/20 20:57, David Laight wrote:
>>>> The Windows code is not completely loaded at initialization time.  It
>>>> also has dynamic libraries loaded later.  yes, wine knows the memory
>>>> regions, but there is no guarantee there is a small number of segments
>>>> or that the full picture is known at any given moment.
>>> Yes, I didn't mean it was known statically at init time (although
>>> maybe it can be; see below) just that all the code doing the loading
>>> is under Wine's control (vs having system dynamic linker doing stuff
>>> it can't reliably see, which is the case with host libraries).
>> Since wine must itself make the mmap() system calls that make memory
>> executable can't it arrange for windows code and linux code to be
>> above/below some critical address?
>>
>> IIRC 32bit windows has the user/kernel split at 2G, so all the
>> linux code could be shoe-horned into the top 1GB.
>>
>> A similar boundary could be picked for 64bit code.
>>
>> This would probably require flags to mmap() to map above/below
>> the specified address (is there a flag for the 2G boundary
>> these days - wine used to do very horrid things).
>> It might also need a special elf interpreter to load the
>> wine code itself high.
>>
> Wine does not control the loading of native libraries (which are subject
> to ASLR and thus do not necessarily exactly follow mmap's top down
> order). Wine is also not free to choose where to load the Windows
> libraries. Some of Win libraries are relocatable, some are not. Even
> those relocatable are still often assumed to be loaded at the base
> address specified in PE, with assumption made either by library itself
> or DRM or sandboxing / hotpatching / interception code from around.
>
> Also, it is very common to DRMs to unpack the encrypted code to a newly
> allocated segment (which gives no clue at the moment of allocation
> whether it is going to be executable later), and then make it
> executable. There are a lot of tricks about that and such code sometimes
> assumes very specific (and Windows implementation dependent) things, in
> particular, about the memory layout. Windows VirtualAlloc[Ex] gives the
> way to request top down or bottom up allocation order, as well as
> specific allocation address. The latter is not guaranteed to succeed of
> course just like on Linux for obvious reasons, but if specific (high)
> address ranges always have some space available on Windows, then there
> are apps in the wild which depend on that, as far as our practice goes.
>
> If we were given mmap flag for specifying memory allocation boundary,
> and also a sort of process-wide dlopen() config option for specifying
> that boundary for every host shared library load, the address space
> separation could probably work... until we hit a tricky case when the
> app wants to get a memory specifically high address range. I think we
> can't do that cleanly as both Windows and Linux currently have the same
> 128TB limit for user address space on x64 and we've got no spare space
> to safely put native code without potential interference with Windows code.
>
Maybe it is also worth mentioning that the initial version of Gabriel's
patches introduced the emulation trigger by specifying a flag for a
memory region through mprotect(), so we could mark the regions from
which calls should be trapped. That would probably be the easiest
possible solution in terms of using it in Wine (as no memory allocated
by Wine itself is supposed to contain native host syscalls), but that
idea was not accepted, mainly because, as I understand it, such
functionality does not belong in VM management.



Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Paul Gofman
On 11/19/20 20:57, David Laight wrote:
>>> The Windows code is not completely loaded at initialization time.  It
>>> also has dynamic libraries loaded later.  yes, wine knows the memory
>>> regions, but there is no guarantee there is a small number of segments
>>> or that the full picture is known at any given moment.
>> Yes, I didn't mean it was known statically at init time (although
>> maybe it can be; see below) just that all the code doing the loading
>> is under Wine's control (vs having system dynamic linker doing stuff
>> it can't reliably see, which is the case with host libraries).
> Since wine must itself make the mmap() system calls that make memory
> executable can't it arrange for windows code and linux code to be
> above/below some critical address?
>
> IIRC 32bit windows has the user/kernel split at 2G, so all the
> linux code could be shoe-horned into the top 1GB.
>
> A similar boundary could be picked for 64bit code.
>
> This would probably require flags to mmap() to map above/below
> the specified address (is there a flag for the 2G boundary
> these days - wine used to do very horrid things).
> It might also need a special elf interpreter to load the
> wine code itself high.
>
Wine does not control the loading of native libraries (which are subject
to ASLR and thus do not necessarily exactly follow mmap's top down
order). Wine is also not free to choose where to load the Windows
libraries. Some Windows libraries are relocatable, some are not. Even
the relocatable ones are still often assumed to be loaded at the base
address specified in the PE file, with that assumption made either by
the library itself or by DRM, sandboxing, hotpatching, or interception
code around it.

Also, it is very common for DRM to unpack encrypted code into a newly
allocated segment (which gives no clue at the moment of allocation
whether it is going to be executable later), and then make it
executable. There are a lot of tricks around that, and such code sometimes
assumes very specific (and Windows implementation dependent) things, in
particular about the memory layout. Windows VirtualAlloc[Ex] provides a
way to request top-down or bottom-up allocation order, as well as a
specific allocation address. The latter is of course not guaranteed to
succeed, just as on Linux, but if specific (high) address ranges always
have some space available on Windows, then there are apps in the wild
which depend on that, as far as our practice goes.

If we were given an mmap flag for specifying a memory allocation boundary,
and also some sort of process-wide dlopen() config option for applying
that boundary to every host shared library load, the address space
separation could probably work... until we hit a tricky case where the
app wants memory specifically in a high address range. I think we
can't do that cleanly, as both Windows and Linux currently have the same
128TB limit for user address space on x64 and we have no spare space
to safely put native code without potential interference with Windows code.



RE: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread David Laight
> > The Windows code is not completely loaded at initialization time.  It
> > also has dynamic libraries loaded later.  yes, wine knows the memory
> > regions, but there is no guarantee there is a small number of segments
> > or that the full picture is known at any given moment.
> 
> Yes, I didn't mean it was known statically at init time (although
> maybe it can be; see below) just that all the code doing the loading
> is under Wine's control (vs having system dynamic linker doing stuff
> it can't reliably see, which is the case with host libraries).

Since wine must itself make the mmap() system calls that make memory
executable can't it arrange for windows code and linux code to be
above/below some critical address?

IIRC 32bit windows has the user/kernel split at 2G, so all the
linux code could be shoe-horned into the top 1GB.

A similar boundary could be picked for 64bit code.

This would probably require flags to mmap() to map above/below
the specified address (is there a flag for the 2G boundary
these days - wine used to do very horrid things).
It might also need a special elf interpreter to load the
wine code itself high.

David




Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Rich Felker
On Thu, Nov 19, 2020 at 12:32:54PM -0500, Gabriel Krisman Bertazi wrote:
> Rich Felker  writes:
> 
> > On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
> >> Rich Felker  writes:
> >> 
> >> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via 
> >> > Libc-alpha wrote:
> >> 
> >> [...]
> >> 
> >> >
> >> > SIGSYS (or signal handling in general) is not the right way to do
> >> > this. It has all the same problems that came up in seccomp filtering
> >> > with SIGSYS, and which were solved by user_notif mode (running the
> >> > interception in a separate thread rather than an async context
> >> > interrupting the syscall). In fact I wouldn't be surprised if what you
> >> > want can already be done with reasonable efficiency using seccomp
> >> > user_notif.
> >> 
> >> Hi Rich,
> >> 
> >> User_notif was raised in the kernel discussion and we had experimented
> >> with it, but the latency of user_notif is even worse than what we can do
> >> right now with other seccomp actions.
> >
> > Is there a compelling argument that the latency matters here? What
> > syscalls are windows binaries making like this? Is there a reason you
> > can't do something like intercepting the syscall with seccomp the
> > first time it happens, then rewriting the code not to use a direct
> > syscall on future invocations?
> 
> We can't do any code rewriting without tripping DRM protections and
> anti-cheating mechanisms.

I think you could if you maintained separate versions of the code for
read vs exec access ala some oldschool hardening tricks, but maybe
that's not compatible with windows code (or with 64-bit mode?).
Actually it's rather impressive that a DRM/anti-cheat mess works on
Wine at all.

> I should correct myself here.  While it is true that user_notif is
> slower than other seccomp actions, this is not a problem in itself.  The
> frequency of syscalls that need to be emulated is much smaller than
> regular syscalls, and the performance problem actually appears due to
> the filtering.  I should investigate user_notif more, but I don't oppose
> SUD doing user_notif instead of SIGSYS.  I will raise that with Wine
> developers and the kernel community.

Thanks! Avoiding repetition of the SIGSYS pitfall would be a good
thing.

> >> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
> >> return to a userspace thunk, but the understanding among Wine developers
> >> is that SIGSYS is enough for their emulation needs.
> >
> > It might work for Wine needs, if Wine can guarantee it will never be
> > running code with signals blocked and some other constraints, but then
> > you end up with a mechanism that's designed just for Wine and that
> > will have gratuitous reasons it's not usable elsewhere. That does not
> > seem appropriate for inclusion in kernel.
> >
> >> > The default-intercept and excepting libc code segment is also bogus,
> >> > and will break stuff, including vdso syscall mechanism on i386 and any
> >> > code outside libc that makes its own syscalls from asm. If you need to
> >> > tag regions to control interception, it should be tagging the emulated
> >> > Windows guest code, which is bounded and you have full control over,
> >> > rather than the host code, which is unbounded and includes any
> >> > libraries that get linked indirectly by Wine.
> >> 
> >> The vdso trampoline, for the architectures that have it, is solved by
> >> the kernel implementation, who makes sure that region is allowed.
> >
> > I guess that works but it's ugly and assumes particular policy goals
> > matching Wine's rather than being a general mechanism.
> >
> >> The Linux code is not bounded, but the dispatcher region main goal is to
> >> support trampolines outside of the vdso case. The correct userspace
> >> implementation requires flipping the selector on any Windows/Linux code
> >> boundary cross, exactly because other libraries can issue syscalls
> >> directly.  The fact that libc is not the only one issuing syscalls is
> >> the exact reason we need something more complex than a few seccomp
> >> filters.
> >
> > I don't think this is correct. Rather than listing all the host
> > library code ranges to allow, you just list all the guest Windows code
> > ranges to intercept. Wine knows them by virtue of being the loader for
> > them. This all seems really easy to do with seccomp with a very small
> > filter.
> 
> The Windows code is not completely loaded at initialization time.  It
> also has dynamic libraries loaded later.  yes, wine knows the memory
> regions, but there is no guarantee there is a small number of segments
> or that the full picture is known at any given moment.

Yes, I didn't mean it was known statically at init time (although
maybe it can be; see below) just that all the code doing the loading
is under Wine's control (vs having system dynamic linker doing stuff
it can't reliably see, which is the case with host libraries).

> >> > But I'm skeptical that doing any 

Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Gabriel Krisman Bertazi
Rich Felker  writes:

> On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
>> Rich Felker  writes:
>> 
>> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via 
>> > Libc-alpha wrote:
>> 
>> [...]
>> 
>> >
>> > SIGSYS (or signal handling in general) is not the right way to do
>> > this. It has all the same problems that came up in seccomp filtering
>> > with SIGSYS, and which were solved by user_notif mode (running the
>> > interception in a separate thread rather than an async context
>> > interrupting the syscall). In fact I wouldn't be surprised if what you
>> > want can already be done with reasonable efficiency using seccomp
>> > user_notif.
>> 
>> Hi Rich,
>> 
>> User_notif was raised in the kernel discussion and we had experimented
>> with it, but the latency of user_notif is even worse than what we can do
>> right now with other seccomp actions.
>
> Is there a compelling argument that the latency matters here? What
> syscalls are windows binaries making like this? Is there a reason you
> can't do something like intercepting the syscall with seccomp the
> first time it happens, then rewriting the code not to use a direct
> syscall on future invocations?

We can't do any code rewriting without tripping DRM protections and
anti-cheating mechanisms.

I should correct myself here.  While it is true that user_notif is
slower than other seccomp actions, this is not a problem in itself.  The
frequency of syscalls that need to be emulated is much smaller than
regular syscalls, and the performance problem actually appears due to
the filtering.  I should investigate user_notif more, but I don't oppose
SUD doing user_notif instead of SIGSYS.  I will raise that with Wine
developers and the kernel community.

>> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
>> return to a userspace thunk, but the understanding among Wine developers
>> is that SIGSYS is enough for their emulation needs.
>
> It might work for Wine needs, if Wine can guarantee it will never be
> running code with signals blocked and some other constraints, but then
> you end up with a mechanism that's designed just for Wine and that
> will have gratuitous reasons it's not usable elsewhere. That does not
> seem appropriate for inclusion in kernel.
>
>> > The default-intercept and excepting libc code segment is also bogus,
>> > and will break stuff, including vdso syscall mechanism on i386 and any
>> > code outside libc that makes its own syscalls from asm. If you need to
>> > tag regions to control interception, it should be tagging the emulated
>> > Windows guest code, which is bounded and you have full control over,
>> > rather than the host code, which is unbounded and includes any
>> > libraries that get linked indirectly by Wine.
>> 
>> The vdso trampoline, for the architectures that have it, is solved by
>> the kernel implementation, who makes sure that region is allowed.
>
> I guess that works but it's ugly and assumes particular policy goals
> matching Wine's rather than being a general mechanism.
>
>> The Linux code is not bounded, but the dispatcher region main goal is to
>> support trampolines outside of the vdso case. The correct userspace
>> implementation requires flipping the selector on any Windows/Linux code
>> boundary cross, exactly because other libraries can issue syscalls
>> directly.  The fact that libc is not the only one issuing syscalls is
>> the exact reason we need something more complex than a few seccomp
>> filters.
>
> I don't think this is correct. Rather than listing all the host
> library code ranges to allow, you just list all the guest Windows code
> ranges to intercept. Wine knows them by virtue of being the loader for
> them. This all seems really easy to do with seccomp with a very small
> filter.

The Windows code is not completely loaded at initialization time.  It
also has dynamic libraries loaded later.  yes, wine knows the memory
regions, but there is no guarantee there is a small number of segments
or that the full picture is known at any given moment.

>> > But I'm skeptical that doing any new kernel-side logic for tagging is
>> > needed. Seccomp already lets you filter on instruction pointer so you
>> > can install filters that will trigger user_notif just for guest code,
>> > then let you execute the emulation in the watcher thread and skip the
>> > actual syscall in the watched thread.
>> 
>> As I mentioned, we can check IP in seccomp and write filters.  But this
>> has two problems:
>> 
>> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
>> no maps and a very limited instruction set.  We need to generate
>> boundary checks for each memory segment.  The filter becomes very large
>> very quickly and becomes an observable bottleneck.
>
> This sounds like you're doing something wrong. Range checking is O(log
> n) and n cannot be large enough to make log n significant. If you do
> it with a linear search rather than 

Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Rich Felker
On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
> Rich Felker  writes:
> 
> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via 
> > Libc-alpha wrote:
> 
> [...]
> 
> >
> > SIGSYS (or signal handling in general) is not the right way to do
> > this. It has all the same problems that came up in seccomp filtering
> > with SIGSYS, and which were solved by user_notif mode (running the
> > interception in a separate thread rather than an async context
> > interrupting the syscall). In fact I wouldn't be surprised if what you
> > want can already be done with reasonable efficiency using seccomp
> > user_notif.
> 
> Hi Rich,
> 
> User_notif was raised in the kernel discussion and we had experimented
> with it, but the latency of user_notif is even worse than what we can do
> right now with other seccomp actions.

Is there a compelling argument that the latency matters here? What
syscalls are windows binaries making like this? Is there a reason you
can't do something like intercepting the syscall with seccomp the
first time it happens, then rewriting the code not to use a direct
syscall on future invocations?

> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
> return to a userspace thunk, but the understanding among Wine developers
> is that SIGSYS is enough for their emulation needs.

It might work for Wine needs, if Wine can guarantee it will never be
running code with signals blocked and some other constraints, but then
you end up with a mechanism that's designed just for Wine and that
will have gratuitous reasons it's not usable elsewhere. That does not
seem appropriate for inclusion in kernel.

> > The default-intercept and excepting libc code segment is also bogus,
> > and will break stuff, including vdso syscall mechanism on i386 and any
> > code outside libc that makes its own syscalls from asm. If you need to
> > tag regions to control interception, it should be tagging the emulated
> > Windows guest code, which is bounded and you have full control over,
> > rather than the host code, which is unbounded and includes any
> > libraries that get linked indirectly by Wine.
> 
> The vdso trampoline, for the architectures that have it, is solved by
> the kernel implementation, who makes sure that region is allowed.

I guess that works but it's ugly and assumes particular policy goals
matching Wine's rather than being a general mechanism.

> The Linux code is not bounded, but the dispatcher region main goal is to
> support trampolines outside of the vdso case. The correct userspace
> implementation requires flipping the selector on any Windows/Linux code
> boundary cross, exactly because other libraries can issue syscalls
> directly.  The fact that libc is not the only one issuing syscalls is
> the exact reason we need something more complex than a few seccomp
> filters.

I don't think this is correct. Rather than listing all the host
library code ranges to allow, you just list all the guest Windows code
ranges to intercept. Wine knows them by virtue of being the loader for
them. This all seems really easy to do with seccomp with a very small
filter.

> > But I'm skeptical that doing any new kernel-side logic for tagging is
> > needed. Seccomp already lets you filter on instruction pointer so you
> > can install filters that will trigger user_notif just for guest code,
> > then let you execute the emulation in the watcher thread and skip the
> > actual syscall in the watched thread.
> 
> As I mentioned, we can check IP in seccomp and write filters.  But this
> has two problems:
> 
> 1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
> no maps and a very limited instruction set.  We need to generate
> boundary checks for each memory segment.  The filter becomes very large
> very quickly and becomes an observable bottleneck.

This sounds like you're doing something wrong. Range checking is O(log
n) and n cannot be large enough to make log n significant. If you do
it with a linear search rather than binary then of course it's slow.
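A minimal sketch of the binary-search range check described above, assuming the guest code ranges are kept as a sorted list of non-overlapping [start, end) pairs. The names `build_table` and `in_guest_code` and the example addresses are made up for illustration. This is plain Python, not a seccomp filter: a cBPF program has no loops, so the same search would have to be unrolled into a tree of forward jumps.

```python
from bisect import bisect_right


def build_table(ranges):
    """Sort non-overlapping [start, end) ranges into two parallel lists."""
    ordered = sorted(ranges)
    starts = [start for start, _ in ordered]
    ends = [end for _, end in ordered]
    return starts, ends


def in_guest_code(ip, starts, ends):
    """Return True if ip lies inside one of the ranges (O(log n) lookup)."""
    # Index of the last range whose start is <= ip, or -1 if there is none.
    i = bisect_right(starts, ip) - 1
    return i >= 0 and ip < ends[i]


# Hypothetical guest code ranges, deliberately listed out of order.
guest_ranges = [
    (0x140000000, 0x140200000),
    (0x10000000, 0x10080000),
    (0x7FF000000000, 0x7FF000400000),
]
starts, ends = build_table(guest_ranges)
```

With n ranges the lookup needs about log2(n) comparisons per syscall, so even a few thousand ranges cost on the order of a dozen comparisons.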

> 2) Seccomp filters cannot be removed.  And we'd need to update them
> frequently.

What are the updating requirements?

I'm not sure if Windows code is properly PIC or not, but if it is,
then you just do your own address assignment in a single huge range
(first allocated with PROT_NONE, then MAP_FIXED over top of it) so
that a single static range check suffices.
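A rough sketch of the reserve-then-overlay idea above, assuming a 64-bit Linux host and calling mmap() from libc through ctypes. The helper names (`reserve_arena`, `map_fixed_inside`, `in_arena`) and the arena size are hypothetical, and the MAP_FIXED value is the Linux one, hard-coded here as an assumption. It only demonstrates the address assignment, not how Wine actually loads PE images.

```python
import ctypes
import mmap
import os

# Bind libc's mmap/munmap; CDLL(None) resolves symbols from the running process.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

PROT_NONE = 0
MAP_FIXED = 0x10  # assumption: Linux value of MAP_FIXED
MAP_FAILED = ctypes.c_void_p(-1).value


def _checked(addr):
    """Raise OSError if mmap() reported failure, else return the address."""
    if addr is None or addr == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return addr


def reserve_arena(size):
    """Reserve address space only: PROT_NONE, so nothing is accessible yet."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    return _checked(libc.mmap(None, size, PROT_NONE, flags, -1, 0))


def map_fixed_inside(base, size, offset, length):
    """Place a usable mapping at a chosen offset inside the reserved arena."""
    if offset % mmap.PAGESIZE or offset + length > size:
        raise ValueError("mapping must be page-aligned and fit in the arena")
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_FIXED
    prot = mmap.PROT_READ | mmap.PROT_WRITE
    return _checked(libc.mmap(base + offset, length, prot, flags, -1, 0))


def in_arena(ip, base, size):
    """The single static range check: is ip inside the reserved arena?"""
    return base <= ip < base + size
```

Because every guest mapping is placed inside the one reserved range, the check never needs updating as modules are loaded. As the replies in this thread point out, this only helps if the guest code can actually be placed at addresses of the loader's choosing.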

Rich


Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Gabriel Krisman Bertazi
Rich Felker  writes:

> On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via 
> Libc-alpha wrote:

[...]

>
> SIGSYS (or signal handling in general) is not the right way to do
> this. It has all the same problems that came up in seccomp filtering
> with SIGSYS, and which were solved by user_notif mode (running the
> interception in a separate thread rather than an async context
> interrupting the syscall). In fact I wouldn't be surprised if what you
> want can already be done with reasonable efficiency using seccomp
> user_notif.

Hi Rich,

User_notif was raised in the kernel discussion and we had experimented
with it, but the latency of user_notif is even worse than what we can do
right now with other seccomp actions.

Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
return to a userspace thunk, but the understanding among Wine developers
is that SIGSYS is enough for their emulation needs.

> The default-intercept and excepting libc code segment is also bogus,
> and will break stuff, including vdso syscall mechanism on i386 and any
> code outside libc that makes its own syscalls from asm. If you need to
> tag regions to control interception, it should be tagging the emulated
> Windows guest code, which is bounded and you have full control over,
> rather than the host code, which is unbounded and includes any
> libraries that get linked indirectly by Wine.

The vdso trampoline, for the architectures that have it, is handled by
the kernel implementation, which makes sure that region is allowed.

The Linux code is not bounded, but the dispatcher region's main goal is to
support trampolines outside of the vdso case. The correct userspace
implementation requires flipping the selector on every Windows/Linux code
boundary crossing, exactly because other libraries can issue syscalls
directly.  The fact that libc is not the only one issuing syscalls is
the exact reason we need something more complex than a few seccomp
filters.

Flipping the selector on every boundary crossing is fine for performance,
since we don't go into the kernel.  But if we can avoid checking it from
kernelspace, that's an optimization, which is what I meant by the
dispatcher region allowing more parts of the glibc code.  It is not
strictly necessary for correctness.
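As a rough illustration of the selector and dispatcher-region logic discussed here, the following is a toy model in Python, not the kernel implementation. The names `SELECTOR_ALLOW`, `SELECTOR_BLOCK`, `ThreadState` and `syscall_outcome` are hypothetical, as are the addresses. It assumes the per-thread decision is: syscalls issued from inside the dispatcher region always execute, and anything else is governed by the selector.

```python
from dataclasses import dataclass

# Hypothetical selector values, chosen for this sketch only.
SELECTOR_ALLOW = 0
SELECTOR_BLOCK = 1


@dataclass
class ThreadState:
    selector: int          # flipped from userspace, no syscall required
    dispatcher_start: int  # start of the always-allowed region
    dispatcher_len: int    # length of that region in bytes


def syscall_outcome(state, ip):
    """Decide what happens to a syscall issued from instruction pointer ip."""
    end = state.dispatcher_start + state.dispatcher_len
    if state.dispatcher_start <= ip < end:
        # Dispatcher region: always executed, regardless of the selector.
        return "execute"
    if state.selector == SELECTOR_BLOCK:
        # Selector set: the syscall is turned into a SIGSYS for emulation.
        return "SIGSYS"
    return "execute"


# Example: about to run Windows code, so the selector is set to block.
state = ThreadState(SELECTOR_BLOCK, dispatcher_start=0x7F0000000000,
                    dispatcher_len=0x200000)
```

In this model, flipping `state.selector` back to `SELECTOR_ALLOW` before calling into host libraries is what keeps their direct syscalls working, while the region check only saves the kernel from consulting the selector for code inside it.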

I still don't think anything is broken here.

> But I'm skeptical that doing any new kernel-side logic for tagging is
> needed. Seccomp already lets you filter on instruction pointer so you
> can install filters that will trigger user_notif just for guest code,
> then let you execute the emulation in the watcher thread and skip the
> actual syscall in the watched thread.

As I mentioned, we can check IP in seccomp and write filters.  But this
has two problems:

1) Performance.  seccomp filters use cBPF which means 32bit comparisons,
no maps and a very limited instruction set.  We need to generate
boundary checks for each memory segment.  The filter becomes very large
very quickly and becomes an observable bottleneck.

2) Seccomp filters cannot be removed.  And we'd need to update them
frequently.
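To make the 32-bit comparison point in 1) concrete, here is a small sketch of how a single 64-bit boundary check decomposes when only 32-bit comparisons are available. It is plain Python standing in for filter logic, with hypothetical helper names (`split64`, `ge64`, `ip_in_segment`); it assumes the instruction pointer is read as a high and a low 32-bit word.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit value into (high, low) 32-bit words."""
    return (value >> 32) & MASK32, value & MASK32


def ge64(ip_hi, ip_lo, bound):
    """ip >= bound, using only 32-bit comparisons."""
    b_hi, b_lo = split64(bound)
    if ip_hi != b_hi:
        # The high words differ, so they decide the comparison alone.
        return ip_hi > b_hi
    # Same high word: fall back to comparing the low words.
    return ip_lo >= b_lo


def ip_in_segment(ip, start, end):
    """start <= ip < end, built from 32-bit comparisons on both bounds."""
    ip_hi, ip_lo = split64(ip)
    return ge64(ip_hi, ip_lo, start) and not ge64(ip_hi, ip_lo, end)
```

Each segment therefore costs several comparisons and jumps rather than two, which is part of why a filter with one check per segment grows quickly.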

-- 
Gabriel Krisman Bertazi


Re: Kernel prctl feature for syscall interception and emulation

2020-11-19 Thread Rich Felker
On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via 
Libc-alpha wrote:
> Hi,
> 
> I'm proposing a kernel patch for a feature I'm calling Syscall User
> Dispatch (SUD).  It is a mechanism to efficiently redirect system calls
> of only part of a binary back to userspace to be emulated by a
> compatibility layer.  The patchset is close to being accepted, but
> Florian suggested the feature might pose some constraints on glibc, and
> requested I raise the discussion here.
> 
> The problem I am trying to solve is that modern Windows games running
> over Wine are issuing Windows system calls directly from the Windows
> code, without going through the "WinAPI", which doesn't give Wine a
> chance to emulate the library calls and implement the behavior.  As a
> result, Windows syscalls reach the Linux kernel, and the kernel has
> no context to differentiate them from native syscalls coming from the
> Wine side, since it cannot trust the ABI, not even syscall numbers to be
> something sane.  Historically, Windows applications were very respectful
> of the WinAPI, not bypassing it, but we are seeing modern applications
> like games doing it more often for reasons, I believe, of DRM.
> 
> It is worth mentioning that, by design, Wine and the Windows application
> run on the same process space, so we really cannot just filter specific
> threads or the entire application. We need some kind of filter executed
> on each system call.
> 
> Now, the obvious way to solve this problem would be cBPF filtering
> memory regions, through Seccomp.  The main problem with that approach is
> the performance of executing a large cBPF filter.  The goal is to run
> games, and we observed the Seccomp filter become a bottleneck, since we
> have many, many memory areas that need to be checked by cBPF.  In
> addition, seccomp, as a security mechanism, doesn't support some filter
> update operations, like removing them.  Other approaches were
> explored, like making a new mode out of seccomp, but the kernel
> community preferred to make it a separate, self-contained mechanism.
> Other solutions, like (live) patching the Windows application are out
> of question, as they would trip DRM and anti-cheat protection
> mechanisms.
> 
> The SUD interface I proposed to the kernel community is self-contained
> and exposed as a prctl option.  It lets userspace define a switch
> variable per-thread that, when set, will raise a SIGSYS for any syscall
> attempted.  The idea is that Wine can just flip this switch efficiently
> before delivering control to the Windows portions of the binary, and
> flip it back off when it needs to execute native syscalls.  It is
> important for us that the switch flip doesn't require a syscall, for
> performance reasons.  The interface also lets userspace define a
> "dispatcher region" from where any syscalls are always executed,
> regardless of the selector variable.  This is important for the return
> of the SIGSYS directly to a Windows segment, where we need to execute
> the signal return trampoline with the selector blocked.  Ideally, Wine
> would simply define this dispatcher region as the entire libc code
> segment, and just use the selector to safe-guard against Linux libraries
> issuing syscalls by themselves (they exist).
> 
> I think my questions to libc are: what are the constraints, if any, that
> libc would face with this new interface?  I expected this to be
> completely invisible to libc.  In addition, are there any problems you
> foresee with the current interface?
> 
> Finally, I don't think it makes sense to bother you immediately with
> the kernel implementation patches, but if you want to see them,
> they are archived in the link below.  I can also share them directly on
> this ML if you request it.
> 
>   https://lkml.org/lkml/2020/11/17/2347
> 
> Nevertheless, I think it is useful to share the final patch, that has
> the in-tree documentation for the interface, which I inlined in this
> message.

SIGSYS (or signal handling in general) is not the right way to do
this. It has all the same problems that came up in seccomp filtering
with SIGSYS, and which were solved by user_notif mode (running the
interception in a separate thread rather than an async context
interrupting the syscall). In fact I wouldn't be surprised if what you
want can already be done with reasonable efficiency using seccomp
user_notif.

The default-intercept and excepting libc code segment is also bogus,
and will break stuff, including vdso syscall mechanism on i386 and any
code outside libc that makes its own syscalls from asm. If you need to
tag regions to control interception, it should be tagging the emulated
Windows guest code, which is bounded and you have full control over,
rather than the host code, which is unbounded and includes any
libraries that get linked indirectly by Wine. But I'm skeptical that
doing any new kernel-side logic for tagging is needed. Seccomp already
lets you filter on instruction