Re: [PATCH RFC v3 00/10] extensible syscalls: CHECK_FIELDS to allow for easier feature detection

2024-10-09 Thread Florian Weimer
* Aleksa Sarai:

> This is something that I've been thinking about for a while. We had a
> discussion at LPC 2020 about this[1] but the proposals suggested there
> never materialised.
>
> In short, it is quite difficult for userspace to detect the feature
> capability of syscalls at runtime. This is something a lot of programs
> want to do, but they are forced to create elaborate scenarios to try to
> figure out if a feature is supported without causing damage to the
> system. For the vast majority of cases, each individual feature also
> needs to be tested individually (because syscall results are
> all-or-nothing), so testing even a single syscall's feature set can
> easily inflate the startup time of programs.
>
> This patchset implements the fairly minimal design I proposed in this
> talk[2] and in some old LKML threads (though I can't find the exact
> references ATM). The general flow looks like:

By the way, I have recently tried to document things from a glibc
perspective (which is a bit broader because we also have purely
userspace types):

  [PATCH RFC] manual: Document how types change
  


(This patch has not yet been reviewed.)



Re: [RFC PATCH 0/3] introduce PIDFD_SELF

2024-09-30 Thread Florian Weimer
* Lorenzo Stoakes:

> If you wish to utilise a pidfd interface to refer to the current process
> (from the point of view of userland - from the kernel point of view - the
> thread group leader), it is rather cumbersome, requiring something like:
>
>   int pidfd = pidfd_open(getpid(), 0);
>
>   ...
>
>   close(pidfd);
>
> Or the equivalent call opening /proc/self. It is more convenient to use a
> sentinel value to indicate to an interface that accepts a pidfd that we
> simply wish to refer to the current process.

The descriptor will refer to the current thread, not process, right?

The distinction matters for pidfd_getfd if a process contains multiple
threads with different file descriptor tables, and probably for
pidfd_send_signal as well.

Thanks,
Florian




Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Florian Weimer
* Jann Horn:

> On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer  wrote:
>>
>> * Andrei Vagin:
>>
>> > We already have process_vm_readv and process_vm_writev to read and write
>> > to a process memory faster than we can do this with ptrace. And now it
>> > is time for process_vm_exec that allows executing code in an address
>> > space of another process. We can do this with ptrace but it is much
>> > slower.
>> >
>> > = Use-cases =
>>
>> We also have something vaguely related within the same address space: running
>> code on another thread, without modifying its stack, while it has signal
>> handlers blocked, and without causing system calls to fail with EINTR.
>> This can be used to implement certain kinds of memory barriers.
>
> That's what the membarrier() syscall is for, right? Unless you don't
> want to register all threads for expedited membarrier use?

membarrier is not sufficiently powerful for revoking biased locks, for
example.

For the EINTR issue, <https://github.com/golang/go/issues/38836> is an
example.  I believe CIFS has since seen a few fixes (after someone
reported that tar on CIFS wouldn't work because SIGCHLD caused
utimensat to fail, and there isn't even a signal handler for SIGCHLD!),
but the time it took to get to this point doesn't give me confidence
that it is safe to send signals to a thread that is running unknown
code.

But as you explained regarding the set*id broadcast, it seems that if we
had this run-on-another-thread functionality, we would likely encounter
issues similar to those with SA_RESTART.  We don't see the issue with
set*id today because it's a rare operation, and multi-threaded file
servers that need to change credentials frequently opt out of the set*id
broadcast anyway.  (What I have in mind is a future world where any
printf call, any malloc call, can trigger such a broadcast.)

The cross-VM CRIU scenario would probably be somewhere in between (not
quite the printf/malloc level, but more frequent than set*id).

Thanks,
Florian



Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-04-14 Thread Florian Weimer
* Borislav Petkov:

> On Mon, Apr 12, 2021 at 10:30:23PM +, Bae, Chang Seok wrote:
>> On Mar 26, 2021, at 03:30, Borislav Petkov  wrote:
>> > On Thu, Mar 25, 2021 at 09:56:53PM -0700, Andy Lutomirski wrote:
>> >> We really ought to have a SIGSIGFAIL signal that's sent, double-fault
>> >> style, when we fail to send a signal.
>> > 
>> > Yeap, we should be able to tell userspace that we couldn't send a
>> > signal, hohumm.
>> 
>> Hi Boris,
>> 
>> Let me clarify some details as preparing to include this in a revision.
>> 
>> So, IIUC, a number needs to be assigned for this new SIGFAIL. At a glance,
>> not sure which one to pick there in signal.h -- 1-31 fully occupied and the
>> rest for 33 different real-time signals.
>> 
>> Also, perhaps, force_sig(SIGFAIL) here, instead of return -1 -- to die with
>> SIGSEGV.
>
> I think this needs to be decided together with userspace people so that
> they can act accordingly and whether it even makes sense to them.
>
> Florian, any suggestions?

Is this discussion about better behavior (at least diagnostics) for
existing applications, without any code changes?  Or an alternative
programming model?

Does noavx512 actually reduce the XSAVE size to AVX2 levels?  Or would
you need noxsave?

One possibility is that the sigaltstack size check prevents
applications from running which work just fine today, because all they
do is install a stack overflow handler, and the stack overflow never
actually happens.  So if sigaltstack fails and the application checks
the result of the system call, it probably won't run at all.  Shifting
the diagnostic to the point where the signal would have to be delivered
is perhaps the only thing that can be done.

As for SIGFAIL in particular, I don't think there are any leftover
signal numbers.  It would need a prctl to assign the signal number, and
I'm not sure if there is a useful programming model because signals do
not really compose well even today.  SIGFAIL adds another point where
libraries need to collaborate, and we do not have a mechanism for that.
(This is about what Rich Felker termed “library-safe code”, proper
maintenance of process-wide resources such as the current directory.)

Thanks,
Florian



Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

2021-04-14 Thread Florian Weimer
* Andrei Vagin:

> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

We also have something vaguely related within the same address space: running
code on another thread, without modifying its stack, while it has signal
handlers blocked, and without causing system calls to fail with EINTR.
This can be used to implement certain kinds of memory barriers.  It is
also necessary to implement set*id with POSIX semantics in userspace.
(Linux only changes the current thread credentials, POSIX requires
process-wide changes.)  We currently use a signal for set*id, but it has
issues (it can be blocked, the signal could come from somewhere, etc.).
We can't use signals for barriers because of the EINTR issue, and
because the signal context is stored on the stack.

Thanks,
Florian



Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Florian Weimer
* Borislav Petkov:

> On Mon, Apr 12, 2021 at 04:19:29PM +0200, Florian Weimer wrote:
>> Maybe we could have done this in 2016 when I reported this for the first
>> time.  Now it is too late, as more and more software is using
>> CPUID-based detection for AVX-512.
>
> So as I said on another mail today, I don't think a library should rely
> solely on CPUID-based detection of features especially if those features
> need kernel support too. IOW, it should ask whether the kernel can
> handle those too, first.

Yes, that's why we have the XGETBV handshake.  I was imprecise.  It's
CPUID + XGETBV of course.  Or even AT_HWCAP2 (for FSGSBASE).

> And the CPUID-faulting thing would solve stuff like that because then
> the kernel can *actually* get involved into answering something where it
> has a say in, too.

But why wouldn't we use a syscall or an entry in the auxiliary vector
for that?  Why fault a potentially performance-critical instruction?

Thanks,
Florian



Re: static_branch/jump_label vs branch merging

2021-04-09 Thread Florian Weimer
* Ard Biesheuvel:

> Wouldn't that require the compiler to interpret the contents of the
> asm() block?

Yes and no.  It would require proper toolchain support, so in this case
a new ELF relocation type, with compiler, assembler, and linker support
to generate those relocations and process them.  As far as I understand
it, the kernel doesn't do things this way.

Thanks,
Florian



Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Florian Weimer
* Len Brown via Libc-alpha:

>> In particular, the library may use instructions that main() doesn't know 
>> exist.
>
> And so I'll ask my question another way.
>
> How is it okay to change the value of XCR0 during the run time of a
> program?
>
> I submit that it is not, and that is a deal-killer for a
> request/release API.
>
> eg.  main() doesn't know that the math library wants to use AMX, and
> neither does the threading library.  So main() doesn't know to call
> the API before either library is invoked.  The threading library
> starts up and creates user-space threads based on the initial value
> from XCR0.  Then the math library calls the API, which adds bits to
> XCR0, and then the user-space context switch in the threading
> library corrupts data because the new XCR0 size doesn't match the
> initial size.

I agree that this doesn't quite work.  (Today, it's not the thread
library, but the glibc dynamic loader trampoline.)

I disagree that CPU feature enablement has been a failure.  I think we
are pretty good at enabling new CPU features on older operating
systems, not just bleeding edge mainline kernels.  Part of that is
that anything but the kernel stays out of the way, and most features
are available directly via inline assembly (you can even use .byte
hacks if you want).  There is no need to switch to new userspace
libraries, compile out-of-tree kernel drivers that have specific
firmware requirements, and so on.

If the operations that need a huge context can be made idempotent,
with periodic checkpoints, it might be possible to avoid saving the
context completely by some rseq-like construct.


Re: Why does glibc use AVX-512?

2021-03-26 Thread Florian Weimer
* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 1:35 PM Florian Weimer  wrote:
>>
>> * Andy Lutomirski:
>>
>> > On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
>> >>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>> >>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>> >
>> > It's worth noting that recent microcode updates have make RTM
>> > considerably less likely to actually work on many parts.  It's
>> > possible you should just disable it. :(
>>
>> Sorry, I'm not sure who should disable it.
>>
>> Let me sum up the situation:
>>
>> We have a request for a performance enhancement in glibc, so that
>> applications can use it on server parts where RTM actually works.
>>
>> For CPUs that support AVX-512, we may be able to meet that with a
>> change that uses the new 256-bit registers, to avoid the %xmm
>> transition penalty.  (This is the easy case, hopefully—there shouldn't
>> be any frequency issues associated with that, and if the kernel
>> doesn't optimize the context switch today, that's a nonissue as well.)
>
> I would make sure that the transition penalty actually works the way
> you think it does.  My general experience with the transition
> penalties is that the CPU is rather more aggressive about penalizing
> you than makes sense.

Do you mean the frequency/thermal budget?

I mean the immense slowdown you get if you use %xmm registers after
their %ymm counterparts (doesn't have to be %zmm, that issue is
present starting with AVX) and you have not issued VZEROALL or
VZEROUPPER between the two uses.

It's a bit like EMMS, I guess, only that you don't get corruption, just
really poor performance.

>> For CPUs that do not support AVX-512 but support RTM (and AVX2), we
>> need a dynamic run-time check whether the string function is invoked
>> in a transaction.  In that case, we need to use VZEROALL instead of
>> VZEROUPPER.  (It's apparently too costly to issue VZEROALL
>> unconditionally.)
>
> So VZEROALL works in a transaction and VZEROUPPER doesn't?  That's bizarre.

Apparently yes.

>> All this needs to work transparently without user intervention.  We
>> cannot require firmware upgrades to fix the incorrect RTM reporting
>> issue (the bug I referenced).  I think we can require software updates
>> which tell glibc when to use RTM-enabled string functions if the
>> dynamic selection does not work (either for performance reasons, or
>> because of the RTM reporting bug).
>>
>> I want to avoid a situation where one in eight processes fail to work
>> correctly because the CPUID checks ran on CPU 0, where RTM is reported
>> as available, and then we trap when executing XTEST on other CPUs.
>
> What kind of system has that problem?

It's a standard laptop after a suspend/resume cycle.  It's either a
kernel or firmware bug.

> If RTM reports as available, then it should work in the sense of not
> trapping.  (There is no guarantee that transactions will *ever*
> complete, and that part is no joke.)

XTEST doesn't abort transactions, but it traps without RTM support.
If CPU0 has RTM support and we enable XTEST use in glibc based on that
(because the startup code runs on CPU0), then the XTEST instruction
must not trap when running on other CPUs.

Currently, we do not use RTM for anything in glibc by default, even if
it is available according to CPUID.  (There are ways to opt in, unless
the CPU is on the disallow list due to the early Haswell bug.)  I'm
worried that if we start executing XTEST on all CPUs that indicate RTM
support, we will see lots of weird issues, along the lines of bug 27398.


Re: Why does glibc use AVX-512?

2021-03-26 Thread Florian Weimer
* Andy Lutomirski:

> On Fri, Mar 26, 2021 at 12:34 PM Florian Weimer  wrote:
>>   x86: Sporadic failures in tst-cpu-features-cpuinfo
>>   <https://sourceware.org/bugzilla/show_bug.cgi?id=27398#c3>
>
> It's worth noting that recent microcode updates have make RTM
> considerably less likely to actually work on many parts.  It's
> possible you should just disable it. :(

Sorry, I'm not sure who should disable it.

Let me sum up the situation:

We have a request for a performance enhancement in glibc, so that
applications can use it on server parts where RTM actually works.

For CPUs that support AVX-512, we may be able to meet that with a
change that uses the new 256-bit registers, to avoid the %xmm
transition penalty.  (This is the easy case, hopefully—there shouldn't
be any frequency issues associated with that, and if the kernel
doesn't optimize the context switch today, that's a nonissue as well.)

For CPUs that do not support AVX-512 but support RTM (and AVX2), we
need a dynamic run-time check whether the string function is invoked
in a transaction.  In that case, we need to use VZEROALL instead of
VZEROUPPER.  (It's apparently too costly to issue VZEROALL
unconditionally.)

All this needs to work transparently without user intervention.  We
cannot require firmware upgrades to fix the incorrect RTM reporting
issue (the bug I referenced).  I think we can require software updates
which tell glibc when to use RTM-enabled string functions if the
dynamic selection does not work (either for performance reasons, or
because of the RTM reporting bug).

I want to avoid a situation where one in eight processes fail to work
correctly because the CPUID checks ran on CPU 0, where RTM is reported
as available, and then we trap when executing XTEST on other CPUs.


Re: Why does glibc use AVX-512?

2021-03-26 Thread Florian Weimer
* Andy Lutomirski:

>> > AVX-512 cleared, and programs need to explicitly request enablement.
>> > This would allow programs to opt into not saving/restoring across
>> > signals or to save/restore in buffers supplied when the feature is
>> > enabled.
>>
>> Isn't XSAVEOPT already able to handle that?
>>
>
> Yes, but we need a place to put the data, and we need to acknowledge
> that, with the current save-everything-on-signal model, the amount of
> time and memory used is essentially unbounded.  This isn't great.

The size has to have a known upper bound, but the save amount can be
dynamic, right?

How was the old lazy FPU initialization support for i386 implemented?

>> Assuming you can make XSAVEOPT work for you on the kernel side, my
>> instincts tell me that we should have markup for RTM, not for AVX-512.
>> This way, we could avoid use of the AVX-512 registers and keep using
>> VZEROUPPER, without run-time transaction checks, and deal with other
>> idiosyncrasies needed for transaction support that users might
>> encounter once this feature sees more use.  But the VZEROUPPER vs RTM
>> issues is currently stuck in some internal process issue on my end (or
>> two, come to think of it), which I hope to untangle next month.
>
> Can you elaborate on the issue?

This is the bug:

  vzeroupper use in AVX2 multiarch string functions cause HTM aborts 
  

Unfortunately we have a bug (outside of glibc) that makes me wonder if
we can actually roll out RTM transaction checks (or any RTM
instruction) on a large scale:

  x86: Sporadic failures in tst-cpu-features-cpuinfo 
  

The dynamic RTM check might trap due to this bug.  (We have a bit more
information about the nature of the bug, currently missing from
Bugzilla.)

I'm also worried that the new dynamic RTM check in the string
functions has a performance impact.  Due to its nature, it will be
enabled for every program once running on RTM-capable hardware, not
just those that actually use RTM.


Re: Why does glibc use AVX-512?

2021-03-26 Thread Florian Weimer
* Andy Lutomirski-alpha:

> glibc appears to use AVX512F for memcpy by default.  (Unless
> Prefer_ERMS is default-on, but I genuinely can't tell if this is the
> case.  I did some searching.)  The commit adding it refers to a 2016
> email saying that it's 30% on KNL.

As far as I know, glibc only does that on KNL, and there it is
actually beneficial.  The relevant code is:

  /* Since AVX512ER is unique to Xeon Phi, set Prefer_No_VZEROUPPER
 if AVX512ER is available.  Don't use AVX512 to avoid lower CPU
 frequency if AVX512ER isn't available.  */
  if (CPU_FEATURES_CPU_P (cpu_features, AVX512ER))
cpu_features->preferred[index_arch_Prefer_No_VZEROUPPER]
  |= bit_arch_Prefer_No_VZEROUPPER;
  else
cpu_features->preferred[index_arch_Prefer_No_AVX512]
  |= bit_arch_Prefer_No_AVX512;

So it's not just about Prefer_ERMS.

> I think we should seriously consider solutions in which, for new
> tasks, XCR0 has new giant features (e.g. AMX) and possibly even

I think the AMX programming model will be different, yes.

> AVX-512 cleared, and programs need to explicitly request enablement.
> This would allow programs to opt into not saving/restoring across
> signals or to save/restore in buffers supplied when the feature is
> enabled.

Isn't XSAVEOPT already able to handle that?

In glibc, we use XSAVE/XSAVEC for the dynamic loader trampoline, so it
should not needlessly enable AVX-512 state today, while still enabling
AVX-512 calling conventions transparently.

There is a discussion about using the higher (AVX-512-only) %ymm
registers, to avoid the %xmm transition penalty without the need for
VZEROUPPER.  (VZEROUPPER is incompatible with RTM from a performance
point of view.)  That would perhaps negatively impact XSAVEOPT.

Assuming you can make XSAVEOPT work for you on the kernel side, my
instincts tell me that we should have markup for RTM, not for AVX-512.
This way, we could avoid use of the AVX-512 registers and keep using
VZEROUPPER, without run-time transaction checks, and deal with other
idiosyncrasies needed for transaction support that users might
encounter once this feature sees more use.  But the VZEROUPPER vs RTM
issues is currently stuck in some internal process issue on my end (or
two, come to think of it), which I hope to untangle next month.


Re: [PATCH v7 5/6] x86/signal: Detect and prevent an alternate signal stack overflow

2021-03-25 Thread Florian Weimer
* Chang Seok Bae via Libc-alpha:

> On Mar 25, 2021, at 09:20, Borislav Petkov  wrote:
>> 
>> $ gcc tst-minsigstksz-2.c -DMY_MINSIGSTKSZ=3453 -o tst-minsigstksz-2
>> $ ./tst-minsigstksz-2
>> tst-minsigstksz-2: changed byte 50 bytes below configured stack
>> 
>> Whoops.
>> 
>> And the debug print said:
>> 
>> [ 5395.252884] signal: get_sigframe: sp: 0x7f54ec39e7b8, sas_ss_sp:
>> 0x7f54ec39e6ce, sas_ss_size 0xd7d
>> 
>> which tells me that, AFAICT, your check whether we have enough alt stack
>> doesn't seem to work in this case.
>
> Yes, in this case.
>
> tst-minsigstksz-2.c has this code:
>
> static void
> handler (int signo)
> {
>   /* Clear a bit of on-stack memory.  */
>   volatile char buffer[256];
>   for (size_t i = 0; i < sizeof (buffer); ++i)
> buffer[i] = 0;
>   handler_run = 1;
> }
> …
>
>   if (handler_run != 1)
> errx (1, "handler did not run");
>
>   for (void *p = stack_buffer; p < stack_bottom; ++p)
> if (*(unsigned char *) p != 0xCC)
>   errx (1, "changed byte %zd bytes below configured stack\n",
> stack_bottom - p);
> …
>
> I think the message comes from the handler’s overwriting, not from the kernel.
>
> The patch's check is to detect and prevent the kernel-induced overflow --
> whether alt stack enough for signal delivery itself.  The stack is possibly
> not enough for the signal handler's use as the kernel does not know for it.

Ahh, right.  When I wrote the test, I didn't know which turn the
kernel would eventually take, so the test is quite arbitrary.

The glibc dynamic loader uses XSAVE/XSAVEC as well, so you can
probably double the practical stack requirement if lazy binding is in
use and can be triggered from the signal handler.  Estimating stack
sizes is hard.


Re: [PATCH] Document that PF_KTHREAD _is_ ABI

2021-03-20 Thread Florian Weimer
* Alexey Dobriyan:

> Some aren't -- PF_FORKNOEXEC. However it is silly for userspace to query it
> because a program knows if it forked but didn't exec without external help.

Libraries typically lack that knowledge, and may have reasons to
detect forks.  But there are probably better ways than this flag, like
a MADV_WIPEONFORK mapping, or comparing counters in MAP_PRIVATE and
MAP_SHARED mappings.


Re: [PATCH v2] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-03-03 Thread Florian Weimer
* Mathieu Desnoyers:

> This way, the configuration structure can be expanded in the future. The
> rseq ABI structure is by definition fixed-size, so there is no point in
> having its size here.
>
> Florian, did I understand your request correctly, or am I missing your
> point ?

No, the idea was that if the kernel ever supports different rseq ABI
sizes on registration (it could as there's a size argument to the rseq
system call), that needs to be communicated to CRIU, so that it restores
with the right size.

I haven't thought about whether it makes sense to make the ptrace
argument struct extensible.

Thanks,
Florian



Re: [PATCH] ptrace: add PTRACE_GET_RSEQ_CONFIGURATION request

2021-02-23 Thread Florian Weimer
* Piotr Figiel:

> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index 83ee45fa634b..d54cf6b6ce7c 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -102,6 +102,14 @@ struct ptrace_syscall_info {
>   };
>  };
>  
> +#define PTRACE_GET_RSEQ_CONFIGURATION0x420f
> +
> +struct ptrace_rseq_configuration {
> + __u64 rseq_abi_pointer;
> + __u32 signature;
> + __u32 pad;
> +};

The flags and the structure size appear to be missing here.

Thanks,
Florian



LINUX_VERSION_CODE overflow (was: Re: Linux 4.9.256)

2021-02-11 Thread Florian Weimer
* Greg Kroah-Hartman:

> I'm announcing the release of the 4.9.256 kernel.
>
> This, and the 4.4.256 release are a little bit "different" than normal.
>
> This contains only 1 patch, just the version bump from .255 to .256
> which ends up causing the userspace-visible LINUX_VERSION_CODE to
> behave a bit differently than normal due to the "overflow".
>
> With this release, KERNEL_VERSION(4, 9, 256) is the same as
> KERNEL_VERSION(4, 10, 0).
>
> Nothing in the kernel build itself breaks with this change, but given
> that this is a userspace visible change, and some crazy tools (like
> glibc and gcc) have logic that checks the kernel version for different
> reasons, I wanted to do this release as an "empty" release to ensure
> that everything still works properly.

As promised, I looked at this from the glibc perspective.

A dynamically linked glibc reads the LINUX_VERSION_CODE in the ELF note
in the vDSO.

Statically linked binaries use the uname system call and parse the
release field in struct utsname.  If the uname system call fails, there
is also /proc fallback, but I believe that path is unused.

The glibc dynamic linker falls back to uname if the vDSO cannot be
located.

The LINUX_VERSION_CODE format is also used in /etc/ld.so.cache.  This is
difficult to change because a newer ldconfig is supposed to build a
cache that is compatible with older glibc versions (two-way
compatibility).  The information in /etc/ld.so.cache is copied from the
ELF_NOTE_ABI/NT_GNU_ABI_TAG ELF note in the DSOs; the note format is not
subject to overflows because it uses 32-bit values for the component
versions.

glibc uses the current kernel's LINUX_VERSION_CODE for two purposes: for
its own “kernel too old” check (glibc refuses to start in this case),
and to skip loading DSOs which have an ELF_NOTE_ABI/NT_GNU_ABI_TAG that
indicates a higher kernel version than the current kernel.  glibc does
not use LINUX_VERSION_CODE to detect features or activate workarounds
for kernel bugs.

The overflow from 4.9.256 to 4.10.0 means that we might get spurious
passes on these checks.  Worst case, it can happen that if the system
has a DSO in two versions on the library search path, one for kernel
4.10 and one for kernel 4.9 or earlier (in that order), we now load the
4.10 version on a 4.9 kernel.  Previously, loading the 4.10 DSO failed,
and the fallback version for earlier kernels was used.  That would be
real breakage.

Our options in userspace are limited because whatever changes we make to
glibc today are unlikely to reach people running 4.4 or 4.9 kernels
anytime soon, if ever.  Clamping the sublevel field of
LINUX_VERSION_CODE in the vDSO to 255 only benefits dynamically linked
binaries, but it could be that this is sufficient to paper over this
issue.

There's also the question whether these glibc checks are valuable at
all.  It encourages kernel patching to lie about kernel versions, making
diagnostics harder (e.g., reporting 3.10 if it's really a 2.6.32 with
lots of system call backports).  The ELF_NOTE_ABI/NT_GNU_ABI_TAG DSO
selection is known to cause endless problems with Qt, basically the only
large-scale user of this feature.  Perhaps we should remove it, but it
would also break the fallback DSO approach mentioned above.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Linux 4.9.256

2021-02-08 Thread Florian Weimer
* Greg Kroah-Hartman:

> I'm announcing the release of the 4.9.256 kernel.
>
> This, and the 4.4.256 release are a little bit "different" than normal.
>
> This contains only 1 patch, just the version bump from .255 to .256 which
> ends up causing the userspace-visible LINUX_VERSION_CODE to behave a bit
> differently than normal due to the "overflow".
>
> With this release, KERNEL_VERSION(4, 9, 256) is the same as
> KERNEL_VERSION(4, 10, 0).
>
> Nothing in the kernel build itself breaks with this change, but given
> that this is a userspace visible change, and some crazy tools (like
> glibc and gcc) have logic that checks the kernel version for different
> reasons, I wanted to do this release as an "empty" release to ensure
> that everything still works properly.

I'm looking at this from a glibc perspective.  glibc took the
KERNEL_VERSION definition and embedded the bit layout into the
/etc/ld.so.cache, as part of the file format.  Exact impact is still
unclear at this point.

Thanks,
Florian



Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

2021-01-12 Thread Florian Weimer
* Lukas Wunner:

> On Fri, Jan 08, 2021 at 12:02:53PM -0800, Linus Torvalds wrote:
>> I appreciate Arnd pointing out "--std=gnu11", though. What are the
>> actual relevant language improvements?
>> 
>> Variable declarations in for-loops is the only one I can think of. I
>> think that would clean up some code (and some macros), but might not
>> be compelling on its own.
>
> Anonymous structs/unions.  I used to have a use case for that in
> struct efi_dev_path in include/linux/efi.h, but Ard Biesheuvel
> refactored it in a gnu89-compatible way for v5.7 with db8952e7094f.

Aren't those a GNU extension supported since GCC 3.0?

Thanks,
Florian



Re: [PATCH 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-01-11 Thread Florian Weimer
* Suren Baghdasaryan:

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6a660858784b..c2d600386902 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1197,12 +1197,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>   goto release_task;
>   }
>  
> - mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> + /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> + mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
>   if (IS_ERR_OR_NULL(mm)) {
>   ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
>   goto release_task;
>   }

Shouldn't this depend on the requested behavior?  Several operations
directly result in observable changes, and go beyond performance tuning.

Thanks,
Florian



Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

2021-01-07 Thread Florian Weimer
* Theodore Ts'o:

> On Thu, Jan 07, 2021 at 01:37:47PM +, Russell King - ARM Linux admin 
> wrote:
>> > The gcc bugzilla mentions backports into gcc-linaro, but I do not see
>> > them in my git history.
>> 
>> So, do we raise the minimum gcc version for the kernel as a whole to 5.1
>> or just for aarch64?
>
> Russell, Arnd, thanks so much for tracking down the root cause of the
> bug!
>
> I will note that RHEL 7 uses gcc 4.8.  I personally don't have an
> objections to requiring developers using RHEL 7 to have to install a
> more modern gcc (since I use Debian Testing and gcc 10.2.1, myself,
> and gcc 5.1 is so five years ago :-), but I could imagine that being
> considered inconvenient for some.

Actually, RHEL 7 should have the fix (internal bug #1362635, curiously
we encountered it in the *XFS* CRC calculation code back then).

My understanding is that RHEL 7 aarch64 support ceased completely about
a month ago, so that shouldn't be an argument against bumping the
minimum version requirement to 5.1.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH v5] mm: Optional full ASLR for mmap(), mremap(), vdso and stack

2020-12-03 Thread Florian Weimer
* Andy Lutomirski:

> If you want a 4GB allocation to succeed, you can only divide the
> address space into 32k fragments.  Or, a little more precisely, if you
> want a randomly selected 4GB region to be empty, any other allocation
> has a 1/32k chance of being in the way.  (Rough numbers — I’m ignoring
> effects of the beginning and end of the address space, and I’m
> ignoring the size of a potential conflicting allocation.).

I think the probability distribution is way more advantageous than that
because it is unlikely that 32K allocations are all exactly spaced 4 GB
apart.  (And with 32K allocations, you are close to the VMA limit anyway.)

My knowledge of probability theory is quite limited, so I have to rely
on simulations.  But I think you would see a 40 GiB gap somewhere for a
47-bit address space with 32K allocations, most of the time.  Which is
not too bad.

But even with a 47-bit address space and just 500 threads (each with at
least a stack and local heap, randomized independently), simulations
suggest that the largest gap is often just 850 GB.  At that point, you
can't expect to map your NVDIMM (or whatever) in a single mapping
anymore, and you have to code around that.

Not randomizing large allocations and sacrificing one bit of randomness
for small allocations would avoid this issue, though.

(I still expect page walking performance to suffer drastically, with or
without this tweak.  I assume page walking uses the CPU cache hierarchy
today, and with full randomization, accessing the page-table entry at each
level after a TLB miss would result in a data cache miss.  But then, I'm
firmly a software person.)

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH v5] mm: Optional full ASLR for mmap(), mremap(), vdso and stack

2020-12-03 Thread Florian Weimer
* Topi Miettinen:

> +3   Additionally enable full randomization of memory mappings created
> +with mmap(NULL, ...). With 2, the base of the VMA used for such
> +mappings is random, but the mappings are created in predictable
> +places within the VMA and in sequential order. With 3, new VMAs
> +are created to fully randomize the mappings.
> +
> +Also mremap(..., MREMAP_MAYMOVE) will move the mappings even if
> +not necessary and the location of stack and vdso are also
> +randomized.
> +
> +On 32 bit systems this may cause problems due to increased VM
> +fragmentation if the address space gets crowded.

Isn't this a bit of an understatement?  I think you'll have to restrict
this randomization to a subregion of the entire address space, otherwise
the reduction in maximum mapping size due to fragmentation will be a
problem on 64-bit architectures as well (which generally do not support
the full 64 bits for user-space addresses).

> +On all systems, it will reduce performance and increase memory
> +usage due to less efficient use of page tables and inability to
> +merge adjacent VMAs with compatible attributes. In the worst case,
> +additional page table entries of up to 4 pages are created for
> +each mapping, so with small mappings there's considerable penalty.

The number 4 is architecture-specific, right?

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Jann Horn:

> But if you can't tell whether the more modern syscall failed because
> of a seccomp filter, you may be forced to retry with an older syscall
> even on systems where the new syscall works fine, and such a fallback
> may reduce security or reliability if you're trying to use some flags
> that only the new syscall provides for security, or something like
> that. (As a contrived example, imagine being forced to retry any
> tgkill() that fails with EPERM as a tkill() just in case you're
> running under a seccomp filter.)

We have exactly this situation with faccessat2 and faccessat today.
EPERM could mean a reject from a LSM, and we really don't want to do our
broken fallback in this case because it will mask the EPERM error from
the LSM (and the sole purpose of faccessat2 is to get that error).

This is why I was so eager to start using faccessat2 in glibc, and we
are now encountering breakage with container runtimes.  Applications
call faccessat (with a non-zero flags argument) today, and they now get
routed to the faccessat2 entry point, without needing recompilation or
anything like that.

We have the same problem for any new system call, but it's different
this time because it affects 64-bit hosts *and* existing applications.

And as I explained earlier, I want to take this opportunity to get
consensus how to solve this properly, so that we are ready for a new
system call where incorrect fallback would definitely reintroduce a
security issue.  Whether it's that ugly probing sequence, a change to
the OCI specification that gets deployed in a reasonable time frame, or
something else that I haven't thought of—I do not have a very strong
preference, although I lean towards the spec change myself.  But I do
feel that we shouldn't throw in a distro-specific patch to paper over
the current faccessat2 issue and forget about it.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Mark Wielaard:

> For valgrind the issue is statx which we try to use before falling back
> to stat64, fstatat or stat (depending on architecture, not all define
> all of these). The problem with these fallbacks is that under some
> containers (libseccomp versions) they might return EPERM instead of
> ENOSYS. This causes really obscure errors that are really hard to
> diagnose.

The probing sequence I proposed should also work for statx. 8-p

> Don't you have the same issue with glibc for those architectures that
> don't have fstatat or 32bit arches that need 64-bit time_t? And if so,
> how are you working around containers possibly returning EPERM instead
> of ENOSYS?

That's a good point.  I don't think many people run 32-bit containers in
the cloud.  The Y2038 changes in glibc impact 64-bit ports a little, but
mostly on the fringes (e.g., clock_nanosleep vs nanosleep).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Jann Horn:

> +seccomp maintainers/reviewers
> [thread context is at
> https://lore.kernel.org/linux-api/87lfer2c0b@oldenburg2.str.redhat.com/
> ]
>
> On Tue, Nov 24, 2020 at 5:49 PM Christoph Hellwig  wrote:
>> On Tue, Nov 24, 2020 at 03:08:05PM +0100, Mark Wielaard wrote:
>> > For valgrind the issue is statx which we try to use before falling back
>> > to stat64, fstatat or stat (depending on architecture, not all define
>> > all of these). The problem with these fallbacks is that under some
>> > containers (libseccomp versions) they might return EPERM instead of
>> > ENOSYS. This causes really obscure errors that are really hard to
>> > diagnose.
>>
>> So find a way to detect these completely broken container run times
>> and refuse to run under them at all.  After all they've decided to
>> deliberately break the syscall ABI.  (and yes, we gave the the rope
>> to do that with seccomp :().
>
> FWIW, if the consensus is that seccomp filters that return -EPERM by
> default are categorically wrong, I think it should be fairly easy to
> add a check to the seccomp core that detects whether the installed
> filter returns EPERM for some fixed unused syscall number and, if so,
> prints a warning to dmesg or something along those lines...

But that's playing Core Wars, right?  Someone will write a seccomp
filter trying to game that kernel check.  I don't really think it solves
anything until there is consensus what a system call filter should do
with system calls not on the permitted list.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Christoph Hellwig:

> On Tue, Nov 24, 2020 at 03:08:09PM +0100, Florian Weimer wrote:
>> Do you categorically reject the general advice, or specific instances as
>> well?
>
> All of the above.  Really, if people decided to use seccompt to return
> nonsensical error codes we should not work around that in new kernel
> ABIs.

Fair enough, I can work with that.  Thanks.

Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Christoph Hellwig:

> On Tue, Nov 24, 2020 at 01:08:20PM +0100, Florian Weimer wrote:
>> This documents a way to safely use new security-related system calls
>> while preserving compatibility with container runtimes that require
>> insecure emulation (because they filter the system call by default).
>> Admittedly, it is somewhat hackish, but it can be implemented by
>> userspace today, for existing system calls such as faccessat2,
>> without kernel or container runtime changes.
>
> I think this is completely insane.  Tell the OCI folks to fix their
> completely broken specification instead.

Do you categorically reject the general advice, or specific instances as
well?  Like this workaround for faccessat that follows the pattern I
outlined:

<https://sourceware.org/pipermail/libc-alpha/2020-November/119955.html>

I value your feedback and want to make sure I capture it accurately.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Aleksa Sarai:

> As I mentioned in the runc thread[1], this is really down to Docker's
> default policy configuration. The EPERM-everything behaviour in OCI was
> inherited from Docker, and it boils down to not having an additional
> seccomp rule which does ENOSYS for unknown syscall numbers (Docker can
> just add the rule without modifying the OCI runtime-spec -- so it's
> something Docker can fix entirely on their own). I'll prepare a patch
> for Docker this week.

Appreciated, thanks.

> IMHO it's also slightly overkill to change the kernel API design
> guidelines in response to this issue.
>
> [1]: https://github.com/opencontainers/runc/issues/2151

Won't this cause docker to lose OCI compliance?  Or is the compliance
testing not that good?

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
* Christian Brauner:

> I'm sorry but I have some doubts about this new "rule". The idea of
> being able to reliably trigger an error for a system call other then
> EPERM might have merrit in some scenarios but justifying it via a bug in
> a userspace standard is not enough in my opinion.
>
> The solution is to fix the standard to mandate ENOSYS. This is the
> correct error for this exact scenario and standards can be changed.
> I don't think it is the kernel's job to work around a deliberate
> userspace decision to use EPERM and not ENOSYS. The kernel's system call
> design should not be informed by this especially since this is clearly
> not a kernel bug.
>
> Apart from that I have doubts that this is in any shape or form
> enforceable. Not just because in principle there might be system calls
> that only return EPERM on error but also because this requirement feels
> arbitrary and I doubt developers will feel bound by it or people will
> check for it.
>
>> +
>> +If a system call has such error behavior, upon encountering an
>> +``EPERM`` error, userspace applications can perform further
>> +invocations of the same system call to check if the ``EPERM`` error
>> +persists for those known error conditions.  If those also fail with
>> +``EPERM``, that likely means that the original ``EPERM`` error was the
>> +result of a seccomp filter, and should be treated like ``ENOSYS``
>
> I think that this "approach" alone should illustrate that this is the
> wrong way to approach this. It's hacky and requires excercising a system
> call multiple times just to find out whether or not it is supported.
> The only application that would possibly do this is probably glibc.
> This seems to be the complete wrong way of solving this problem.

Thank you for your feedback.  I appreciate it.

I agree that the standard should mandate ENOSYS, and I've just proposed
a specification change here:

  

However, such a change may take some time to implement.

Meanwhile, we have the problem today with glibc that it wants to use the
faccessat2 system call but it can't.  I've been told that it would make
glibc incompatible with the public cloud and Docker.  The best solution
I could come up with is this awkward probing sequence.  (Just
checking for the zero flags argument is not sufficient because systemd
calls fchmodat with AT_SYMLINK_NOFOLLOW.)

I do not wish to put the probing sequence into glibc (upstream or
downstream) unless it is blessed to some degree by kernel developers.  I
consider it quite ugly and would prefer if more of us share the blame.

We will face the same issue again with fchmodat2 (or fchmodat4 if that's
what its name is going to be).  And we have been lucky in recent times
that we didn't need a new system call to fix a security vulnerability in an
existing system call in wide use by userspace (although faccessat2 comes
rather close because it replaces a userspace permission check
approximation with a proper kernel check).  The seccomp situation means
that we can't, reliably, and the probing hack seems to be the way out.
That's another reason for not just putting in the probing sequence
quietly and be done with it: I'd like to discuss this aspect in the
open, before we need it as part of a fix for some embargoed security
vulnerability.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



[PATCH] syscalls: Document OCI seccomp filter interactions & workaround

2020-11-24 Thread Florian Weimer
This documents a way to safely use new security-related system calls
while preserving compatibility with container runtimes that require
insecure emulation (because they filter the system call by default).
Admittedly, it is somewhat hackish, but it can be implemented by
userspace today, for existing system calls such as faccessat2,
without kernel or container runtime changes.

Signed-off-by: Florian Weimer 

---
 Documentation/process/adding-syscalls.rst | 37 +++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/process/adding-syscalls.rst 
b/Documentation/process/adding-syscalls.rst
index a3ecb236576c..7d1e578a1df1 100644
--- a/Documentation/process/adding-syscalls.rst
+++ b/Documentation/process/adding-syscalls.rst
@@ -436,6 +436,40 @@ simulates registers etc).  Fixing this is as simple as 
adding a #define to
 
 #define stub_xyzzy sys_xyzzy
 
+Container Compatibility and seccomp
+-----------------------------------
+
+The Linux Foundation Open Container Initiative Runtime Specification
+requires that by default, implementations install seccomp system call
+filters which cause system calls to fail with ``EPERM``.  As a result,
+all new system calls in such containers fail with ``EPERM`` instead of
+``ENOSYS``.  This design is problematic because ``EPERM`` is a
+legitimate system call result which should not trigger fallback to a
+userspace emulation, particularly for security-related system calls.
+(With ``ENOSYS``, it is clear that a fallback implementation has to be
+used to maintain compatibility with older kernels or container
+runtimes.)
+
+New system calls should therefore provide a way to reliably trigger an
+error distinct from ``EPERM``, without any side effects.  Some ways to
+achieve that are:
+
+ - ``EBADF`` for the invalid file descriptor -1
+ - ``EFAULT`` for a null pointer
+ - ``EINVAL`` for a contradictory set of flags that will remain invalid
+   in the future
+
+If a system call has such error behavior, upon encountering an
+``EPERM`` error, userspace applications can perform further
+invocations of the same system call to check if the ``EPERM`` error
+persists for those known error conditions.  If those also fail with
+``EPERM``, that likely means that the original ``EPERM`` error was the
+result of a seccomp filter, and should be treated like ``ENOSYS``
+(e.g., trigger an alternative fallback implementation).  If those
+probing system calls do not fail with ``EPERM``, the error likely came
+from a real implementation, and should be reported to the caller
+directly, without resorting to ``ENOSYS``-style fallback.
+
 
 Other Details
 -------------
@@ -575,3 +609,6 @@ References and Sources
  - Recommendation from Linus Torvalds that x32 system calls should prefer
compatibility with 64-bit versions rather than 32-bit versions:
https://lkml.org/lkml/2011/8/31/244
+ - Linux Configuration section of the Open Container Initiative
+   Runtime Specification:
+   https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md

-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH] lseek.2: SYNOPSIS: Use correct types

2020-11-22 Thread Florian Weimer
* Alejandro Colomar:

> The Linux kernel uses 'unsigned int' instead of 'int' for 'fd' and
> 'whence'.  As glibc provides no wrapper, use the same types the
> kernel uses.

lseek is a POSIX interface, and glibc provides it.  POSIX uses int for
file descriptors (and the whence parameter in case of lseek).

The llseek system call is a different matter, that's indeed
Linux-specific.


Re: violating function pointer signature

2020-11-18 Thread Florian Weimer
* Segher Boessenkool:

> On Wed, Nov 18, 2020 at 12:17:30PM -0500, Steven Rostedt wrote:
>> I could change the stub from (void) to () if that would be better.
>
> Don't?  In a function definition they mean exactly the same thing (and
> the kernel uses (void) everywhere else, which many people find clearer).

And I think () functions expect a caller-provided parameter save
area on powerpc64le, while (void) functions do not.  It does not
matter for an empty function, but GCC prefers to use the parameter
save area instead of setting up a stack frame if it is present.  So
you get stack corruption if you call a () function as a (void)
function.  (The other way round is fine.)


Re: [PATCH v7 0/7] Syscall User Dispatch

2020-11-18 Thread Florian Weimer
* Gabriel Krisman Bertazi:

> The main use case is to intercept Windows system calls of an application
> running over Wine. While Wine is using an unmodified glibc to execute
> its own native Linux syscalls, the Windows libraries might be directly
> issuing syscalls that we need to capture. So there is a mix. While this
> mechanism is compatible with existing libc, we might have other
> libraries executing a syscall instruction directly.

Please raise this on libc-alpha, it's an unexpected compatibility
constraint on glibc.  Thanks.


Re: violating function pointer signature

2020-11-18 Thread Florian Weimer
* Peter Zijlstra:

>> The default Linux calling conventions are all of the cdecl family,
> >> where the caller pops the arguments off the stack.  You didn't quote
> >> enough of the context to tell whether other calling conventions matter in
>> your case.
>
> This is strictly in-kernel, and I think we're all cdecl, of which the
> important part is caller-cleanup. The function compiles to:
>
>   RET
>
> so whatever the arguments are is irrelevant.

Yes, then the stub is ABI-compatible, as far as I know.


Re: violating function pointer signature

2020-11-18 Thread Florian Weimer
* Peter Zijlstra:

> I think that as long as the function is completely empty (it never
> touches any of the arguments) this should work in practise.
>
> That is:
>
>   void tp_nop_func(void) { }
>
> can be used as an argument to any function pointer that has a void
> return. In fact, I already do that, grep for __static_call_nop().

You can pass it as a function parameter, but in general, you cannot
call the function with a different prototype.  Even trivial
differences such as variadic vs non-variadic prototypes matter.

The default Linux calling conventions are all of the cdecl family,
where the caller pops the arguments off the stack.  You didn't quote
enough of the context to tell whether other calling conventions matter in
your case.

> I'm not sure what the LLVM-CFI crud makes of it, but that's their
> problem.

LTO can cause problems as well, particularly with whole-program
optimization.


Re: [PATCH v7 0/7] Syscall User Dispatch

2020-11-18 Thread Florian Weimer
* Gabriel Krisman Bertazi:

> This is the v7 of syscall user dispatch.  This version is a bit
> different from v6 on the following points, after the modifications
> requested on that submission.

Is this supposed to work with existing (Linux) libcs, or do you bring
your own low-level run-time libraries?


Re: [PATCH v7 7/7] docs: Document Syscall User Dispatch

2020-11-18 Thread Florian Weimer
* Gabriel Krisman Bertazi:

> +Interface
> +-
> +
> +A process can setup this mechanism on supported kernels
> +(CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
> +
> +  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
> +
> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
> +disable the mechanism globally for that thread.  When
> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
> +
> +<offset> and <offset+length> delimit a closed memory region interval
> +from which syscalls are always executed directly, regardless of the
> +userspace selector.  This provides a fast path for the C library, which
> +includes the most common syscall dispatchers in the native code
> +applications, and also provides a way for the signal handler to return
> +without triggering a nested SIGSYS on (rt_)sigreturn.  Users of this
> +interface should make sure that at least the signal trampoline code is
> +included in this region. In addition, for syscalls that implement the
> +trampoline code on the vDSO, that trampoline is never intercepted.
> +
> +[selector] is a pointer to a char-sized region in the process memory
> +region, that provides a quick way to enable/disable syscall redirection
> +thread-wide, without the need to invoke the kernel directly.  selector
> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
> +value should terminate the program with a SIGSYS.

Is this a process property or a task/thread property?  The last
paragraph says “thread-wide”, but the first paragraph says “process”.


Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-04 Thread Florian Weimer
* Catalin Marinas:

> Can the dynamic loader mmap() the main exe again while munmap'ing the
> original one? (sorry if it was already discussed)

No, we don't have a descriptor for that.  /proc may not be mounted, and
using the path stored there has a race condition anyway.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-04 Thread Florian Weimer
* Will Deacon:

> Is there real value in this seccomp filter if it only looks at mprotect(),
> or was it just implemented because it's easy to do and sounds like a good
> idea?

It seems bogus to me.  Everyone will just create alias mappings instead,
just like they did for the similar SELinux feature.  See “Example code
to avoid execmem violations” in:

  

As you can see, this reference implementation creates a PROT_WRITE
mapping aliased to a PROT_EXEC mapping, so it actually reduces security
compared to something that generates the code in an anonymous mapping
and calls mprotect to make it executable.

Furthermore, it requires unusual cache flushing code on some AArch64
implementations (a requirement that is not shared by any other Linux
architecture to which libffi has been ported), resulting in
hard-to-track-down real-world bugs.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH 2/4] elf: Move note processing after l_phdr is updated [BZ #26831]

2020-11-03 Thread Florian Weimer
* Szabolcs Nagy:

> Program headers are processed in two passes: after the first pass
> load segments are mmapped so in the second pass target specific
> note processing logic can access the notes.
>
> The second pass is moved later so various link_map fields are
> set up that may be useful for note processing such as l_phdr.
> ---
>  elf/dl-load.c | 30 +++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/elf/dl-load.c b/elf/dl-load.c
> index ceaab7f18e..673cf960a0 100644
> --- a/elf/dl-load.c
> +++ b/elf/dl-load.c
> @@ -1259,21 +1259,6 @@ _dl_map_object_from_fd (const char *name, const char 
> *origname, int fd,
> maplength, has_holes, loader);
>  if (__glibc_unlikely (errstring != NULL))
>goto call_lose;
> -
> -/* Process program headers again after load segments are mapped in
> -   case processing requires accessing those segments.  Scan program
> -   headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
> -   exits.  */
> -for (ph = &phdr[l->l_phnum]; ph != phdr; --ph)
> -  switch (ph[-1].p_type)
> - {
> - case PT_NOTE:
> -   _dl_process_pt_note (l, fd, &ph[-1]);
> -   break;
> - case PT_GNU_PROPERTY:
> -   _dl_process_pt_gnu_property (l, fd, &ph[-1]);
> -   break;
> - }
>}
>  
>if (l->l_ld == 0)
> @@ -1481,6 +1466,21 @@ cannot enable executable stack as shared object 
> requires");
>  /* Assign the next available module ID.  */
>  l->l_tls_modid = _dl_next_tls_modid ();
>  
> +  /* Process program headers again after load segments are mapped in
> + case processing requires accessing those segments.  Scan program
> + headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
> + exits.  */
> +  for (ph = &l->l_phdr[l->l_phnum]; ph != l->l_phdr; --ph)
> +switch (ph[-1].p_type)
> +  {
> +  case PT_NOTE:
> + _dl_process_pt_note (l, fd, &ph[-1]);
> + break;
> +  case PT_GNU_PROPERTY:
> + _dl_process_pt_gnu_property (l, fd, &ph[-1]);
> + break;
> +  }
> +
>  #ifdef DL_AFTER_LOAD
>DL_AFTER_LOAD (l);
>  #endif

Is this still compatible with the CET requirements?

I hope it is because the CET magic happens in _dl_open_check, so after
the code in elf/dl-load.c has run.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH 3/4] aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]

2020-11-03 Thread Florian Weimer
* Szabolcs Nagy:

> Re-mmap executable segments if possible instead of using mprotect
> to add PROT_BTI. This allows using BTI protection with security
> policies that prevent mprotect with PROT_EXEC.
>
> If the fd of the ELF module is not available because it was kernel
> mapped then mprotect is used and failures are ignored.  It is
> expected that linux kernel will add PROT_BTI when mapping a module
> (current linux as of version 5.9 does not do this).
>
> Computing the mapping parameters follows the logic of
> _dl_map_object_from_fd more closely now.

What's the performance of this on execve-heavy workloads, such as kernel
or glibc builds?  Hopefully it's cheap because these mappings have not
been faulted in yet.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [RFC PATCH 1/2] rseq: Implement KTLS prototype for x86-64

2020-10-29 Thread Florian Weimer
* Mathieu Desnoyers:

> - On Sep 29, 2020, at 4:13 AM, Florian Weimer fwei...@redhat.com wrote:
>
>> * Mathieu Desnoyers:
>> 
>>>> So we have a bootstrap issue here that needs to be solved, I think.
>>>
>>> The one thing I'm not sure about is whether the vDSO interface is indeed
>>> superior to KTLS, or if it is just the model we are used to.
>>>
>>> AFAIU, the current use-cases for vDSO is that an application calls into
>>> glibc, which then calls the vDSO function exposed by the kernel. I wonder
>>> whether the vDSO indirection is really needed if we typically have a glibc
>>> function used as indirection ? For an end user, what is the benefit of vDSO
>>> over accessing KTLS data directly from glibc ?
>> 
>> I think the kernel can only reasonably maintain a single userspace data
>> structure.  It's not reasonable to update several versions of the data
>> structure in parallel.
>
> I disagree with your statement. Considering that the kernel needs to
> keep ABI compatibility for whatever it exposes to user-space, claiming
> that it should never update several versions of data structures
> exposed to user-space in parallel means that once a data structure is
> exposed to user-space as ABI in a certain way, it can never ever
> change in the future, even if we find a better way to do things.

I think it's possible to put data into userspace without making it ABI.
Think about the init_module system call.  The module blob comes from
userspace, but its (deeper) internal structure does not have a stable
ABI.  Similar for many BPF use cases.

If the internal KTLS blob structure turns into ABI, including the parts
that need to be updated on context switch, each versioning change has a
performance impact.

>> This means that glibc would have to support multiple kernel data
>> structures, and users might lose userspace acceleration after a kernel
>> update, until they update glibc as well.  The glibc update should be
>> ABI-compatible, but someone would still have to backport it, apply it to
>> container images, etc.
>
> No. If the kernel ever exposes a data structure to user-space as ABI,
> then it needs to stay there, and not break userspace. Hence the need to
> duplicate information provided to user-space if need be, so we can move
> on to better ABIs without breaking the old ones.

It can expose the data as an opaque blob.

> Or as Andy mentioned, we would simply pass the ktls offset as argument to
> the vDSO ? It seems simple enough. Would it fit all our use-cases including
> errno ?

That would work, yes.  It's neat, but it won't give you a way to provide
traditional syscall wrappers directly from the vDSO.

>> We'll see what will break once we have the correct TID after vfork. 8->
>> glibc currently supports malloc-after-vfork as an extension, and
>> a lot of software depends on it (OpenJDK, for example).
>
> I am not sure to see how that is related to ktls ?

The mutex implementation could switch to the KTLS TID because it is
always correct.  But then locking in a vfork'ed subprocess would no
longer look like locking from the parent thread because the TID would be
different.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Possible bug in getdents64()?

2020-10-29 Thread Florian Weimer
* Alejandro Colomar via Libc-alpha:

> [[ CC += linux-man, linux-kernel, libc-alpha, mtk ]]
>
> On 2020-10-28 20:26, Alejandro Colomar wrote:
>> The manual page for getdents64() says the prototype should be the
>> following:
>>     int getdents64(unsigned int fd, struct linux_dirent64 *dirp,
>>      unsigned int count);
>> 
>> Note the type of 'count': 'unsigned int'
>> (usually a 32-bit unsigned integer).
>> And the Linux kernel seems to use those types (fs/readdir.c:351):
>> SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>>      struct linux_dirent64 __user *, dirent,
>>      unsigned int, count)
>> {
>> ...
>> }
>> But glibc uses 'size_t' (usually a 64-bit unsigned integer)
>> for the parameter 'count' (sysdeps/unix/linux/getdents64.c:25):
>> 
>> /* The kernel struct linux_dirent64 matches the 'struct dirent64' type.  */
>> ssize_t
>> __getdents64 (int fd, void *buf, size_t nbytes)
>> {
>>    /* The system call takes an unsigned int argument, and some length
>>   checks in the kernel use an int type.  */
>>    if (nbytes > INT_MAX)
>>      nbytes = INT_MAX;
>>    return INLINE_SYSCALL_CALL (getdents64, fd, buf, nbytes);
>> }
>> libc_hidden_def (__getdents64)
>> weak_alias (__getdents64, getdents64)
>> 
>> Isn't it undefined behavior to pass a variable of a different
>> (larger) type to a variadic function than what it expects?
>> Is that behavior defined in this implementation?
>> Wouldn't a cast to 'unsigned int' be needed?

There is no variadic function involved here.  INLINE_SYSCALL_CALL takes
care of the zero extension to the register internally, irrespective of
the argument type.  (The register is of type long int or long long int,
depending on the architecture.)

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-27 Thread Florian Weimer
* Dave Martin via Libc-alpha:

> On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
>> * Dave Martin via Libc-alpha:
>> 
>> > Would it now help to add something like:
>> >
>> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
>> > {
>> >int ret = -EINVAL;
>> >mmap_write_lock(current->mm);
>> >if (all vmas in [addr .. addr + len) have
>> >their mprotect flags set to old_flags) {
>> >
>> >ret = mprotect(addr, len, new_flags);
>> >}
>> >
>> >mmap_write_unlock(current->mm);
>> >return ret;
>> > }
>> 
>> I suggested something similar as well.  Ideally, the interface would
>> subsume pkey_mprotect, though, and have a separate flags argument from
>> the protection flags.  But then we run into argument list length limits.
>>
>> Thanks,
>> Florian
>
> I suppose.  Assuming that a syscall filter can inspect memory, we might
> be able to bundle arguments into a struct if necessary.

But that leads to a discussion about batch mmap/mprotect/munmap, and
that's again incompatible with seccomp (it would need a checking loop).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Florian Weimer
* Dave Martin via Libc-alpha:

> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
>   int ret = -EINVAL;
>   mmap_write_lock(current->mm);
>   if (all vmas in [addr .. addr + len) have
>   their mprotect flags set to old_flags) {
>
>   ret = mprotect(addr, len, new_flags);
>   }
>   
>   mmap_write_unlock(current->mm);
>   return ret;
> }

I suggested something similar as well.  Ideally, the interface would
subsume pkey_mprotect, though, and have a separate flags argument from
the protection flags.  But then we run into argument list length limits.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Topi Miettinen:

> Allowing mprotect(PROT_EXEC|PROT_BTI) would mean that all you need to
> circumvent MDWX is to add PROT_BTI flag. I'd suggest getting the flags 
> right at mmap() time or failing that, reverting the PROT_BTI for
> legacy programs later.
>
> Could the kernel tell the loader of the BTI situation with auxiliary
> vectors? Then it would be easy for the loader to always use the best 
> mmap() flags without ever needing to mprotect().

I think what we want is a mprotect2 call with a flags argument (separate
from protection flags) that tells the kernel that the request *removes*
protection flags and should fail otherwise.  seccomp could easily filter
that then.

But like the other proposals, the migration story isn't great.  You
would need kernel and seccomp/systemd etc. updates before glibc starts
working, even if glibc has a fallback from mprotect2 to mprotect
(because the latter would be blocked).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Topi Miettinen:

>> The dynamic loader has to process the LOAD segments to get to the ELF
>> note that says to enable BTI.  Maybe we could do a first pass and
>> load only the segments that cover notes.  But that requires lots of
>> changes to generic code in the loader.
>
> What if the loader always enabled BTI for PROT_EXEC pages, but then
> when discovering that this was a mistake, mprotect() the pages without
> BTI?

Is that architecturally supported?  How costly is the mprotect change if
the pages have not been faulted in yet?

> Then both BTI and MDWX would work and the penalty of not getting
> MDWX would fall to non-BTI programs. What's the expected proportion of
> BTI enabled code vs. disabled in the future, is it perhaps expected
> that a distro would enable the flag globally so eventually only a few
> legacy programs might be unprotected?

Eventually, I expect that mainstream distributions build everything for
BTI, so yes, the PROT_BTI removal would only be needed for legacy
programs.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Florian Weimer
* Lennart Poettering:

> On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:
>
>> Hi,
>>
>> There is a problem with glibc+systemd on BTI enabled systems. Systemd
>> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
>> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
>> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
>> caught by the seccomp filter, resulting in service failures.
>>
>> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
>> This is obviously not desirable.
>>
>> Various changes have been suggested, replacing the mprotect with mmap calls
>> having PROT_BTI set on the original mapping, re-mmapping the segments,
>> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
>> and various modification to seccomp to allow particular mprotect cases to
>> bypass the filters. In each case there seems to be an undesirable attribute
>> to the solution.
>>
>> So, whats the best solution?
>
> Did you see Topi's comments on the systemd issue?
>
> https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
>
> I think I agree with this: it's a bit weird to alter the bits after
> the fact. Can't glibc set up everything right from the begining? That
> would keep both concepts working.

The dynamic loader has to process the LOAD segments to get to the ELF
note that says to enable BTI.  Maybe we could do a first pass and load
only the segments that cover notes.  But that requires lots of changes
to generic code in the loader.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Additional debug info to aid cacheline analysis

2020-10-11 Thread Florian Weimer
* Mark Wielaard:

> On Sun, Oct 11, 2020 at 02:15:18PM +0200, Florian Weimer wrote:
>> * Mark Wielaard:
>> 
>> > Yes, that would work. I don't know what the lowest supported GCC
>> > version is, but technically it was definitely fixed in 4.10.0, 4.8.4
>> > and 4.9.2. And various distros would probably have backported the
>> > fix. But checking for 5.0+ would certainly give you a good version.
>> >
>> > How about the attached?
>> 
>> Would it be possible to test for the actual presence of the bug, using
>> -fcompare-debug?
>
> Yes, that was discussed in the original commit message, but it was decided
> that disabling it unconditionaly was easier. See commit 2062afb4f.

I think the short test case was not yet available at the time of the
Linux commit.  But then it may not actually detect the bug in all
affected compilers.

Anyway, making this conditional on the GCC version is already a clear
improvement.


Re: Additional debug info to aid cacheline analysis

2020-10-11 Thread Florian Weimer
* Mark Wielaard:

> Yes, that would work. I don't know what the lowest supported GCC
> version is, but technically it was definitely fixed in 4.10.0, 4.8.4
> and 4.9.2. And various distros would probably have backported the
> fix. But checking for 5.0+ would certainly give you a good version.
>
> How about the attached?

Would it be possible to test for the actual presence of the bug, using
-fcompare-debug?

(But it seems to me that the treatment of this particular compiler bug
is an outlier: other equally tricky bugs do not receive this kind of
attention.)


Re: Control Dependencies vs C Compilers

2020-10-07 Thread Florian Weimer
* Peter Zijlstra:

> On Tue, Oct 06, 2020 at 11:20:01PM +0200, Florian Weimer wrote:
>> * Peter Zijlstra:
>> 
>> > Our Documentation/memory-barriers.txt has a Control Dependencies section
>> > (which I shall not replicate here for brevity) which lists a number of
>> > caveats. But in general the work-around we use is:
>> >
>> >x = READ_ONCE(*foo);
>> >if (x > 42)
>> >WRITE_ONCE(*bar, 1);
>> >
>> > Where READ/WRITE_ONCE() cast the variable volatile. The volatile
>> > qualifier dissuades the compiler from assuming it knows things and we
>> > then hope it will indeed emit the branch like we'd expect.
>> >
>> >
>> > Now, hoping the compiler generates correct code is clearly not ideal and
>> > very dangerous indeed. Which is why my question to the compiler folks
>> > assembled here is:
>> >
>> >   Can we get a C language extention for this?
>> 
>> For what exactly?
>
> A branch that cannot be optimized away and prohibits lifting stores
> over. One possible suggestion would be allowing the volatile keyword as
> a qualifier to if.
>
>   x = *foo;
>   volatile if (x > 42)
>   *bar = 1;
>
> This would tell the compiler that the condition is special in that it
> must emit a conditional branch instruction and that it must not lift
> stores (or sequence points) over it.

But it's not the if statement, but the expression in it, right?  So this
would not be a valid transformation:

x = *foo;
bool flag = x > 42;
volatile if (flag)
*bar = 1;

And if we had this:

unsigned x = *foo;
volatile if (x >= 17 && x < 42)
*bar = 1;

Would it be valid to transform this into (assuming that I got the
arithmetic correct):

unsigned x = *foo;
volatile if ((x - 17) < 25)
*bar = 1;

Or would this destroy the magic because arithmetic happens on the value
before the comparison?

>> But not using READ_ONCE and WRITE_ONCE?
>
> I'm OK with READ_ONCE(), but the WRITE_ONCE() doesn't help much, if
> anything. The compiler is always allowed to lift stores, regardless of
> the qualifiers used.

I would hope that with some level of formalization, it can be shown that
no additional synchronization is necessary beyond the load/conditional
sequence.

>> I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
>> and __atomic_store_n(foo, __ATOMIC_RELAXED).  GCC can't optimize relaxed
>> MO loads and stores because the C memory model is defective and does not
>> actually guarantee the absence of out-of-thin-air values (a property it
>> was supposed to have).
>
> AFAIK people want to get that flaw in the C memory model fixed (which to
> me seemd like a very good idea).

It's been a long time since people realized that this problem exists,
with several standard releases since then.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Control Dependencies vs C Compilers

2020-10-06 Thread Florian Weimer
* Peter Zijlstra:

> Our Documentation/memory-barriers.txt has a Control Dependencies section
> (which I shall not replicate here for brevity) which lists a number of
> caveats. But in general the work-around we use is:
>
>   x = READ_ONCE(*foo);
>   if (x > 42)
>   WRITE_ONCE(*bar, 1);
>
> Where READ/WRITE_ONCE() cast the variable volatile. The volatile
> qualifier dissuades the compiler from assuming it knows things and we
> then hope it will indeed emit the branch like we'd expect.
>
>
> Now, hoping the compiler generates correct code is clearly not ideal and
> very dangerous indeed. Which is why my question to the compiler folks
> assembled here is:
>
>   Can we get a C language extention for this?

For what exactly?

Do you want a compiler that never simplifies conditional expressions
(like some people want compilers that never re-associate floating point
operations)?

With a bit of effort, we could prototype such a compiler and run
benchmarks against a kernel that was built using it.  But I'm not sure
if that's a good use of resources.

> And while we have a fair number (and growing) existing users of this in
> the kernel, I'd not be adverse to having to annotate them.

But not using READ_ONCE and WRITE_ONCE?

I think in GCC, they are called __atomic_load_n(foo, __ATOMIC_RELAXED)
and __atomic_store_n(foo, __ATOMIC_RELAXED).  GCC can't optimize relaxed
MO loads and stores because the C memory model is defective and does not
actually guarantee the absence of out-of-thin-air values (a property it
was supposed to have).  Obviously, actual implementations want to
provide this guarantee.  So it's really impossible right now to argue
about this in any formal way and determine the validity of optimizations
(which is why there are none, hopefully).

In standard C, there is <stdatomic.h>, but its relaxed MO loads and
stores can of course be optimized, so you'll need a compiler extension
here.

A different way of annotating this would be a variant of _Atomic where
plain accesses have relaxed MO, not seq-cst MO.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [RFC PATCH 0/4] x86: Improve Minimum Alternate Stack Size

2020-10-06 Thread Florian Weimer
* Dave Martin via Libc-alpha:

> On Tue, Oct 06, 2020 at 08:33:47AM -0700, Dave Hansen wrote:
>> On 10/6/20 8:25 AM, Dave Martin wrote:
>> > Or are people reporting real stack overruns on x86 today?
>> 
>> We have real overruns.  We have ~2800 bytes of XSAVE (register) state
>> mostly from AVX-512, and a 2048 byte MINSIGSTKSZ.
>
> Right.  Out of interest, do you believe that's a direct consequence of
> the larger kernel-generated signal frame, or does the expansion of
> userspace stack frames play a role too?

I must say that I do not quite understand this question.

32 64-*byte* registers simply need 2048 bytes of storage space worst
case, there is really no way around that.

> In practice software just assumes SIGSTKSZ and then ignores the problem
> until / unless an actual stack overflow is seen.
>
> There's probably a lot of software out there whose stack is
> theoretically too small even without AVX-512 etc. in the mix, especially
> when considering the possibility of nested signals...

That is certainly true.  We have seen problems with ntpd, which
requested a 16 KiB stack, at a time when there were various deductions
from the stack size, and since the glibc dynamic loader also uses XSAVE,
ntpd exceeded the remaining stack space.  But in this case, we just
fudged the stack size computation in pthread_create and made it less
likely that the dynamic loader was activated, which largely worked
around this particular problem.  For MINSIGSTKSZ, we just don't have
this option because it's simply too small in the first place.

I don't immediately recall a bug due to SIGSTKSZ being too small.  The
test cases I wrote for this were all artificial, to raise awareness of
this issue (applications treating these constants as recommended values,
rather than as minimum values that merely avoid immediate sigaltstack or
pthread_create failures; the same issue exists with PTHREAD_STACK_MIN).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [RFC PATCH 1/2] rseq: Implement KTLS prototype for x86-64

2020-09-29 Thread Florian Weimer
* Mathieu Desnoyers:

>> So we have a bootstrap issue here that needs to be solved, I think.
>
> The one thing I'm not sure about is whether the vDSO interface is indeed
> superior to KTLS, or if it is just the model we are used to.
>
> AFAIU, the current use-cases for vDSO is that an application calls into
> glibc, which then calls the vDSO function exposed by the kernel. I wonder
> whether the vDSO indirection is really needed if we typically have a glibc
> function used as indirection ? For an end user, what is the benefit of vDSO
> over accessing KTLS data directly from glibc ?

I think the kernel can only reasonably maintain a single userspace data
structure.  It's not reasonable to update several versions of the data
structure in parallel.

This means that glibc would have to support multiple kernel data
structures, and users might lose userspace acceleration after a kernel
update, until they update glibc as well.  The glibc update should be
ABI-compatible, but someone would still have to backport it, apply it to
container images, etc.

What's worse, the glibc code would be quite hard to test because we
would have to keep around multiple kernel versions to exercise all the
different data structure variants.

In contrast, the vDSO code always matches the userspace data structures,
is always updated at the same time, and tested together.  That looks
like a clear win to me.

> If we decide that using KTLS from a vDSO function is indeed a requirement,
> then, as you point out, the thread_pointer is available as ABI, but we miss
> the KTLS offset.
>
> Some ideas on how we could solve this: we could either make the KTLS
> offset part of the ABI (fixed offset), or save the offset near the
> thread pointer at a location that would become ABI. It would have to
> be already populated with something which can help detect the case
> where a vDSO is called from a thread which does not populate KTLS
> though. Is that even remotely doable ?

I don't know.

We could decide that these accelerated system calls must only be called
with a valid TCB.  That's unavoidable if the vDSO sets errno directly,
so it's perhaps not a big loss.  It's also backwards-compatible because
existing TCB-less code won't know about those new vDSO entrypoints.
Calling into glibc from a TCB-less thread has always been undefined.
TCB-less code would have to make direct, non-vDSO system calls, as today.

For discovering the KTLS offset, a per-process page at a fixed offset
from the vDSO code (i.e., what real shared objects already do for global
data) could store this offset.  This way, we could entirely avoid an ABI
dependency.

We'll see what will break once we have the correct TID after vfork. 8->
glibc currently supports malloc-after-vfork as an extension, and
a lot of software depends on it (OpenJDK, for example).

>> With the latter, we could
>> directly expose the vDSO implementation to applications, assuming that
>> we agree that the vDSO will not fail with ENOSYS to request fallback to
>> the system call, but will itself perform the system call.
>
> We should not forget the fields needed by rseq as well: the rseq_cs
> pointer and the cpu_id fields need to be accessed directly from the
> rseq critical section, without function call. Those use-cases require
> that applications and library can know the KTLS offset and size and
> use those fields directly.

Yes, but those offsets could be queried using a function from the vDSO
(or using a glibc interface, to simplify linking).

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [RFC PATCH 1/2] rseq: Implement KTLS prototype for x86-64

2020-09-28 Thread Florian Weimer
* Mathieu Desnoyers:

> Upstreaming efforts aiming to integrate rseq support into glibc led to
> interesting discussions, where we identified a clear need to extend the
> size of the per-thread structure shared between kernel and user-space
> (struct rseq).  This is something that is not possible with the current
> rseq ABI.  The fact that the current non-extensible rseq kernel ABI
> would also prevent glibc's ABI to be extended prevents its integration
> into glibc.
>
> Discussions with glibc maintainers led to the following design, which we
> are calling "Kernel Thread Local Storage" or KTLS:
>
> - at glibc library init:
>   - glibc queries the size and alignment of the KTLS area supported by the
> kernel,
>   - glibc reserves the memory area required by the kernel for main
> thread,
>   - glibc registers the offset from thread pointer where the KTLS area
> will be placed for all threads belonging to the threads group which
> are created with clone3 CLONE_RSEQ_KTLS,
> - at nptl thread creation:
>   - glibc reserves the memory area required by the kernel,
> - application/libraries can query glibc for the offset/size of the
>   KTLS area, and offset from the thread pointer to access that area.

One remaining challenge see is that we want to use vDSO functions to
abstract away the exact layout of the KTLS area.  For example, there are
various implementation strategies for getuid optimizations, some of them
exposing a shared struct cred in a thread group, and others not doing
that.

The vDSO has access to the thread pointer because it's ABI (something
that we recently (and quite conveniently) clarified for x86).  What it
does not know is the offset of the KTLS area from the thread pointer.
In the original rseq implementation, this offset could vary from thread
to thread in a process, although the submitted glibc implementation did
not use this level of flexibility and the offset is constant.  The vDSO
is not relocated by the run-time dynamic loader, so it can't use ELF TLS
data.

Furthermore, not all threads in a thread group may have an associated
KTLS area.  In a potential glibc implementation, only the threads
created by pthread_create would have it; threads created directly using
clone would lack it (and would not even run with a correctly set up
userspace TCB).

So we have a bootstrap issue here that needs to be solved, I think.

In most cases, I would not be too eager to bypass the vDSO completely,
and having the kernel expose a data-only interface.  I could perhaps
make an exception for the current TID because that's so convenient to
use in mutex implementations, and errno.  With the latter, we could
directly expose the vDSO implementation to applications, assuming that
we agree that the vDSO will not fail with ENOSYS to request fallback to
the system call, but will itself perform the system call.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-24 Thread Florian Weimer
* Madhavan T. Venkataraman:

> Otherwise, using an ABI quirk or a calling convention side effect to
> load the PC into a GPR is, IMO, non-standard or non-compliant or
> non-approved or whatever you want to call it. I would be
> conservative and not use it. Who knows what incompatibility there
> will be with some future software or hardware features?

AArch64 PAC makes a backwards-incompatible change that touches this
area, but we'll see if they can actually get away with it.

In general, these things are baked into the ABI, even if they are not
spelled out explicitly in the psABI supplement.

> For instance, in the i386 example, we do a call without a matching return.
> Also, we use a pop to undo the call. Can anyone tell me if this kind of use
> is an ABI approved one?

Yes, for i386, this is completely valid from an ABI point of view.
It's equally possible to use a regular function call and just read the
return address that has been pushed to the stack.  Then there's no
stack mismatch at all.  Return stack predictors (including the one
used by SHSTK) also recognize the CALL 0 construct, so that's fine as
well.  The i386 psABI does not use function descriptors, and either
approach (out-of-line thunk or CALL 0) is in common use to materialize
the program counter in a register and construct the GOT pointer.

> If the kernel supplies this, then all applications and libraries can use
> it for all architectures with one single, simple API. Without this, each
> application/library has to roll its own solution for every architecture-ABI
> combo it wants to support.

Is there any other user for these type-generic trampolines?
Everything else I've seen generates machine code specific to the
function being called.  libffi is quite the outlier in my experience
because the trampoline calls a generic data-driven
marshaller/unmarshaller.  The other trampoline generators put this
marshalling code directly into the generated trampoline.

I'm still not convinced that this can't be done directly in libffi,
without kernel help.  Hiding the architecture-specific code in the
kernel doesn't reduce overall system complexity.

> As an example, in libffi:
>
>   ffi_closure_alloc() would call alloc_tramp()
>
>   ffi_prep_closure_loc() would call init_tramp()
>
>   ffi_closure_free() would call free_tramp()
>
> That is it! It works on all the architectures supported in the kernel for
> trampfd.

ffi_prep_closure_loc would still need to check whether the trampoline
has been allocated by alloc_tramp because some applications supply
their own (executable and writable) mapping.  ffi_closure_alloc would
need to support different sizes (not matching the trampoline).  It's
also unclear to me to what extent software out there writes to the
trampoline data directly, bypassing the libffi API (the structs are
not opaque, after all).  And all the existing libffi memory management
code (including the embedded dlmalloc copy) would be needed to support
kernels without trampfd for years to come.

I very much agree that we have a gap in libffi when it comes to
JIT-less operation.  But I'm not convinced that kernel support is
needed to close it, or that it is even the right design.


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-23 Thread Florian Weimer
* Solar Designer:

> While I share my opinion here, I don't mean that to block Madhavan's
> work.  I'd rather defer to people more knowledgeable in current userland
> and ABI issues/limitations and plans on dealing with those, especially
> to Florian Weimer.  I haven't seen Florian say anything specific for or
> against Madhavan's proposal, and I'd like to.  (Have I missed that?)

There was a previous discussion, where I provided feedback (not much
different from the feedback here, given that the mechanism is mostly the
same).

I think it's unnecessary for the libffi use case.  Precompiled code can
be loaded from disk because the libffi trampolines are so regular.  On
most architectures, it's not even the code that's patched, but some of
the data driving it, which happens to be located on the same page due to
a libffi quirk.

The libffi use case is a bit strange anyway: its trampolines are
type-generic, and the per-call adjustment is data-driven.  This means
that once you have libffi in the process, you have a generic
data-to-function-call mechanism available that can be abused (it's even
fully CET compatible in recent versions).  And then you need to look at
the processes that use libffi.  A lot of them contain bytecode
interpreters, and those enable data-driven arbitrary code execution as
well.  I know that there are efforts under way to harden Python, but
it's going to be tough to get to the point where things are still
difficult for an attacker once they have the ability to make mprotect
calls.

It was pointed out to me that libffi is doing things wrong, and the
trampolines should not be type-generic, but generated so that they match
the function being called.  That is, the marshal/unmarshal code would be
open-coded in the trampoline, rather than using some generic mechanism
plus run-time dispatch on data tables describing the function type.
That is a very different design (and typically used by compilers (JIT or
not JIT) to implement native calls).  Mapping some code page with a
repeating pattern would no longer work to defeat anti-JIT measures
because it's closer to real JIT.  I don't know if kernel support could
make sense in this context, but it would be a completely different
patch.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: Expose 'array_length()' macro in

2020-09-22 Thread Florian Weimer
* Jonathan Wakely:

> I don't see much point in using std::size here. If you're going to
> provide the alternative implementation for when std::size isn't
> defined, why not just use it always?
>
> template <typename _Tp, std::size_t _Num>
> #if __cplusplus >= 201103L
> constexpr
> #endif
> inline std::size_t
> __array_length(const _Tp(&)[_Num]) __THROW
> {
>   return _Num;
> }
>
> This only requires <cstddef>, not <iterator>.

I agree that this is an advantage.  But the version without constexpr is
not sufficient because __array_length does not produce a constant
expression.

I've seen something like this used instead:

  template <typename _Tp, std::size_t _Num>
  char (&___array_length(const _Tp(&)[_Num]))[_Num];
  #define __array_length(v) (sizeof(___array_length(v)))

If the function type is too cute, a helper struct could be used instead.
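A compilable sketch of the sizeof-based helper described above, with illustrative names (real glibc code would use the reserved double-underscore identifiers shown in the quote):

```cpp
#include <cstddef>

// Declaration only: the function is never called, because sizeof does
// not evaluate its operand.  Its return type is a reference to
// char[Num], so sizeof on it yields the element count directly.
template <typename Tp, std::size_t Num>
char (&array_length_helper(const Tp (&)[Num]))[Num];

#define array_length(v) (sizeof(array_length_helper(v)))
```

Because the whole expression is a sizeof, it is a constant expression even in C++98, which is exactly what the non-constexpr version of __array_length could not provide.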

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-17 Thread Florian Weimer
* Madhavan T. Venkataraman:

> On 9/17/20 10:36 AM, Madhavan T. Venkataraman wrote:
>>>> libffi
>>>> ==
>>>>
>>>> I have implemented my solution for libffi and provided the changes for
>>>> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>>>>
>>>> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt
>>> The URL does not appear to work, I get a 403 error.
>> I apologize for that. That site is supposed to be accessible publicly.
>> I will contact the administrator and get this resolved.
>> 
>> Sorry for the annoyance.

> Could you try the link again and confirm that you can access it?
> Again, sorry for the trouble.

Yes, it works now.  Thanks for having it fixed.


Re: [PATCH v2 0/4] [RFC] Implement Trampoline File Descriptor

2020-09-16 Thread Florian Weimer
* madvenka:

> Examples of trampolines
> ===
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
>   - Python
>   - Java
>   - Javascript
>   - Ruby FFI
>   - Lisp
>   - Objective C

libffi does not actually need this.  It currently collocates
trampolines and the data they need on the same page, but that's
actually unnecessary.  It's possible to avoid doing this just by
changing libffi, without any kernel changes.

I think this has already been done for the iOS port.

> The code for trampoline X in the trampoline table is:
> 
>   load &code_table[X], code_reg
>   load (code_reg), code_reg
>   load &data_table[X], data_reg
>   load (data_reg), data_reg
>   jump code_reg
> 
> The addresses &code_table[X] and &data_table[X] are baked into the
> trampoline code. So, PC-relative data references are not needed. The user
> can modify code_table[X] and data_table[X] dynamically.

You can put this code into the libffi shared object and map it from
there, just like the rest of the libffi code.  To get more
trampolines, you can map the page containing the trampolines multiple
times, each instance preceded by a separate data page with the control
information.
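As a sketch of the layout described above, one page of prebuilt code can be mapped repeatedly from the same file offset, each instance paired with its own data page. This minimal Linux demo uses memfd_create as a stand-in for the libffi shared object, maps read-only rather than executable, and only verifies the aliasing:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

// Map the same "code" page twice; in the real scheme each mapping
// would be PROT_READ|PROT_EXEC and preceded by a private data page
// holding the per-trampoline control information.
bool map_code_page_twice() {
    long page = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("fake-trampolines", 0);
    if (fd < 0 || ftruncate(fd, page) != 0)
        return false;

    unsigned char pattern[3] = {0x90, 0x90, 0xc3};  // stand-in bytes
    if (pwrite(fd, pattern, sizeof(pattern), 0) != (ssize_t)sizeof(pattern))
        return false;

    void *a = mmap(nullptr, page, PROT_READ, MAP_SHARED, fd, 0);
    void *b = mmap(nullptr, page, PROT_READ, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return false;

    // Two distinct addresses, identical contents: one physical page.
    bool ok = a != b && memcmp(a, b, page) == 0;
    munmap(a, page);
    munmap(b, page);
    close(fd);
    return ok;
}
```

The same pattern, with MAP_FIXED placements interleaving data and code pages, is essentially what the iOS libffi port does to avoid runtime code generation.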

I think the previous patch submission has also resulted in several
comments along those lines, so I'm not sure why you are reposting
this.

> libffi
> ==
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.v2.txt

The URL does not appear to work, I get a 403 error.

> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.

I have not seen your libffi changes, but I expect that the complexity
is about the same as a userspace-only solution.


Cc:ing libffi upstream for awareness.  The start of the thread is
here:




Re: [PATCH v9 3/3] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-09-03 Thread Florian Weimer
* Minchan Kim:

> On Tue, Sep 01, 2020 at 08:46:02PM +0200, Florian Weimer wrote:
>> * Minchan Kim:
>> 
>> >   ssize_t process_madvise(int pidfd, const struct iovec *iovec,
>> > unsigned long vlen, int advice, unsigned int flags);
>> 
>> size_t for vlen provides a clearer hint regarding the type of special
>> treatment needed for ILP32 here (zero extension, not changing the type
>> to long long).
>> 
>
> All existing system calls using iovec in Linux use unsigned long so
> I want to be consistent with them unless process_madvise needs something
> special.

Userspace uses int, following POSIX (where applicable).  There is no
consistency to be had here.


Re: [PATCH v9 3/3] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-09-01 Thread Florian Weimer
* Minchan Kim:

>   ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> unsigned long vlen, int advice, unsigned int flags);

size_t for vlen provides a clearer hint regarding the type of special
treatment needed for ILP32 here (zero extension, not changing the type
to long long).


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-01 Thread Florian Weimer
* Yu-cheng Yu:

> On 9/1/2020 10:50 AM, Florian Weimer wrote:
>> * Yu-cheng Yu:
>> 
>>> Like other arch_prctl()'s, this parameter was 'unsigned long'
>>> earlier. The idea was, since this arch_prctl is only implemented for
>>> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
>>> it back to 'unsigned long'.
>> What about x32?  In general, long is rather problematic for x32.
>
> The problem is the size of 'long', right?
> Because this parameter is passed in a register, and only the lower
> bits are used, x32 works as well.

The userspace calling convention leaves the upper 32-bit undefined.
Therefore, this only works by accident if the kernel does not check that
the upper 32-bit are zero, which is probably a kernel bug.

It's unclear to me what you are trying to accomplish.  Why do you want
to use unsigned long here?  The correct type appears to be unsigned int.
This correctly expresses that the upper 32 bits of the register do not
matter.
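The point can be illustrated without kernel code: converting a 64-bit register value to unsigned int discards the upper half, so an interface declared with unsigned int cannot accidentally depend on undefined upper bits (names below are illustrative):

```cpp
#include <cstdint>

// Simulates a syscall argument register whose upper 32 bits hold
// leftover garbage, as the x32 userspace calling convention permits.
unsigned int as_uint_arg(std::uint64_t reg) {
    // The conversion keeps only the low 32 bits -- exactly the
    // contract that "unsigned int" expresses for the interface.
    return static_cast<unsigned int>(reg);
}
```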

Thanks,
Florian



Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-09-01 Thread Florian Weimer
* Yu-cheng Yu:

> Like other arch_prctl()'s, this parameter was 'unsigned long'
> earlier. The idea was, since this arch_prctl is only implemented for
> the 64-bit kernel, we wanted it to look as 64-bit only.  I will change
> it back to 'unsigned long'.

What about x32?  In general, long is rather problematic for x32.

Thanks,
Florian



Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-27 Thread Florian Weimer
* H. J. Lu:

> Can you think of ANY issues of passing more arguments to arch_prctl?

On x32, the glibc arch_prctl system call wrapper only passes two
arguments to the kernel, and applications have no way of detecting that.
musl only passes two arguments on all architectures.  It happens to work
anyway with default compiler flags, but that's an accident.

Thanks,
Florian



Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-27 Thread Florian Weimer
* H. J. Lu:

> On Thu, Aug 27, 2020 at 6:19 AM Florian Weimer  wrote:
>>
>> * Dave Martin:
>>
>> > You're right that this has implications: for i386, libc probably pulls
>> > more arguments off the stack than are really there in some situations.
>> > This isn't a new problem though.  There are already generic prctls with
>> > fewer than 4 args that are used on x86.
>>
>> As originally posted, glibc prctl would have to know that it has to pull
>> an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
>> then the u64 argument is a problem for arch_prctl as well.
>>
>
> Argument of ARCH_X86_CET_DISABLE is int and passed in register.

The commit message and the C source say otherwise, I think (not sure
about the C source, not a kernel hacker).

Thanks,
Florian



Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-27 Thread Florian Weimer
* Dave Martin:

> You're right that this has implications: for i386, libc probably pulls
> more arguments off the stack than are really there in some situations.
> This isn't a new problem though.  There are already generic prctls with
> fewer than 4 args that are used on x86.

As originally posted, glibc prctl would have to know that it has to pull
an u64 argument off the argument list for ARCH_X86_CET_DISABLE.  But
then the u64 argument is a problem for arch_prctl as well.

Thanks,
Florian



Re: [RFC PATCH] mm: extend memfd with ability to create "secret" memory areas

2020-08-26 Thread Florian Weimer
* Andy Lutomirski:

>> I _believe_ there are also things like AES-NI that can get strong
>> protection from stuff like this.  They load encryption keys into (AVX)
>> registers and then can do encrypt/decrypt operations without the keys
>> leaving the registers.  If the key was loaded from a secret memory area
>> right into the registers, I think the protection from cache attacks
>> would be pretty strong.
>
> Except for context switches :)

An rseq sequence could request that the AVX registers should be
cleared on context switch.  (I'm mostly kidding.)

I think the main issue is that we do not have a good established
programming model to actually use such features and completely avoid
making copies of secret data.


Re: [PATCH v11 25/25] x86/cet/shstk: Add arch_prctl functions for shadow stack

2020-08-26 Thread Florian Weimer
* Dave Martin:

> On Tue, Aug 25, 2020 at 04:34:27PM -0700, Yu, Yu-cheng wrote:
>> On 8/25/2020 4:20 PM, Dave Hansen wrote:
>> >On 8/25/20 2:04 PM, Yu, Yu-cheng wrote:
>> I think this is more arch-specific.  Even if it becomes a new syscall,
>> we still need to pass the same parameters.
>> >>>
>> >>>Right, but without the copying in and out of memory.
>> >>>
>> >>Linux-api is already on the Cc list.  Do we need to add more people to
>> >>get some agreements for the syscall?
>> >What kind of agreement are you looking for?  I'd suggest just coding it
>> >up and posting the patches.  Adding syscalls really is really pretty
>> >straightforward and isn't much code at all.
>> >
>> 
>> Sure, I will do that.
>
> Alternatively, would a regular prctl() work here?

Is this something appliation code has to call, or just the dynamic
loader?

prctl in glibc is a variadic function, so if there's a mismatch between
the kernel/userspace syscall convention and the userspace calling
convention (for variadic functions) for specific types, it can't be made
to work in a generic way.

The loader can use inline assembly for system calls and does not have
this issue, but applications would be impacted by it.

Thanks,
Florian



Re: [PATCH v11 9/9] x86: Disallow vsyscall emulation when CET is enabled

2020-08-25 Thread Florian Weimer
* Andy Lutomirski:

> On Mon, Aug 24, 2020 at 5:30 PM Yu-cheng Yu  wrote:
>>
>> From: "H.J. Lu" 
>>
>> Emulation of the legacy vsyscall page is required by some programs built
>> before 2013.  Newer programs after 2013 don't use it.  Disallow vsyscall
>> emulation when Control-flow Enforcement (CET) is enabled to enhance
>> security.
>
> NAK.
>
> By all means disable execute emulation if CET-IBT is enabled at the
> time emulation is attempted, and maybe even disable the vsyscall page
> entirely if you can magically tell that CET-IBT will be enabled when a
> process starts, but you don't get to just disable it outright on a
> CET-enabled kernel.

Yeah, we definitely would have to revert/avoid this downstream.  People
definitely want to run glibc-2.12-era workloads on current kernels.
Thanks for catching it.

Florian



Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-08-02 Thread Florian Weimer
* Madhavan T. Venkataraman:

> Standardization
> -
>
> Trampfd is a framework that can be used to implement multiple
> things. May be, a few of those things can also be implemented in
> user land itself. But I think having just one mechanism to execute
> dynamic code objects is preferable to having multiple mechanisms not
> standardized across all applications.
>
> As an example, let us say that I am able to implement support for
> JIT code. Let us say that an interpreter uses libffi to execute a
> generated function. The interpreter would use trampfd for the JIT
> code object and get an address. Then, it would pass that to libffi
> which would then use trampfd for the trampoline. So, trampfd based
> code objects can be chained.

There is certainly value in coordination.  For example, it would be
nice if unwinders could recognize the trampolines during all phases
and unwind correctly through them (including when interrupted by an
asynchronous signal).  That requires some level of coordination with
the unwinder and dynamic linker.

A kernel solution could hide the intermediate state in a kernel-side
trap handler, but I think it wouldn't reduce the overall complexity.


Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor

2020-07-29 Thread Florian Weimer
* Andy Lutomirski:

> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is an non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.

libffi does something like this for iOS, I believe.

The only thing you really need is a PC-relative indirect call, with the
target address loaded from a different page.  The trampoline handler can
do all the rest because it can identify the trampoline from the stack.
Having a closure parameter loaded into a register will speed things up,
of course.

I still hope to transition libffi to this model for most Linux targets.
It really simplifies things because you don't have to deal with cache
flushes (on both the data and code aliases for SELinux support).

But the key observation is that efficient trampolines do not need
run-time code generation at all because their code is so regular.
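The "regular code, data-driven dispatch" idea can be simulated portably: every trampoline is the same logic and merely reports its own identity to a shared handler, which looks up the closure data. (Real trampolines derive the identity from the program counter in a few instructions of assembly; the explicit index below is a stand-in for that.)

```cpp
#include <cstddef>

typedef int (*target_fn)(int);

// Per-trampoline control data, living on the "data page".
struct closure {
    target_fn fn;
    int bound_arg;
};

static closure closures[4];

// The generic handler: all trampolines funnel here, and behavior is
// driven entirely by the closure table, not by generated code.
static int handler(std::size_t which, int arg) {
    const closure &c = closures[which];
    return c.fn(c.bound_arg + arg);
}

// Identical "trampolines" differing only in the identity they pass
// along -- the property that lets a single prebuilt code page be
// mapped as many times as needed.
template <std::size_t N>
int trampoline(int arg) {
    return handler(N, arg);
}

static int add_ten(int x) { return x + 10; }
```

After `closures[0] = {add_ten, 5};`, calling `trampoline<0>(1)` dispatches to `add_ten(6)` purely through the data table.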

Thanks,
Florian



Re: [PATCH v7 4/7] fs: Introduce O_MAYEXEC flag for openat2(2)

2020-07-26 Thread Florian Weimer
* Al Viro:

> On Thu, Jul 23, 2020 at 07:12:24PM +0200, Mickaël Salaün wrote:
>> When the O_MAYEXEC flag is passed, openat2(2) may be subject to
>> additional restrictions depending on a security policy managed by the
>> kernel through a sysctl or implemented by an LSM thanks to the
>> inode_permission hook.  This new flag is ignored by open(2) and
>> openat(2) because of their unspecified flags handling.  When used with
>> openat2(2), the default behavior is only to forbid to open a directory.
>
> Correct me if I'm wrong, but it looks like you are introducing a magical
> flag that would mean "let the Linux S&M take an extra special whip
> for this open()".
>
> Why is it done during open?  If the caller is passing it deliberately,
> why not have an explicit request to apply given torture device to an
> already opened file?  Why not sys_masochism(int fd, char *hurt_flavour),
> for that matter?

While I do not think this is appropriate language for a workplace, Al
has a point: If the auditing event can be generated on an already-open
descriptor, it would also cover scenarios like this one:

  perl < /path/to/script

Where the process that opens the file does not (and cannot) know that it
will be used for execution purposes.

Thanks,
Florian



Re: [PATCH] copy_xstate_to_kernel: Fix typo which caused GDB regression

2020-07-21 Thread Florian Weimer
* Kevin Buettner:

> This commit fixes a regression encountered while running the
> gdb.base/corefile.exp test in GDB's test suite.
>
> In my testing, the typo prevented the sw_reserved field of struct
> fxregs_state from being output to the kernel XSAVES area.  Thus the
> correct mask corresponding to XCR0 was not present in the core file
> for GDB to interrogate, resulting in the following behavior:
>
> [kev@f32-1 gdb]$ ./gdb -q testsuite/outputs/gdb.base/corefile/corefile testsuite/outputs/gdb.base/corefile/corefile.core
> Reading symbols from testsuite/outputs/gdb.base/corefile/corefile...
> [New LWP 232880]
>
> warning: Unexpected size of section `.reg-xstate/232880' in core file.
>
> With the typo fixed, the test works again as expected.
>
> Signed-off-by: Kevin Buettner 
> ---
>  arch/x86/kernel/fpu/xstate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 6a54e83d5589..9cf40a7ff7ae 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1022,7 +1022,7 @@ int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int of
>   copy_part(offsetof(struct fxregs_state, st_space), 128,
> &xsave->i387.st_space, &kbuf, &offset_start, &count);
>   if (header.xfeatures & XFEATURE_MASK_SSE)
> - copy_part(xstate_offsets[XFEATURE_MASK_SSE], 256,
> + copy_part(xstate_offsets[XFEATURE_SSE], 256,
> &xsave->i387.xmm_space, &kbuf, &offset_start, &count);
>   /*
>* Fill xsave->i387.sw_reserved value for ptrace frame:

Does this read out-of-bounds, potentially disclosing kernel memory?
Not if the system supports AVX, I assume.


Re: [RFC PATCH 0/4] rseq: Introduce extensible struct rseq

2020-07-15 Thread Florian Weimer
* Carlos O'Donell:

> On 7/13/20 11:03 PM, Mathieu Desnoyers wrote:
>> Recent discussion led to a solution for extending struct rseq. This is
>> an implementation of the proposed solution.
>> 
>> Now is a good time to agree on this scheme before the release of glibc
>> 2.32, just in case there are small details to fix on the user-space
>> side in order to allow extending struct rseq.
>
> Adding extensibility to the rseq registration process would be great,
> but we are out of time for the glibc 2.32 release.
>
> Should we revert rseq for glibc 2.32 and spend quality time discussing
> the implications of an extensible design, something that Google already
> says they are doing?
>
> We can, with a clear head, and an agreed upon extension mechanism
> include rseq in glibc 2.33 (release scheduled for Feburary 1st 2021).
> We release time boxed every 6 months, no deviation, so you know when
> your next merge window will be.
>
> We have already done the hard work of fixing the nesting signal
> handler issues, and glibc integration. If we revert today that will 
> also give time for Firefox and Chrome to adjust their sandboxes.
>
> Do you wish to go forward with rseq as we have it in glibc 2.32,
> or do you wish to revert rseq from glibc 2.32, discuss the extension
> mechanism, and put it back into glibc 2.33 with adjustments?

I posted the glibc revert:

  

I do not think we have any other choice at this point.

Thanks,
Florian



Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-15 Thread Florian Weimer
* Mathieu Desnoyers:

> - On Jul 15, 2020, at 9:42 AM, Florian Weimer fwei...@redhat.com wrote:
>> * Mathieu Desnoyers:
>> 
> [...]
>>> How would this allow early-rseq-adopter libraries to interact with
>>> glibc ?
>> 
>> Under all extension proposals I've seen so far, early adopters are
>> essentially incompatible with glibc rseq registration.  I don't think
>> you can have it both ways.
>
> The basic question I'm not sure about is whether we are allowed to increase
> the size and alignement of __rseq_abi from e.g. glibc 2.32 to glibc 2.33.

With the current mechanism (global TLS data symbol), we can do that
using symbol versioning.  That means that we can only do this on a
release boundary, and that it's incompatible with other libraries which
use an interposing unversioned symbol.

Thanks,
Florian



Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-15 Thread Florian Weimer
* Mathieu Desnoyers:

> So indeed it could be done today without upgrading the toolchains by
> writing custom assembler for each architecture to get the thread's
> struct rseq. AFAIU the ABI to access the thread pointer is fixed for
> each architecture, right ?

Yes, determining the thread pointer and accessing initial-exec TLS
variables is baked into the ABI.
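For illustration, an initial-exec TLS variable compiles down to a load at a fixed offset from the thread pointer, which is what makes per-architecture assembly access feasible for early adopters; the variable name below is made up:

```cpp
// GCC/Clang on Linux: force the initial-exec model explicitly, as an
// early-adopter library reaching for a fixed thread-pointer offset would.
__thread int fake_rseq_area __attribute__((tls_model("initial-exec"))) = 42;

int read_tls_value() {
    // Compiles to roughly: mov %fs:fake_rseq_area@tpoff, %eax  (x86-64)
    return fake_rseq_area;
}
```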

> How would this allow early-rseq-adopter libraries to interact with
> glibc ?

Under all extension proposals I've seen so far, early adopters are
essentially incompatible with glibc rseq registration.  I don't think
you can have it both ways.

Thanks,
Florian



Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-15 Thread Florian Weimer
* Mathieu Desnoyers:

> Practically speaking, I suspect this would mean postponing availability of
> rseq for widely deployed applications for a few more years ?

There is no rseq support in GCC today, so you have to write assembler
code anyway.

Thanks,
Florian



Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-14 Thread Florian Weimer
* Chris Kennelly:

> When glibc provides registration, is the anticipated use case that a
> library would unregister and reregister each thread to "upgrade" it to
> the most modern version of interface it knows about provided by the
> kernel?

Absolutely not, that is likely to break other consumers because an
expected rseq area becomes dormant instead.

> There, I could assume an all-or-nothing registration of the new
> feature--limited only by kernel availability for thread
> homogeneity--but inconsistencies across early adopter libraries would
> mean each thread would have to examine its own TLS to determine if a
> feature were available.

Exactly.  Certain uses of seccomp can also have this effect,
presenting a non-homogeneous view.


Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-14 Thread Florian Weimer
* Mathieu Desnoyers:

>> How are extensions going to affect the definition of struct rseq,
>> including its alignment?
>
> The alignment will never decrease. If the structure becomes large enough
> its alignment could theoretically increase. Would that be an issue ?

Telling the compiler that a struct is larger than it actually is, or that
it has more alignment than in memory, results in undefined behavior,
even if only fields are accessed in the smaller struct region.

An increase in alignment from 32 to 64 is perhaps not likely to have
this effect.  But the undefined behavior is still there, and has been
observed for mismatches like 8 vs 16.

>> As things stand now, glibc 2.32 will make the size and alignment of
>> struct rseq part of its ABI, so it can't really change after that.
>
> Can the size and alignment of a structure be defined as minimum alignment
> and size values ? For instance, those would be invariant for a given glibc
> version (if we always use the internal struct rseq declaration), but could
> be increased in future versions.

Not if we are talking about a global (TLS) data symbol.  No such changes
are possible there.  We have some workarounds for symbols that live
exclusively within glibc, but they don't work if there are libraries out
there which interpose the symbol.

>> With a different approach, we can avoid making the symbol size part of
>> the ABI, but then we cannot use the __rseq_abi TLS symbol.  As a result,
>> interoperability with early adopters would be lost.
>
> Do you mean with a function "getter", and then keeping that pointer around
> in a per-user TLS ? I would prefer to avoid that because it adds an extra
> pointer dereference on a fast path.

My choice would have been a function that returns the offset from the
thread pointer (which has to be unchanged regarding all threads).

Thanks,
Florian



Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-14 Thread Florian Weimer
* Mathieu Desnoyers:

> + /*
> +  * Very last field of the structure, to calculate size excluding padding
> +  * with offsetof().
> +  */
> + char end[];
>  } __attribute__((aligned(4 * sizeof(__u64))));

This makes the header incompatible with standard C++.

How are extensions going to affect the definition of struct rseq,
including its alignment?

As things stand now, glibc 2.32 will make the size and alignment of
struct rseq part of its ABI, so it can't really change after that.

With a different approach, we can avoid making the symbol size part of
the ABI, but then we cannot use the __rseq_abi TLS symbol.  As a result,
interoperability with early adopters would be lost.

One way to avoid this problem would be for every library to register its
own rseq area, of the appropriate size.  Then process-wide coordination
in userspace would not be needed.

Thanks,
Florian



Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-08 Thread Florian Weimer
* Christian Brauner:

> I've been following this a little bit. The kernel version itself doesn't
> really mean anything and the kernel version is imho not at all
> interesting to userspace applications. Especially for cross-distro
> programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
> openSUSE and god knows who what other distro what their fixed kernel
> version is.

And Red Hat Enterprise Linux only has a dozen or two kernel branches
under active maintenance, each with their own succession of version
numbers.  It's just not feasible.  Even figuring out the branch based
on the kernel version can be tricky!

> (Also, as a side-note. I see that you're passing struct rseq *rseq with
> a length argument but you are not versioning by size. Is that
> intentional? That basically somewhat locks you to the current struct
> rseq layout and means users might run into problems when you extend
> struct rseq in the future as they can't pass the new struct down to
> older kernels. The way we deal with this is now - rseq might precede
> this - is copy_struct_from_user()

The kernel retains the pointer after the system call returns.
Basically, ownership of the memory area is transferred to the kernel.
It's like set_robust_list in this regard.


Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-08 Thread Florian Weimer
* Mathieu Desnoyers:

> Allright, thanks for the insight! I'll drop these patches and focus only
> on the bugfix.

Thanks, much appreciated!


Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-07 Thread Florian Weimer
* Carlos O'Donell:

> It's not a great fit IMO. Just let the kernel version be the arbiter of
> correctness.

For manual review, sure.  But checking it programmatically does not
yield good results due to backports.  Even those who use the stable
kernel series sometimes pick up critical fixes beforehand, so it's not
reliably possible for a program to say, “I do not want to run on this
kernel because it has a bad version”.  We had a recent episode of this
with the Go runtime, which tried to do exactly this.


Re: [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix

2020-07-07 Thread Florian Weimer
I would like to point out that the subject is misleading: This is not
an ABI change.  It fixes the contents of the __rseq_abi TLS variable
(as glibc calls it), but that's it.

(Sorry, I should have mentioned this earlier.)


Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-07 Thread Florian Weimer
* Mathieu Desnoyers:

> Those are very good points. One possibility we have would be to let
> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
> flag. On kernels with the bug present, the cpu_id field is still good
> enough for typical uses of sched_getcpu() which does not appear to
> have a very strict correctness requirement on returning the right
> cpu number.
>
> Then libraries and applications which require a reliable cpu_id
> field could check this on their own by calling rseq with the
> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
> complex in __rseq_abi, and let each rseq user decide about its own
> fate: whether it uses rseq or keeps using an rseq-free fallback.
>
> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
> glibc, and want to check this flag on thread registration.

Well, you could add a bug fix level field to the __rseq_abi variable.
Then applications could check if the kernel has the appropriate level
of non-buggyness.  But the same thing could be useful for many other
kernel interfaces, and I haven't seen such a fix level value for them.
What makes rseq so special?

It won't help with the present bug, but maybe we should add an rseq
API sub-call that blocks future rseq registration for the thread.
Then we can add a glibc tunable that flips off rseq reliably if people
do not want to use it for some reason (and switch to the non-rseq
fallback code instead).  But that's going to help with future bugs
only.


Re: [RFC PATCH for 5.8 1/4] sched: Fix unreliable rseq cpu_id for new tasks

2020-07-07 Thread Florian Weimer
* Mathieu Desnoyers:

> While integrating rseq into glibc and replacing glibc's sched_getcpu
> implementation with rseq, glibc's tests discovered an issue with
> incorrect __rseq_abi.cpu_id field value right after the first time
> a newly created process issues sched_setaffinity.
>
> For the records, it triggers after building glibc and running tests, and
> then issuing:
>
>   for x in {1..2000} ; do posix/tst-affinity-static  & done
>
> and shows up as:
>
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 2, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0
> error: Unexpected CPU 138, expected 0

As far as I can tell, the glibc reproducer no longer shows the issue
with this patch applied.

Tested-By: Florian Weimer 


Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-07 Thread Florian Weimer
* Mathieu Desnoyers:

> commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks")
> addresses an issue with cpu_id field of newly created processes. Expose
> a flag which can be used by user-space to query whether the kernel
> implements this fix.
>
> Considering that this issue can cause corruption of user-space per-cpu
> data updated with rseq, it is recommended that user-space detects
> availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag
> either combined with registration or on its own before using rseq.

Presumably, the intent is that glibc uses RSEQ_FLAG_RELIABLE_CPU_ID to
register the rseq area.  That will surely prevent glibc itself from
activating rseq on broken kernels.  But if another rseq library
performs registration and has not been updated to use
RSEQ_FLAG_RELIABLE_CPU_ID, we still end up with an active rseq area
(and incorrect CPU IDs from sched_getcpu in glibc).  So further glibc
changes will be needed.  I suppose we could block third-party rseq
registration with a registration of a hidden rseq area (not
__rseq_abi).  But then the question is if any of the third-party rseq
users are expecting the EINVAL error code from their failed
registration.

The rseq registration state machine is quite tricky already, and the
need to use RSEQ_FLAG_RELIABLE_CPU_ID would make it even more
complicated.  Even if we implemented all the changes, it's all going
to be essentially dead, untestable code in a few months, when the
broken kernels are out of circulation.  It does not appear to be good
investment to me.


Re: [RFC PATCH for 5.8 0/4] rseq cpu_id ABI fix

2020-07-06 Thread Florian Weimer
* Mathieu Desnoyers:

> This is an RFC aiming for quick inclusion into the Linux kernel, unless
> we prefer reverting the entire rseq glibc integration and try again in 6
> months. Their upcoming release is on August 3rd, so we need to take a
> decision on this matter quickly.

Just to clarify here, I don't think it's necessary to change glibc so
that it only enables the rseq functionality if this particular bug is
not present.  From our perspective, it's just another kernel bug.

If you truly feel we must not enable this feature in its present
kernel state, then we need a working patch on all sides by the end of
the week because the hard ABI freeze for glibc 2.32 is coming up, and
without any other patches, the only safe choice to prevent glibc
from using slightly broken rseq support would be to back out the rseq
patches.

But again, I don't think such drastic action is necessary.


Re: [PATCH 2/3] Linux: Use rseq in sched_getcpu if available (v9)

2020-07-06 Thread Florian Weimer
* Mathieu Desnoyers:

> - On Jul 6, 2020, at 1:50 PM, Florian Weimer fwei...@redhat.com wrote:
>
>> * Mathieu Desnoyers:
>> 
>>> Now we need to discuss how we introduce that fix in a way that will
>>> allow user-space to trust the __rseq_abi.cpu_id field's content.
>> 
>> I don't think that's necessary.  We can mention it in the glibc
>> distribution notes on the wiki.
>> 
>>> The usual approach to kernel bug fixing is typically to push the fix,
>>> mark it for stable kernels, and expect everyone to pick up the
>>> fixes. I wonder how comfortable glibc would be to replace its
>>> sched_getcpu implementation with a broken-until-fixed kernel rseq
>>> implementation without any mechanism in place to know whether it can
>>> trust the value of the cpu_id field. I am extremely reluctant to do
>>> so.
>> 
>> We have already had similar regressions in sched_getcpu, and we didn't
>> put anything into glibc to deal with those.
>
> Was that acceptable because having a wrong cpu number would never trigger
> corruption, only slowdowns ?

First of all, it's a kernel bug.  It's rare that we put workarounds for
kernel bugs into glibc.

And yes, in pretty much all cases it's just a performance issue for
sched_getcpu.  When you know the CPU ID of a thread due to pinning to a
single CPU, why would you call sched_getcpu?  (That's the case where you
could get corruption in theory.)

> In the case of rseq, having the wrong cpu_id value is a real issue
> which will lead to corruption and crashes. So I maintain my reluctance
> to introduce the fix without any way for userspace to know whether the
> cpu_id field value is reliable.

Yes, for rseq itself, the scenario is somewhat different.  Still, it's
just another kernel bug.  There will be others. 8-/

From a scheduling point of view, it looks tough to get the magic flag into
the mainline kernel in time for the upcoming glibc 2.32 release.  If you
insist on registering rseq only if the bug is not present, we'll
probably have to back out some or all of the rseq changes.

Thanks,
Florian



Re: [PATCH 2/3] Linux: Use rseq in sched_getcpu if available (v9)

2020-07-06 Thread Florian Weimer
* Mathieu Desnoyers:

> Now we need to discuss how we introduce that fix in a way that will
> allow user-space to trust the __rseq_abi.cpu_id field's content.

I don't think that's necessary.  We can mention it in the glibc
distribution notes on the wiki.

> The usual approach to kernel bug fixing is typically to push the fix,
> mark it for stable kernels, and expect everyone to pick up the
> fixes. I wonder how comfortable glibc would be to replace its
> sched_getcpu implementation with a broken-until-fixed kernel rseq
> implementation without any mechanism in place to know whether it can
> trust the value of the cpu_id field. I am extremely reluctant to do
> so.

We have already had similar regressions in sched_getcpu, and we didn't
put anything into glibc to deal with those.

Just queue the fix for the stable kernels.  I expect that all
distributions track stable kernel branches in some way, so just put into
the kernel commit message that this commit is needed for a working
sched_getcpu in glibc 2.32 and later.

Once the upstream fix is in Linus' tree, I'm going to file a request to
backport the fix into Red Hat Enterprise Linux 8.

Thanks for finding the root cause so quickly,
Florian



Re: [PATCH 2/3] Linux: Use rseq in sched_getcpu if available (v9)

2020-07-06 Thread Florian Weimer
* Mathieu Desnoyers:

> When available, use the cpu_id field from __rseq_abi on Linux to
> implement sched_getcpu().  Fall-back on the vgetcpu vDSO if
> unavailable.

I've pushed this to glibc master, but unfortunately it looks like this
exposes a kernel bug related to affinity mask changes.

After building and testing glibc, this

  for x in {1..2000} ; do posix/tst-affinity-static  & done

produces some “error:” lines for me:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

“expected 0” is a result of how the test has been written; it bails out
on the first failure, which happens with CPU ID 0.

Smaller systems can use a smaller count than 2000 to reproduce this.  It
also happens sporadically when running the glibc test suite itself
(which is why it took further testing to reveal this issue).

I can reproduce this with the Debian 4.19.118-2+deb10u1 kernel, the
Fedora 5.6.19-300.fc32 kernel, and the Red Hat Enterprise Linux kernel
4.18.0-193.el8 (all x86_64).

As to the cause, I'd guess that the exit path in the sched_setaffinity
system call fails to update the rseq area, so that userspace can observe
the outdated CPU ID there.

Thanks,
Florian



Re: [PATCH 1/3] glibc: Perform rseq registration at C startup and thread creation (v22)

2020-07-02 Thread Florian Weimer
* Mathieu Desnoyers via Libc-alpha:

> Register rseq TLS for each thread (including main), and unregister for
> each thread (excluding main).  "rseq" stands for Restartable Sequences.
>
> See the rseq(2) man page proposed here:
>   https://lkml.org/lkml/2018/9/19/647
>
> Those are based on glibc master branch commit 3ee1e0ec5c.
> The rseq system call was merged into Linux 4.18.
>
> The TLS_STATIC_SURPLUS define is increased to leave additional room for
> dlopen'd initial-exec TLS, which keeps elf/tst-auditmany working.
>
> The increase (76 bytes) is larger than 32 bytes because it has not been
> increased in quite a while.  The cost in terms of additional TLS storage
> is quite significant, but it will also obscure some initial-exec-related
> dlopen failures.

We need another change to get this working on most non-x86
architectures:

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 817bcbbf59..ca13778ca9 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -134,6 +134,12 @@ void
 _dl_determine_tlsoffset (void)
 {
   size_t max_align = TLS_TCB_ALIGN;
+  /* libc.so with rseq has TLS with 32-byte alignment.  Since TLS is
+ initialized before audit modules are loaded and slotinfo
+ information is available, this is not taken into account below in
+ the audit case.  */
+  max_align = MAX (max_align, 32U);
+
   size_t freetop = 0;
   size_t freebottom = 0;

This isn't visible on x86-64 because TLS_TCB_ALIGN is already 64 there.

I plan to re-test with this fix and push the series.

Carlos, is it okay if I fold in the dl-tls.c change if testing looks
good?

Thanks,
Florian



Re: [PATCH 1/3] glibc: Perform rseq registration at C startup and thread creation (v21)

2020-06-24 Thread Florian Weimer
* Mathieu Desnoyers:

>> I think we should keep things simple on the glibc side for now and
>> make these changes to the kernel headers first.
>
> Just to be sure I understand what you mean by "keep things simple", do you
> recommend removing the following lines completely for now from sys/rseq.h ?
>
> /* Ensure the compiler supports rseq_align.  */
> __rseq_static_assert (__rseq_alignof (struct rseq_cs) >= 32, "alignment");
> __rseq_static_assert (__rseq_alignof (struct rseq) >= 32, "alignment");

Yes, that's what I meant.

Thanks,
Florian



Re: [PATCH 1/3] glibc: Perform rseq registration at C startup and thread creation (v21)

2020-06-24 Thread Florian Weimer
* Mathieu Desnoyers:

>> I'm still worried that __rseq_static_assert and __rseq_alignof will show
>> up in the UAPI with textually different definitions.  (This does not
>> apply to __rseq_tls_model_ie.)
>
> What makes this worry not apply to __rseq_tls_model_ie ?

It's not needed by the kernel header because it doesn't contain a
__rseq_abi declaration.

>> 
>> Is my worry unfounded?
>
> So AFAIU you worry that eventually sys/rseq.h and linux/rseq.h carry different
> definitions of __rseq_static_assert and __rseq_alignof.
>
> Indeed, I did not surround those #define with #ifndef/#endif. Maybe we should?
>
> Just in case the definitions end up being different (worse case scenario), we
> should expect their behavior to be pretty much equivalent. So going for the
> following should address your concern I think:

I think we should keep things simple on the glibc side for now and
make these changes to the kernel headers first.

Thanks,
Florian



Re: [PATCH 1/3] glibc: Perform rseq registration at C startup and thread creation (v21)

2020-06-24 Thread Florian Weimer
* Mathieu Desnoyers:

> diff --git a/manual/threads.texi b/manual/threads.texi
> index bb7a42c655..d5069d5581 100644
> --- a/manual/threads.texi
> +++ b/manual/threads.texi

> +@deftypevar {struct rseq} __rseq_abi
> +@standards{Linux, sys/rseq.h}
> +@Theglibc{} implements a @code{__rseq_abi} TLS symbol to interact with
> +the Restartable Sequences system call.  The layout of this structure is
> +defined by the @file{sys/rseq.h} header.  Registration of each thread's
> +@code{__rseq_abi} is performed by @theglibc{} at library initialization
> +and thread creation. The manual for the rseq system call can be found
> +at @uref{https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2}.

Should be “creation.  The” (two spaces after a sentence-ending period).

> diff --git a/sysdeps/unix/sysv/linux/sys/rseq.h b/sysdeps/unix/sysv/linux/sys/rseq.h
> new file mode 100644
> index 00..5e118c1781
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/sys/rseq.h

> +#ifdef __cplusplus
> +# if  __cplusplus >= 201103L
> +#  define __rseq_static_assert(expr, diagnostic) static_assert (expr, diagnostic)
> +#  define __rseq_alignof(type)   alignof (type)
> +#  define __rseq_tls_storage_class   thread_local
> +# endif
> +#elif (defined __STDC_VERSION__ ? __STDC_VERSION__ : 0) >= 201112L
> +# define __rseq_static_assert(expr, diagnostic)  _Static_assert (expr, diagnostic)
> +# define __rseq_alignof(type)_Alignof (type)
> +# define __rseq_tls_storage_class_Thread_local
> +#endif
> +
> +#ifndef __rseq_static_assert
> +/* Try to use _Static_assert macro from sys/cdefs.h.  */
> +# ifdef _Static_assert
> +#  define __rseq_static_assert(expr, diagnostic) _Static_assert (expr, diagnostic)
> +# else
> +#  define __rseq_static_assert(expr, diagnostic) /* Nothing.  */
> +# endif
> +#endif
> +
> +/* Rely on GNU extensions for older standards and tls model.  */
> +#ifdef __GNUC__
> +# ifndef __rseq_alignof
> +#  define __rseq_alignof(x) __alignof__ (x)
> +# endif
> +# define __rseq_tls_model_ie __attribute__ ((__tls_model__ ("initial-exec")))
> +#else
> +/* Specifying the TLS model on the declaration is optional.  */
> +# define __rseq_tls_model_ie /* Nothing.  */
> +#endif

I'm still worried that __rseq_static_assert and __rseq_alignof will show
up in the UAPI with textually different definitions.  (This does not
apply to __rseq_tls_model_ie.)

Is my worry unfounded?

Thanks,
Florian



Re: Add a new fchmodat4() syscall, v2

2020-06-09 Thread Florian Weimer
* Palmer Dabbelt:

> This patch set adds fchmodat4(), a new syscall. The actual
> implementation is super simple: essentially it's just the same as
> fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> I've attempted to make this match "man 2 fchmodat" as closely as
> possible, which says EINVAL is returned for invalid flags (as opposed to
> ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> I have a sketch of a glibc patch that I haven't even compiled yet, but
> seems fairly straight-forward:

What's the status here?  We'd really like to see this system call in the
kernel because our emulation in glibc has its problems (especially if
/proc is not mounted).

Thanks,
Florian



Re: [PATCH glibc 1/3] glibc: Perform rseq registration at C startup and thread creation (v20)

2020-06-03 Thread Florian Weimer
* Mathieu Desnoyers:

> - On Jun 3, 2020, at 8:05 AM, Florian Weimer fwei...@redhat.com wrote:
>
>> * Mathieu Desnoyers:
>> 
>>> +#ifdef __cplusplus
>>> +# if  __cplusplus >= 201103L
>>> +#  define __rseq_static_assert(expr, diagnostic) static_assert (expr, diagnostic)
>>> +#  define __rseq_alignof(type)   alignof (type)
>>> +#  define __rseq_alignas(x)  alignas (x)
>>> +#  define __rseq_tls_storage_class   thread_local
>>> +# endif
>>> +#elif (defined __STDC_VERSION__ ? __STDC_VERSION__ : 0) >= 201112L
>>> +# define __rseq_static_assert(expr, diagnostic)  _Static_assert (expr, diagnostic)
>>> +# define __rseq_alignof(type)_Alignof (type)
>>> +# define __rseq_alignas(x)   _Alignas (x)
>>> +# define __rseq_tls_storage_class_Thread_local
>>> +#endif
>> 
>> This does not seem to work.  I get this with GCC 9:
>> 
>> In file included from /tmp/cih_test_gsrKbC.cc:8:0:
>> ../sysdeps/unix/sysv/linux/sys/rseq.h:42:50: error: attribute ignored [-Werror=attributes]
>> #  define __rseq_alignas(x)  alignas (x)
>>  ^
>> ../sysdeps/unix/sysv/linux/sys/rseq.h:122:14: note: in expansion of macro ‘__rseq_alignas’
>> uint32_t __rseq_alignas (32) version;
>>  ^
>
> Is that when compiling C or C++ code ? If it's C code, I would expect
> "_Alignas" to be used, not "alignas".
>
> Which exact version of gcc do you use ?

C++ code.  CXX was set to this compiler at configure time:

gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC) 

>> In any case, these changes really have to go into the UAPI header first.
>> Only the __thread handling should remain.  Otherwise, we'll have a tough
>> situation on our hands changing the UAPI header, without introducing
>> macro definition conflicts.  I'd suggest to stick to the aligned
>> attribute for the time being, like the current UAPI headers.
>
> OK. Should I do that in a separate patch, or you do it on top of my patchset,
> or should I re-spin another round of the series ?

I think the initial commit should mirror the current UAPI header
contents.

Keep the macros for the UAPI patch though. 8-)  We can pick up these
changes once they have been merged into Linux.

Thanks,
Florian



Re: [PATCH glibc 1/3] glibc: Perform rseq registration at C startup and thread creation (v20)

2020-06-03 Thread Florian Weimer
* Mathieu Desnoyers:

> +#ifdef __cplusplus
> +# if  __cplusplus >= 201103L
> +#  define __rseq_static_assert(expr, diagnostic) static_assert (expr, diagnostic)
> +#  define __rseq_alignof(type)   alignof (type)
> +#  define __rseq_alignas(x)  alignas (x)
> +#  define __rseq_tls_storage_class   thread_local
> +# endif
> +#elif (defined __STDC_VERSION__ ? __STDC_VERSION__ : 0) >= 201112L
> +# define __rseq_static_assert(expr, diagnostic)  _Static_assert (expr, diagnostic)
> +# define __rseq_alignof(type)_Alignof (type)
> +# define __rseq_alignas(x)   _Alignas (x)
> +# define __rseq_tls_storage_class_Thread_local
> +#endif

This does not seem to work.  I get this with GCC 9:

In file included from /tmp/cih_test_gsrKbC.cc:8:0:
../sysdeps/unix/sysv/linux/sys/rseq.h:42:50: error: attribute ignored [-Werror=attributes]
 #  define __rseq_alignas(x)  alignas (x)
  ^
../sysdeps/unix/sysv/linux/sys/rseq.h:122:14: note: in expansion of macro ‘__rseq_alignas’
 uint32_t __rseq_alignas (32) version;
  ^

In any case, these changes really have to go into the UAPI header first.
Only the __thread handling should remain.  Otherwise, we'll have a tough
situation on our hands changing the UAPI header, without introducing
macro definition conflicts.  I'd suggest to stick to the aligned
attribute for the time being, like the current UAPI headers.

The result looks okay to me.

I'm still waiting for feedback from other maintainers whether the level
of documentation and testing is appropriate.

Thanks,
Florian


