from:"Carlos O'Donell"

Re: [PATCH v4 2/4] x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ

2021-01-20 Thread Carlos O'Donell

On 1/15/21 4:10 PM, Chang S. Bae via Libc-alpha wrote:
> Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
> use by all architectures with sigaltstack(2). Over time, the hardware state
> size grew, but these constants did not evolve. Today, literal use of these
> constants on several architectures may result in signal stack overflow, and
> thus user data corruption.
> 
> A few years ago, the ARM team addressed this issue by establishing
> getauxval(AT_MINSIGSTKSZ), such that the kernel can supply at runtime value
> that is an appropriate replacement on the current and future hardware.
> 
> Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
> added for ARM in commit 94b07c1f8c39 ("arm64: signal: Report signal frame
> size to userspace via auxv").

The value of AT_MINSIGSTKSZ has been generic ABI for all architectures since
2018 when it was added to glibc's generic elf.h because of arm64.

Rather than define this just for x86 may we please put this into the generic
headers, since it has been there since 2018?

Any userspace code that might be inspecting for AT_MINSIGSTKSZ is going to
expect a value of 51.

No other architecture has a define for value 51. This way we avoid any accidents
with the value being different for architectures.
 
> Reported-by: Florian Weimer 
> Fixes: c2bc11f10a39 ("x86, AVX-512: Enable AVX-512 States Context Switch")
> Signed-off-by: Chang S. Bae 
> Reviewed-by: Len Brown 
> Cc: H.J. Lu 
> Cc: Fenghua Yu 
> Cc: Dave Martin 
> Cc: Michael Ellerman 
> Cc: x...@kernel.org
> Cc: libc-al...@sourceware.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=153531
> ---
>  arch/x86/include/asm/elf.h | 4 
>  arch/x86/include/uapi/asm/auxvec.h | 6 --
>  arch/x86/kernel/signal.c   | 5 +
>  3 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index b9a5d488f1a5..044b024abea1 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -311,6 +311,7 @@ do {  
> \
>   NEW_AUX_ENT(AT_SYSINFO, VDSO_ENTRY);\
>   NEW_AUX_ENT(AT_SYSINFO_EHDR, VDSO_CURRENT_BASE);\
>   }   \
> + NEW_AUX_ENT(AT_MINSIGSTKSZ, get_sigframe_size());   
> \
>  } while (0)
>  
>  /*
> @@ -327,6 +328,7 @@ extern unsigned long task_size_32bit(void);
>  extern unsigned long task_size_64bit(int full_addr_space);
>  extern unsigned long get_mmap_base(int is_legacy);
>  extern bool mmap_address_hint_valid(unsigned long addr, unsigned long len);
> +extern unsigned long get_sigframe_size(void);
>  
>  #ifdef CONFIG_X86_32
>  
> @@ -348,6 +350,7 @@ do {  
> \
>   if (vdso64_enabled) \
>   NEW_AUX_ENT(AT_SYSINFO_EHDR,\
>   (unsigned long __force)current->mm->context.vdso); \
> + NEW_AUX_ENT(AT_MINSIGSTKSZ, get_sigframe_size());   
> \
>  } while (0)
>  
>  /* As a historical oddity, the x32 and x86_64 vDSOs are controlled together. 
> */
> @@ -356,6 +359,7 @@ do {  
> \
>   if (vdso64_enabled) \
>   NEW_AUX_ENT(AT_SYSINFO_EHDR,\
>   (unsigned long __force)current->mm->context.vdso); \
> + NEW_AUX_ENT(AT_MINSIGSTKSZ, get_sigframe_size());   
> \
>  } while (0)
>  
>  #define AT_SYSINFO   32
> diff --git a/arch/x86/include/uapi/asm/auxvec.h 
> b/arch/x86/include/uapi/asm/auxvec.h
> index 580e3c567046..edd7808060e6 100644
> --- a/arch/x86/include/uapi/asm/auxvec.h
> +++ b/arch/x86/include/uapi/asm/auxvec.h
> @@ -10,11 +10,13 @@
>  #endif
>  #define AT_SYSINFO_EHDR  33
>  
> +#define AT_MINSIGSTKSZ   51

This definition should go into the generic auxvec.h.

We could also remove the arm64 define, but that is orthogonal arch cleanup.

> +
>  /* entries in ARCH_DLINFO: */
>  #if defined(CONFIG_IA32_EMULATION) || !defined(CONFIG_X86_64)
> -# define AT_VECTOR_SIZE_ARCH 2
> +# define AT_VECTOR_SIZE_ARCH 3
>  #else /* else it's non-compat x86-64 */
> -# define AT_VECTOR_SIZE_ARCH 1
> +# define AT_VECTOR_SIZE_ARCH 2
>  #endif
>  
>  #endif /* _ASM_X86_AUXVEC_H */
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 138a9f5b78d8..761d856f8ef7 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -716,6 +716,11 @@ void __init init_sigframe_size(void)
>   max_frame_size = round_up(max_frame_size,

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-11-25 Thread Carlos O'Donell

On 11/25/20 7:17 PM, Thomas Gleixner wrote:
> Carlos, Petr,
> 
> On Wed, Nov 25 2020 at 15:37, Carlos O'Donell wrote:
>> On 11/19/20 7:14 PM, Thomas Gleixner wrote:
>>> So from my point of view asking for distorted time still _is_ a request
>>> for ponies.
>>
>> I'm happy if you say it's more work than the value it provides.
> 
> Thinking more about it. Would a facility which provides:
> 
>  CLOCK_FAKE_MONOTONIC|BOOTTIME|REALTIME
> 
> where you can go wild on setting time to whatever you want solve
> your problem?

We would need a way to inject CLOCK_FAKE_* in lieu of the real
constants.

There are only two straight forward ways I know how to do that
and they aren't very useful e.g. alternative build, syscall hot-path
debug code to alter the constant.

We might write a syscall interception framework using seccomp
and SECCOMP_RET_TRACE, but that involves ptrace'ing the process
under test, and is equivalent to a micro-sandbox. I'm not against
that idea for testing; we would test what we ship.

I don't think eBPF can modify the incoming arguments.

I need to go check if systemtap can modify incoming arguments;
I've never done that in any script.

In what other ways can we inject the new constants?

-- 
Cheers,
Carlos.

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-11-25 Thread Carlos O'Donell

On 11/19/20 7:14 PM, Thomas Gleixner wrote:
> I hope you are aware that the time namespace offsets have to be set
> _before_ the process starts and can't be changed afterwards,
> i.e. settime() is not an option.

I not interested in settime(). I saw Petr's request and forwarded it on
here to further the educational conversation about CLOCK_REALTIME and
cement a consensus around this issue. I'm happy to evangelize that we
won't support settime() for the specific reasons you call out and that
way I can give architectural guidance to setup systems in a particular
way to use CRIU or VM+container if needed.

> That might limit the usability for your use case and this can't be
> changed at all because there might be armed timers and other time
> related things which would start to go into full confusion mode.

The use of time in these cases, from first principles, seems to
degenerate into:

* Verify a time-dependent action is correct.

In glibc's case:

* Verify various APIs after y2038.

In Petr's case he could split the test into two such tests
but he would have to reproduce the system state for the second
test and I expect he wants to avoid that.

In that case I think Petr has to use CRIU to start the container,
stop it, advance time, and restart. That should work perfectly
in that use case and solve all the problems by relying on the work
done by CRIU.

> The supported use case is container life migration and that _is_ very
> careful about restoring time and armed timers and if their user space
> tools screw it up then they can keep the bits and pieces.

Agreed.

> So in order to utilize that you'd have to checkpoint the container,
> manipulate the offsets and restore it.

Or use the same mechanisms CRIU uses.

I can't rely on CRIU because I have to bootstrap a toolchain and userspace.
We should be able to write a thin veneer into our own testing wrapper and
emulate whatever CRIU does. We know apriori that our test framework starts
the test without anything having been executed yet. So we have that benefit.
Currently we unshare() for NEWUSER/NEWPID/NEWNS, but I expect that will
get a little more advanced. Already having these namespaces helps immensely
when adding fs-related tests.

> Aside of this, there are other things, e.g. file times, packet
> timestamps etc. which are based on CLOCK_REALTIME. What to do about
> them? Translate these to/from name space time or not? There is a long
> list of other horrors which are related to that.

We haven't even started testing for the upcoming negative leap second ;-)

> So _you_ might say, that you don't care about file times, RTC, timers
> expiring at the wrong time, packet timestamps and whatever.

I do care about them, but only given certain contexts.

> But then the next test dude comes around and want's to test exactly
> these interfaces and we have to slap the time namespace conversions for
> REALTIME and TAI all over the place because we already support the
> minimal thing.

That's a decision you need to make when asked those questions.

> Can you see why this is a slippery slope and why I'm extremly reluctant
> to even provide the minimal 'distort realtime when the namespace starts'
> support?

I would argue this is a slippery slope fallacy. If and when we get better
vm+container support we just tear all this code out and tell people to
start using those frameworks. The vm+container frameworks have independent
reasons to exist and so will continue to improve for security isolation
purposes and end up solving time testing issues by allowing us complete
control over the VMs time.

>> Hopefully this ilustrates that real time name space is not "request for
>> ponny" :-)
> 
> I can understand your pain and why you want to distort time, but please
> understand that timekeeping is complex. The primary focus must be
> correctness, scalability and maintainability which is already hard
> enough to achieve. Just for the perspective: It took us only 8 years to
> get the kernel halfways 2038 ready (filesystems still outstanding).

I agree. The upstream glibc community has been working on y2038 since 2018;
not as long as the kernel.

> So from my point of view asking for distorted time still _is_ a request
> for ponies.

I'm happy if you say it's more work than the value it provides.

> The fixed offsets for clock MONOTONIC/BOOTTIME are straight forward,
> absolutely make sense and they have a limited scope of exposure. clock
> REALTIME/TAI are very different beasts which entail a slew of horrors.
> Adding settime() to the mix makes it exponentially harder.

Right.

-- 
Cheers,
Carlos.

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-11-19 Thread Carlos O'Donell

On 11/6/20 7:47 PM, Thomas Gleixner wrote:
> On Thu, Nov 05 2020 at 12:25, Carlos O'Donell wrote:
>> On 10/30/20 9:38 PM, Thomas Gleixner wrote:
>> If kata grows up quickly perhaps this entire problem becomes solved, but 
>> until
>> then I continue to have a testing need for a distinct CLOCK_REALTIME in a
>> time namespace (and it need not be unconditional, if I have to engage magic
>> then I'm happy to do that).
> 
> Conditional, that might be a way to go.
> 
> Would CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME be a way to go? IOW,
> something which is clearly in the debug section of the kernel which wont
> get turned on by distros (*cough*) and comes with a description that any
> bug reports against it vs. time correctness are going to be ignored.

Yes. I would be requiring CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME.

Let me be clear though, the distros have *+debug kernels for which this
CONFIG_DEBUG_* could get turned on? In Fedora *+debug kernels we enable all
sorts of things like CONFIG_DEBUG_OBJECTS_* and CONFIG_DEBUG_SPINLOCK etc.
etc. etc.

I would push Fedora/RHEL to ship this in the *+debug kernels. That way I can 
have
this on for local test/build cycle. Would you be OK with that?

We could have it disabled by default but enabled via proc like
unprivileged_userns_clone was at one point? I want to avoid accidental use in
Fedora *+debug kernels unless the developer is actively going to run tests that
require time manipulation e.g. thousands of DNSSEC tests with timeouts [1].

I also need a way to determine the feature is enabled or disabled so I can XFAIL
the tests and tell the developer they need to turn on the feature in the host
kernel (and not to complain when CLOCK_REALTIME is wrong). A proc interface 
solves
this in a straight forward way.

I could then also tell my hardware partners to turn it on during certain 
test/build
cycles. It violates "ship what you test" but increases test coverage and can be
run as a distinct test cycle. I could also have our internal builders turn this
feature on so we can run rpm %check phases with this feature enabled (operations
might refuse, but in that case my day-to-day developer testing still helps by
orders of magnitude).

Notes:
[1] Petr Špaček commented on DNSSEC and expiration testing as another 
real-world testing
scenario: https://sourceware.org/pipermail/libc-alpha/2020-November/119785.html
Still a testing scenario, but an example outside of glibc for networking, where 
they
have a need to execute thousands of tests with accelerated timeout. If 
vm+containers
catches up, and I think they will, we'll have a solution in a few years.

>> * Adding CLOCK_REALTIME to the kernel is a lot of work given the expected
>>   guarantees for a local system.
> 
> Correct.
> 
>> * CLOCK_REALTIME is an expensive resource to maintain, even more expensive
>>   than other resources where the kernel can balance their usage.
> 
> Correct.
> 
>> * On balance it would be better to use vm or vm+containers e.g. kata as a
>>   solution to having CLOCK_REALTIME distinct in the container.
> 
> That'd be the optimal solution, but the above might be a middle ground.

Agreed.

-- 
Cheers,
Carlos.

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-11-05 Thread Carlos O'Donell

On 10/30/20 9:38 PM, Thomas Gleixner wrote:
> Coming back to your test coverage argument. I really don't see a problem
> with the requirement of having qemu installed in order to run 'make
> check'.

Cost. It's is cheaper and easier to maintain and deploy containers.

A full VM requires maintaining and updating images, and kernel builds
independent of what the developer is using for their development system.
This goes out of date quickly and needs a lot of resources to maintain.

When you get away from a VM you can then engage the entire developer
community to just run your userspace testing on *their* hardware on *their*
kernels. So I can go to Arm, Intel, AMD, IBM, SUSE, Red Hat, etc. and say:
"All you need to do is run 'make check' and the tests will verify your
hardware and kernel are working correctly." Those developers don't want their
system clocks adjusted during testing, and they are as busy as you and I are.

Container registries and tooling are much lighter weight and support layering
changes on top of base images in ways which allow different testing scenarios.
I don't have any desire to build a similar ecosystem for VM images or wait for
VM+container (kata) tooling to grow up.

If kata grows up quickly perhaps this entire problem becomes solved, but until
then I continue to have a testing need for a distinct CLOCK_REALTIME in a
time namespace (and it need not be unconditional, if I have to engage magic
then I'm happy to do that).

> If you can't ask that from your contributors, then asking me to provide
> you a namespace magic is just hillarious. The contributor who refuses to
> install qemu will also insist to run on some last century kernel which
> does not even know about name spaces at all.

I don't disagree with you, it is *absolutely* a VM tooling issue, and that
containers are easier to maintain, and deploy. With namespaces I can build
glibc, a sysroot, and run it in isolation very quickly.

Just so I understand, let me reiterate your position:

* Adding CLOCK_REALTIME to the kernel is a lot of work given the expected
  guarantees for a local system.

* CLOCK_REALTIME is an expensive resource to maintain, even more expensive
  than other resources where the kernel can balance their usage.

* On balance it would be better to use vm or vm+containers e.g. kata as a
  solution to having CLOCK_REALTIME distinct in the container.
 Thanks for your feedback.

-- 
Cheers,
Carlos.

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-10-30 Thread Carlos O'Donell

On 10/30/20 4:06 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 12:58, Carlos O'Donell wrote:
>> I expect that more requests for further time isolation will happen
>> given the utility of this in containers.
> 
> There was a lengthy discussion about this and the only "usecase" which
> was brought up was having different NTP servers in name spaces, i.e. the
> leap second ones and the smearing ones.

In the non-"request for ponies" category:

* Running legacy 32-bit applications in containers with CLOCK_REALTIME set
  to some value below y2038.

* Testing kernel and userspace clock handling code without needing to
  run on bare-metal, VM, or other.

> Now imagine 1000 containers each running their own NTP. Guess what the
> host does in each timer interrupt? Chasing 1000 containers and update
> their notion of CLOCK_REALTIME. In the remaining 5% CPU time the 1000
> containers can do their computations.

How is this different than balancing any other resource that you give
to a container/vm on a host?

Can you enable 1000 containers running smbd/nmbd and expect to get
great IO performance?

> But even if you restrict it to a trivial offset without NTP
> capabilities, what's the semantics of that offset when the host time is
> set?

Now you're talking about an implementation. This thread is simply
"Would we implement CLOCK_REALTIME?" Is the answer "Maybe, if we solve
all these other problems?"

>> If we have to use qemu today then that's where we're at, but again
>> I expect our use case is representative of more than just glibc.
> 
> For testing purposes it might be. For real world use cases not so
> much. People tend to rely on the coordinated nature of CLOCK_TAI and
> CLOCK_REALTIME.

Except we have two real world use cases, at the top of this email, 
that could extend to a lot of software. We know legacy 32-bit 
applications exist that will break with CLOCK_REALTIME past
y2038. Software exists that manipulates time and needs testing
with specific time values e.g. month crossings, day crossings,
leap year crossings, etc.

>> Does checkpointing work robustly when userspace APIS use 
>> CLOCK_REALTIME (directly or indirectly) in the container?
> 
> AFAICT, yes. That was the conclusion over the lenghty discussion about
> time name spaces and their requirements.

If this is the case then have we established behaviours that
happen when such processes are migrated to other systems with
different CLOCK_REALTIME clocks? Would these behaviours serve
as the basis of how CLOCK_REALTIME in a namespace would behave?

That is to say that migrating a container to a system with a
different CLOCK_REALTIME should behave similarly to what happens
when CLOCK_REALTIME is changed locally and you have a container
with a unique CLOCK_REALTIME?

> Here is the Linux plumber session related to that:
>  https://www.youtube.com/watch?v=sjRUiqJVzOA

Thanks. I watched the session. Informative :-)

-- 
Cheers,
Carlos.

Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

2020-10-30 Thread Carlos O'Donell

On 10/30/20 11:10 AM, Thomas Gleixner via Libc-alpha wrote:
> On Fri, Oct 30 2020 at 10:02, Zack Weinberg wrote:
>> On Fri, Oct 30, 2020 at 9:57 AM Cyril Hrubis  wrote:
 According to patch description [1] and time_namespaces documentation
 [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
 and overhead in the kernel.
>> ...
 To be more specific - [if this were supported] it would be possible to 
 modify time after time_t
 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
 regression tests on the host system (64 bit one). By using Linux time
 namespaces the system time will not be affected in any way.
>>>
>>> And what's exactly wrong with moving the system time forward for a
>>> duration of the test?
>>
>> Interference with other processes on the same computer?  Some of us
>> *do* like to run the glibc test suite on computers not entirely
>> devoted to glibc CI.
> 
> That's what virtual machines are for.

Certainly, that is always an option, just like real hardware.

However, every requirement we add to testing reduces the number of
times that developer will run the test on their system and potentially
catch a problem during development. Yes, CI helps, but "make check"
gives more coverage. More kernel variants tested in all downstream rpm
%check builds or developer systems. Just like kernel self tests help
today.

glibc uses namespaces in "make check" to increase the number of userspace
and kernel features we can test immediately and easily on developer
*or* distribution build systems.

So the natural extension is to further isolate the testing namespace
using the time namespace to test and verify y2038. If we can't use
namespaces then we'll have to move the tests out to the less
frequently run scripts we use for cross-target toolchain testing,
and so we'll see a 100x drop in coverage.

I expect that more requests for further time isolation will happen
given the utility of this in containers.

If we have to use qemu today then that's where we're at, but again
I expect our use case is representative of more than just glibc.

Does checkpointing work robustly when userspace APIS use 
CLOCK_REALTIME (directly or indirectly) in the container?

-- 
Cheers,
Carlos.

Re: [PATCH v4 0/2] Syscall User Redirection

2020-07-16 Thread Carlos O'Donell

On 7/16/20 4:30 PM, Gabriel Krisman Bertazi wrote:
> Christian Brauner  writes:
> 
>> On Thu, Jul 16, 2020 at 01:25:43PM -0700, Kees Cook wrote:
>>> On Thu, Jul 16, 2020 at 10:22:34PM +0200, Christian Brauner wrote:
 On Thu, Jul 16, 2020 at 01:04:38PM -0700, Kees Cook wrote:
> On Thu, Jul 16, 2020 at 03:31:39PM -0400, Gabriel Krisman Bertazi wrote:
>> This is v4 of Syscall User Redirection.  The implementation itself is
>> not modified from v3, it only applies the latest round of reviews to the
>> selftests.
>>
>> __NR_syscalls is not really exported in header files other than
>> asm-generic for every architecture, so it felt safer to optionally
>> expose it with a fallback to a high value.
>>
>> Also, I didn't expose tests for PR_GET as that is not currently
>> implemented.  If possible, I'd have it supported by a future patchset,
>> since it is not immediately necessary to support this feature.
>
> Thanks! That all looks good to me.

 Don't have any problem with this but did this ever get exposure on
 linux-api? This is the first time I see this pop up.
>>>
>>> I thought I'd added it to CC in the past, but that might have been other
>>> recent unrelated threads. Does this need a full repost there too, you
>>> think?
>>
>> Nah, wasn't my intention to force a repost. Seems that several people
>> have looked this over. :) Just curious why it didn't get to linux-api
>> and we know quite some people who only do look at linux-api (for sanity). :)
> 
> That's my mistake.  I didn't think about it when submitting :(
> 
> If this get re-spinned again I will make sure to CC linux-api.

Thank you! It helps C library implementors stay up to date and comment
on changes that impact userspace ABIs and APIs. This patch set was new
to me. Interesting new feature.

-- 
Cheers,
Carlos.

Re: [RFC PATCH 0/4] rseq: Introduce extensible struct rseq

2020-07-16 Thread Carlos O'Donell

On 7/15/20 9:02 AM, Mathieu Desnoyers wrote:
> At this point, the main question I would like answered is whether
> it would be acceptable to increase the size and alignment of
> the __rseq_abi symbol (which will be exposed by glibc) between
> e.g. glibc 2.32 and 2.33. If it's not possible, then we can
> find other solutions, for instance using an indirection with
> a pointer to an extended structure, but this appears to be
> slightly less efficient.

The answer is always a soft "maybe" because it depends exactly
on how we do it and what consequences we are willing to accept
in the design.

For example, static applications that call dlopen will fail if
we increase the alignment beyond 32 because we had to special
case this scenario. Why did we have to special case it? Because
the "static" part of the runtime needs to create the initial
thread's static TLS space, and since it doesn't know apriori
what will be loaded in the shared library, it needs to make a
"best guess" at the alignment requirement at startup.
We need to discuss this and agree that it's OK. We already want
to deprecate dynamic loading from static applications, so this
may not be a problem in general, but I hope you see my point.
That there are corner cases to be considered and ironed out.

I want to see a detailed design document explaining the various
compatibility issues and how we solve them along with the way
the extension mechanism would work and how it would be compliant
with C/C++ language rules in userspace without adding undue burden
of potentially having to use atomic instructions all the time.
This includes discussing how the headers change. We should also
talk out the options for symbol versioning and their consequences.

I haven't seen enough details, and there isn't really enough
time to discuss this. I think it is *great* that we are discussing
it, but it's safest if we revert rseq, finish the discussion,
and then finalize the inclusion for 2.33 with these details
ironed out.

I feel like we've made all the technical process we need to actually
include rseq in glibc, but this discussion, and the google example
(even if it doesn't match our use case) shows that if we spend another
month hammering out the extension details could yield something we
can use for years to come while we work out other details e.g. cpu_opv.

I can set aside time in the next month to write up such a document
and discuss these issues with you and Florian. The text would form
even more of the language we'd have to include in the man page for
the feature.

In the meantime I think we should revert rseq in glibc and take
our time to hash this out without the looming deadline of August 1st
for the ABI going out the door.

I know this is disappointing, but I think in a month you'll look
back at this, we'll have Fedora Rawhide using the new extensible
version (and you'll be able to point people at that), and we'll
only be 5 months away from an official release with extensible
rseq.

Could you please respond to Florian's request to revert here?
https://sourceware.org/pipermail/libc-alpha/2020-July/116368.html

I'm looking for a Signed-off-by from you that you're OK with
reverting.

-- 
Cheers,
Carlos.

Re: [RFC PATCH 2/4] rseq: Allow extending struct rseq

2020-07-14 Thread Carlos O'Donell

On 7/14/20 9:19 AM, Mathieu Desnoyers wrote:
> Is there an arch-agnostic way to get the thread pointer from user-space code 
> ? That
> would be needed by all rseq critical section implementations.

Yes, and no. We have void *__builtin_thread_pointer (void), but
few architectures implement the builtin so we'd have to go through
a round of compiler updates and backports. All targets know how to
access the thread pointer because the compiler has to generate
IE-mode accesses to the TLS variables.

I have filed an enhancement request:
Bug 96200 - Implement __builtin_thread_pointer() and 
__builtin_set_thread_pointer() if TLS is supported
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96200

We have glibc internal macro APIs to access the thread pointer,
but I would rather the compiler handle the access since it can
schedule the resulting sequence better.

On some arches setting the therad pointer needs a syscall or
equivalent operation (hppa), and for some arches there is no
fixed register (arm) hence the need for __builtin_thread_pointer()
to force the compiler to place the pointer into a register for
function return.

-- 
Cheers,
Carlos.

Re: [RFC PATCH 0/4] rseq: Introduce extensible struct rseq

2020-07-14 Thread Carlos O'Donell

On 7/13/20 11:03 PM, Mathieu Desnoyers wrote:
> Recent discussion led to a solution for extending struct rseq. This is
> an implementation of the proposed solution.
> 
> Now is a good time to agree on this scheme before the release of glibc
> 2.32, just in case there are small details to fix on the user-space
> side in order to allow extending struct rseq.

Adding extensibility to the rseq registration process would be great,
but we are out of time for the glibc 2.32 release.

Should we revert rseq for glibc 2.32 and spend quality time discussing
the implications of an extensible design, something that Google already
says they are doing?

We can, with a clear head, and an agreed upon extension mechanism
include rseq in glibc 2.33 (release scheduled for Feburary 1st 2021).
We release time boxed every 6 months, no deviation, so you know when
your next merge window will be.

We have already done the hard work of fixing the nesting signal
handler issues, and glibc integration. If we revert today that will 
also give time for Firefox and Chrome to adjust their sandboxes.

Do you wish to go forward with rseq as we have it in glibc 2.32,
or do you wish to revert rseq from glibc 2.32, discuss the extension
mechanism, and put it back into glibc 2.33 with adjustments?

-- 
Cheers,
Carlos.

Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

2020-07-07 Thread Carlos O'Donell

On 7/7/20 8:06 AM, Mathieu Desnoyers wrote:
> - On Jul 7, 2020, at 7:32 AM, Florian Weimer f...@deneb.enyo.de wrote:
> 
>> * Mathieu Desnoyers:
>>
>>> Those are very good points. One possibility we have would be to let
>>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
>>> flag. On kernels with the bug present, the cpu_id field is still good
>>> enough for typical uses of sched_getcpu() which does not appear to
>>> have a very strict correctness requirement on returning the right
>>> cpu number.
>>>
>>> Then libraries and applications which require a reliable cpu_id
>>> field could check this on their own by calling rseq with the
>>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
>>> complex in __rseq_abi, and let each rseq user decide about its own
>>> fate: whether it uses rseq or keeps using an rseq-free fallback.
>>>
>>> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
>>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
>>> glibc, and want to check this flag on thread registration.
>>
>> Well, you could add a bug fix level field to the __rseq_abi variable.
> 
> Even though I initially planned to make the struct rseq_abi extensible,
> the __rseq_abi variable ends up being fix-sized, so we need to be very
> careful in choosing what we place in the remaining 12 bytes of padding.
> I suspect we'd want to keep 8 bytes to express a pointer to an
> "extended" structure.
> 
> I wonder if a bug fix level "version" is the right approach. We could
> instead have a bitmask of fixes, which the application could independently
> check. For instance, some applications may care about cpu_id field
> reliability, and others not.

I agree with Florian.

Developers are not interested in a bitmask of fixes.

They want a version they can check and that at a given version everything
works as expected.

In reality today this is the kernel version.

So rseq is broken from a developer perspective until they can get a new
kernel or an agreement from their downstream vendor that revision Z of
the kernel they are using has the fix you've just committed.

Essentially this problem solves itself at level higher than the interfaces
we're talking about today.

Encoding this as a bitmask of fixes is an overengineered solution for a
problem that the downstream communities already know how to solve.

I would strongly suggest a "version" or nothing.

>> Then applications could check if the kernel has the appropriate level
>> of non-buggyness.  But the same thing could be useful for many other
>> kernel interfaces, and I haven't seen such a fix level value for them.
>> What makes rseq so special?
> 
> I guess my only answer is because I care as a user of the system call, and
> what is a system call without users ? I have real applications which work
> today with end users deploying them on various kernels, old and new, and I
> want to take advantage of this system call to speed them up. However, if I
> have to choose between speed and correctness (in other words, not crashing
> a critical application), I will choose correctness. So if I cannot detect
> that I can safely use the system call, it becomes pretty much useless even
> for my own use-cases.

Yes.

In the case of RHEL we have *tons*  of users in the same predicament.

They just wait until the RHEL kernel team releases a fixed kernel version
and check for that version and deploy with that version.

Likewise every other user of a kernel. They solve it by asking their
kernel provider, internal or external, to verify the fix is applied to
the deployment kernel.

If they are an ISV they have to test and deploy similar strategies for
multiple kernel version.

By trying to automate this you are encoding into the API some level of
package management concepts e.g. patch level, revision level, etc.

This is difficult to do without a more generalized API. Why do it just
for rseq? Why do it with the few bits you have?

It's not a great fit IMO. Just let the kernel version be the arbiter of
correctness.

>> It won't help with the present bug, but maybe we should add an rseq
>> API sub-call that blocks future rseq registration for the thread.
>> Then we can add a glibc tunable that flips off rseq reliably if people
>> do not want to use it for some reason (and switch to the non-rseq
>> fallback code instead).  But that's going to help with future bugs
>> only.
> 
> I don't think it's needed. All I really need is to have _some_ way to
> let lttng-ust or liburcu query whether the cpu_id field is reliable. This
> state does not have to be made quickly accessible to other libraries,
> nor does it have to be shared between libraries. It would allow each
> user library or application to make its own mind on whether it can use
> rseq or not.

That check is "kernel version > x.y.z" :-)

or

"My kernel vendor told me it's fixed."

-- 
Cheers,
Carlos.

Re: [PATCH 1/3] glibc: Perform rseq registration at C startup and thread creation (v22)

2020-07-03 Thread Carlos O'Donell

On 7/2/20 10:46 AM, Florian Weimer wrote:
> * Mathieu Desnoyers via Libc-alpha:
> 
>> Register rseq TLS for each thread (including main), and unregister for
>> each thread (excluding main).  "rseq" stands for Restartable Sequences.
>>
>> See the rseq(2) man page proposed here:
>>   https://lkml.org/lkml/2018/9/19/647
>>
>> Those are based on glibc master branch commit 3ee1e0ec5c.
>> The rseq system call was merged into Linux 4.18.
>>
>> The TLS_STATIC_SURPLUS define is increased to leave additional room for
>> dlopen'd initial-exec TLS, which keeps elf/tst-auditmany working.
>>
>> The increase (76 bytes) is larger than 32 bytes because it has not been
>> increased in quite a while.  The cost in terms of additional TLS storage
>> is quite significant, but it will also obscure some initial-exec-related
>> dlopen failures.
> 
> We need another change to get this working on most non-x86
> architectures:
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 817bcbbf59..ca13778ca9 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -134,6 +134,12 @@ void
>  _dl_determine_tlsoffset (void)
>  {
>size_t max_align = TLS_TCB_ALIGN;
> +  /* libc.so with rseq has TLS with 32-byte alignment.  Since TLS is
> + initialized before audit modules are loaded and slotinfo
> + information is available, this is not taken into account below in
> + the audit case.  */
> +  max_align = MAX (max_align, 32U);
> +
>size_t freetop = 0;
>size_t freebottom = 0;
> 
> This isn't visible on x86-64 because TLS_TCB_ALIGN is already 64 there.
> 
> I plan to re-test with this fix and push the series.
> 
> Carlos, is it okay if I fold in the dl-tls.c change if testing looks
> good?

I have reviewed the above and I think it is the correct *pragmatic* fix.

The reality is that to fix this fully you must use a two stage loading
process to pre-examine all audit modules *before* setting the fundamental
alignment of the TCB.  This isn't easy with the current loader framework.
Therefore the above is a good pragmatic solution.

There is always going to be a bit of a chicken and an egg situation.
We want to provide a fundamental alignment requirement but we haven't
yet seen all the requirements on alignment. So the best we could do is
look over DT_NEEDED, DT_AUDIT, LD_PRELOAD, etc. get the best answer
and then fail any subsequent dlopen's that load objects with higher
fundamental requirements for alignment of the TCB.

The audit modules are problematic becuase they are loaded *before*
anything else is loaded, *before* we've examined any of the actual
objects we're about to load because they can influence the search
paths. Again, this means the above solution is a perfect pragmatic
choice. The real solution is to rearchitect the early audit module
loading into two stages and that's work we can do later.

OK with the above change.

-- 
Cheers,
Carlos.

Re: [PATCH] futex: send SIGBUS if argument is not aligned on a four-byte boundary

2020-05-19 Thread Carlos O'Donell

On Fri, May 15, 2020 at 1:50 PM Carlos O'Donell  wrote:
> This isn't fixed because this is the older code in pthread_mutex_lock
> which we haven't ported to futex-internal.h yet, otherwise we would abort
> the process.

I filed this upstream as a QoI issue so I don't forget to port the existing code
to the newer internal interfaces for futex handling.

"Bug 25997 - pthread_mutex_lock QoI issue for unaligned futex."
https://sourceware.org/bugzilla/show_bug.cgi?id=25997

I checked that -Wcast-align=strict does warn about this case, but it's
rarely used
in production code that I've worked with. I'm following up with the
compiler people
to see if we can consistently warn in these cases and so avoid this kind of code
existing in the first place.

Cheers,
Carlos.

Re: [PATCH] futex: send SIGBUS if argument is not aligned on a four-byte boundary

2020-05-15 Thread Carlos O'Donell

On 5/15/20 12:27 PM, Peter Zijlstra wrote:
> On Fri, May 15, 2020 at 06:36:47PM +0300, Konstantin Khlebnikov wrote:
>> Userspace implementations of mutexes (including glibc) in some cases
>> retries operation without checking error code from syscall futex.
>> This is good for performance because most errors are impossible when
>> locking code trusts itself.

In newer versions of glibc, which won't solve this problem for older
distributions (or newer glibc on older kernels), we've refactored all
of this code into futex-internal.h and do things like this (example
from one of the generic internal interfaces for futex use):

149 case -ETIMEDOUT: /* Cannot have happened as we provided no timeout.  */
150 case -EFAULT: /* Must have been caused by a glibc or application bug.  
*/
151 case -EINVAL: /* Either due to wrong alignment or due to the timeout not
152  being normalized.  Must have been caused by a glibc or
153  application bug.  */
154 case -ENOSYS: /* Must have been caused by a glibc bug.  */
155 /* No other errors are documented at this time.  */
156 default:
157   futex_fatal_error ();
158 }

Several of the pthread interfaces are using this code so they won't suffer
from "stuck EINVAL loops" like below.

We worked with all the interested parties to get `man 2 futex` updated
with the expected semantics and error return codes.

We don't want to ignore what the kernel is returning and we terminate
the process without propagating that error upwards for the simple 
API cases.

Likewise note the "default:" which means if we get new futex error that
is not documented we also terminate the process.

>> Some errors which could came from outer code are handled automatically,
>> for example invalid address triggers SIGSEGV on atomic fast path.
>>
>> But one case turns into nasty busy-loop: when address is unaligned.
>> futex(FUTEX_WAIT) returns EINVAL immediately and loop goes to retry.
>>
>> Example which loops inside second call rather than hung peacefully:
>>
>> #include 
>> #include 
>>
>> int main(int argc, char **argv)
>> {
>>  char buf[sizeof(pthread_mutex_t) + 1];
>>  pthread_mutex_t *mutex = (pthread_mutex_t *)(buf + 1);
>>
>>  pthread_mutex_init(mutex, NULL);
>>  pthread_mutex_lock(mutex);
>>  pthread_mutex_lock(mutex);
>> }

This isn't fixed because this is the older code in pthread_mutex_lock
which we haven't ported to futex-internal.h yet, otherwise we would abort
the process.

A quick change to use the newer interface (futex_wait_simple), and the
example above fails as expected:

./test
The futex facility returned an unexpected error code.
Aborted (core dumped)

And it does not loop. I'm open to bikeshed on the existing error message
(which has been there since 2014 / commit a2f0363f817).

coredumpctl debug loop-futex/test

Core was generated by `./test'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
49return ret;
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x7f1cac0d2872 in __GI_abort () at abort.c:79
#2  0x7f1cac12a248 in __libc_message (action=action@entry=do_abort, 
fmt=fmt@entry=0x7f1cac234a57 "%s")
at ../sysdeps/posix/libc_fatal.c:155
#3  0x7f1cac12a27a in __GI___libc_fatal (
message=message@entry=0x7f1cac288000 "The futex facility returned an 
unexpected error code.\n")
at ../sysdeps/posix/libc_fatal.c:164
#4  0x7f1cac283fdc in futex_fatal_error () at 
../sysdeps/nptl/futex-internal.h:157
#5  futex_wait (private=, expected=2, futex_word=0x7f1cac283fdc 
<__lll_lock_wait+92>)
at ../sysdeps/nptl/futex-internal.h:157
#6  futex_wait_simple (private=, expected=2, 
futex_word=0x7f1cac283fdc <__lll_lock_wait+92>)
at ../sysdeps/nptl/futex-internal.h:172
#7  __lll_lock_wait (futex=futex@entry=0x7ffdb1f0a2c1, private=) 
at lowlevellock.c:53
#8  0x7f1cac27cbf3 in __GI___pthread_mutex_lock (mutex=0x7ffdb1f0a2c1) at 
../nptl/pthread_mutex_lock.c:80
#9  0x0040117a in main (argc=1, argv=0x7ffdb1f0a3f8) at test.c:11

So semantically the kernel change makes sense, and will terminate the
process for glibc today, and matches what the refactored glibc code
will do in userspace for more of the interfaces in the future.

>> It seems there is no practical usage for calling syscall futex for
>> unaligned address. This may be only bug in user space. Let's help
>> and handle this gracefully without adding extra code on fast path.

The only use case I could see is retroactively adding a futex to the
existing ABI of a structure and wanting to avoid padding. That does
not seem like a common enough use case that we would want to support.
To get efficient cache-line usage you'll want to pack things by hand.

>> This patch sends SIGBUS signal to slay task and break busy-loop.
>>
>> Signed-off-by: Konstantin Khlebnikov 
>> Reported-by: Maxim Samoylov 
> 
> Seems

Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)

2019-09-11 Thread Carlos O'Donell

On 9/11/19 3:08 PM, Florian Weimer wrote:
> * Carlos O'Donell:
> 
>> It would be easier to merge the patch set if it were just an unconditional
>> registration like we do for set_robust_list().
> 
> Note that this depends on the in-tree system call numbers list, which I
> still need to finish according to Joseph's specifications.

Which work is this? Do you have a URL reference to WIP?

> (We have something that should work for us as long as we can get large
> machines from the lab, but I agree that it's not very useful if glibc
> bot-cycle time is roughly one business day.)

Yeah, we have to discuss how to accelerate this.

-- 
Cheers,
Carlos.

Re: [PATCH glibc 2.31 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v12)

2019-09-11 Thread Carlos O'Donell

On 9/11/19 2:26 PM, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> +#ifdef SHARED
>> +  if (rtld_active ())
>> +{
>> +  /* Register rseq ABI to the kernel.   */
>> +  (void) rseq_register_current_thread ();
>> +}
>> +#else
> 
> I think this will need *another* check for the inner libc in an audit
> module.  See what we do in malloc.  __libc_multiple_libcs is supposed to
> cover that, but it's unfortunately not reliable.
> 
> I believe without that additional check, the first audit module we load
> will claim rseq, and the main program will not have control over the
> rseq area.  Reversed roles would be desirable here.
> 
> In October, I hope to fix __libc_multiple_libcs, and then you can use
> just that.  (We have a Fedora patch that is supposed to fix it, I need
> to documen the mechanism and upstream it.)

This is a technical issue we can resolve.

>> +/* Advertise Restartable Sequences registration ownership across
>> +   application and shared libraries.
>> +
>> +   Libraries and applications must check whether this variable is zero or
>> +   non-zero if they wish to perform rseq registration on their own. If it
>> +   is zero, it means restartable sequence registration is not handled, and
>> +   the library or application is free to perform rseq registration. In
>> +   that case, the library or application is taking ownership of rseq
>> +   registration, and may set __rseq_handled to 1. It may then set it back
>> +   to 0 after it completes unregistering rseq.
>> +
>> +   If __rseq_handled is found to be non-zero, it means that another
>> +   library (or the application) is currently handling rseq registration.
>> +
>> +   Typical use of __rseq_handled is within library constructors and
>> +   destructors, or at program startup.  */
>> +
>> +int __rseq_handled;
> 
> Are there any programs that use __rseq_handled *today*?
> 
> I'm less convinced that we actually need this.  I don't think we have
> ever done anything like that before, and I don't think it's necessary.
> Any secondary rseq library just needs to note if it could perform
> registration, and if it failed to do so, do not perform unregistration
> in a pthread destructor callback.
> 
> Sure, there's the matter of pthread destructor ordering, but that
> problem is not different from any other singleton (thread-local or not),
> and the fix for the last time this has come up (TLS destructors vs
> dlclose) was to upgrade glibc.

This is a braoder issue.

Mathieu,

It would be easier to merge the patch set if it were just an unconditional
registration like we do for set_robust_list().

What's your thought there?

-- 
Cheers,
Carlos.

Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)

2019-06-10 Thread Carlos O'Donell

On 6/6/19 7:57 AM, Florian Weimer wrote:
> Let me ask the key question again: Does it matter if code observes the
> rseq area first without kernel support, and then with kernel support?
> If we don't expect any problems immediately, we do not need to worry
> much about the constructor ordering right now.  I expect that over time,
> fixing this properly will become easier.

I just wanted to chime in and say that splitting this into:

* Ownership (__rseq_handled)

* Initialization (__rseq_abi)

Makes sense to me.

I agree we need an answer to this question of ownership but not yet
initialized, to owned and initialized.

I like the idea of having __rseq_handled in ld.so.

-- 
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-09 Thread Carlos O'Donell


On 4/9/19 9:58 AM, Tulio Magno Quites Machado Filho wrote:

Alan Modra  writes:

Yes, looks fine to me, except that in VLE mode (do we care?)
".long 0x0fe50553" disassembles as
0:  0f e5   se_cmphl r5,r30
2:  05 53   se_mullw r3,r5
No illegal/trap/privileged insn there.

".long 0x0fe5000b" might be better to cover VLE.


Looks good for me too.


The requirement that it be a valid instruction is simply to aid in the
disassembly of rseq regions which may be hand written assembly with a
thin veneer of CFI/DWARF information.

It has already been pointed out that POWER uses data in the instruction
stream for jump tables to implement switch statements, but that specific
use has compiler support and one presumes good debug information. So as
Alan says, there is already data in the insn stream, though such things
can't be good for performance (pollutes D-cache, problematic for
speculative execution).


Actually, it better fits what Carlos O'Donnell had requested:


I think the order of preference is:

1.  An uncommon insn (with random immediate values), in a literal pool, that is
  not a useful ROP/JOP sequence (very uncommon)
2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
2b. A NOP to avoid affecting speculative execution (maybe uncommon)

With 2a/2b being roughly equivalent depending on speculative execution policy.


Yes, though "in a literal pool" is something that is not required, since
users might not want literal pools and so we shouldn't require that
feature (it also pollutes D-cache).

Keep in mind the insn will never execute.

If a trap insn calls out the nature of the signature more clearly then
use that instead.

--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-08 Thread Carlos O'Donell


On 4/8/19 3:20 PM, Tulio Magno Quites Machado Filho wrote:

Carlos O'Donell  writes:


On 4/5/19 5:16 AM, Florian Weimer wrote:

* Carlos O'Donell:

It is valuable that it be a trap, particularly for constant pools because
it means that a jump into the constant pool will trap.


Sorry, I don't understand why this matters in this context.  Would you
please elaborate?


Sorry, I wasn't very clear.

My point is only that any accidental jumps, either with off-by-one (like you
fixed in gcc/glibc's signal unwinding most recently), result in a process fault
rather than executing RSEQ_SIG as a valid instruction *and then* continuing
onwards to the handler.

A process fault is achieved either by a trap, or an invalid instruction, or
a privileged insn (like suggested for MIPS in this thread).


In that case, mtmsr (Move to Machine State Register) seems a good candidate.

mtmsr is available both on 32 and 64 bits since their first implementations.

It's a privileged instruction and should never appear in userspace
code (causes SIGILL).

Any comments?
 
That seems good to me.


Mathieu,

What's required to move this forward for POWER?

--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-05 Thread Carlos O'Donell


On 4/5/19 5:16 AM, Florian Weimer wrote:

* Carlos O'Donell:

It is valuable that it be a trap, particularly for constant pools because
it means that a jump into the constant pool will trap.


Sorry, I don't understand why this matters in this context.  Would you
please elaborate?


Sorry, I wasn't very clear.

My point is only that any accidental jumps, either with off-by-one (like you
fixed in gcc/glibc's signal unwinding most recently), result in a process fault
rather than executing RSEQ_SIG as a valid instruction *and then* continuing
onwards to the handler.

A process fault is achieved either by a trap, or an invalid instruction, or
a privileged insn (like suggested for MIPS in this thread).

--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-04 Thread Carlos O'Donell


On 3/25/19 11:54 AM, Mathieu Desnoyers wrote:

Hi Carlos,

- On Mar 22, 2019, at 4:09 PM, Carlos O'Donell codon...@redhat.com wrote:

[...]

I took care of all your comments for an upcoming round of patches, except the
following that remain open (see answer inline). I'm adding Linux maintainers
for ARM, aarch64, MIPS, powerpc, s390, x86 in CC to discuss the choice of
code signature prior to the abort handler for each of those architectures.


Thank you for kicking off this conversation.

Every architecture should have a reasonable RSEQ_SIG that applies to their
ISA with a comment about why that instruction was chosen. It should be a
conscious choice, without a default.


* Support for automatically registering threads with the Linux rseq(2)
   system call has been added.  This system call is implemented starting
   from Linux 4.18.  The Restartable Sequences ABI accelerates user-space
   operations on per-cpu data.  It allows user-space to perform updates
   on per-cpu data without requiring heavy-weight atomic operations. See
   https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
   for further explanation.

   In order to be activated, it requires that glibc is built against
   kernel headers that include this system call, and that glibc detects
   availability of that system call at runtime.


Suggest:

* Support for automatically registering threads with the Linux rseq(2)
  system call has been added.  This system call is implemented starting
  from Linux 4.18.  The Restartable Sequences ABI accelerates user-space
  operations on per-cpu data.  It allows user-space to perform updates
  on per-cpu data without requiring heavy-weight atomic operations.
  Automatically registering threads allows all libraries, including libc,
  to make immediate use of the rseq(2) support by using the documented ABI.
  See 'man 2 rseq' for the details of the ABI shared between libc and the
  kernel.



For reference the assembly code I'm pointing at below can be found
in the Linux selftests under:

tools/testing/selftests/rseq/rseq-*.h


OK.



+++ b/sysdeps/unix/sysv/linux/arm/bits/rseq.h

[...]

+
+/* Signature required before each abort handler code.  */
+#define RSEQ_SIG 0x53053053


Why isn't this an arm specific op code? Does the user have to mark this
up as part of a constant pool when placing it in front of the abort handler
to avoid disassembling the constant as code? This was at one point required
to get gdb to work properly.



For arm, the abort is defined as:

#define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown,   \
 abort_label, version, flags,\
 start_ip, post_commit_offset, abort_ip) \
 ".balign 32\n\t"\
 __rseq_str(table_label) ":\n\t" \
 ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
 ".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 
0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
 ".word " __rseq_str(RSEQ_SIG) "\n\t"\
 __rseq_str(label) ":\n\t"   \
 teardown\
 "b %l[" __rseq_str(abort_label) "]\n\t"

Which contains a copy of the struct rseq_cs for that critical section
close to the actual critical section, within the code, followed by the
signature. The reason why we have a copy of the struct rseq_cs there is
to speed up entry into the critical section by using the "adr" instruction
to compute the address to store into __rseq_cs->rseq_cs.

AFAIU, a literal pool on ARM is defined as something which is always
jumped over (never executed), which is the case here. We always have
an unconditional branch instruction ("b") skipping over each
RSEQ_ASM_DEFINE_ABORT().

Therefore, given that the signature is part of a literal pool on ARM,
it can be any value we choose and should not need to be an actual valid
instruction.

aarch64 defines the abort as:

#define RSEQ_ASM_DEFINE_ABORT(label, abort_label)   
\
 "   b   222f\n"
 \
 "   .inst   "   __rseq_str(RSEQ_SIG) "\n"  
 \
 __rseq_str(label) ":\n"
 \
 "   b   %l[" __rseq_str(abort_label) "]\n" 
 \
 "222:\n"

Where the signature actually maps to a valid instruction. Considering that
aarch64 also have literal pools, and we branch over the signature, I wonder
why it's so importan

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-04 Thread Carlos O'Donell


On 4/2/19 3:08 AM, Florian Weimer wrote:

* Michael Ellerman:


I'm a bit vague on what we're trying to do here.

But it seems like you want some sort of "eye catcher" prior to the branch?

That value is a valid instruction on current CPUs (rlwimi.
r5,r24,6,1,9), and even if it wasn't it could become one in future.

If you change it to 0x8053530 that is both a valid instruction and is a
nop (conditional trap immediate but with no conditions set).


I think we need something that is very unlikely to appear in the
instruction stream.  It's just a marker.  The instruction will never be
executed, and it does not have to be a trap, either (I believe that a
standard trap instruction would be a bad choice).


I assume you want to avoid a standard trap instruction because it would
be common, and so not meet the intent of the RSEQ_SIG choice as being something
that is *uncommon* right?

It is valuable that it be a trap, particularly for constant pools because
it means that a jump into the constant pool will trap.

--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-04-04 Thread Carlos O'Donell


On 4/2/19 2:02 AM, Michael Ellerman wrote:

Mathieu Desnoyers  writes:

Hi Carlos,

- On Mar 22, 2019, at 4:09 PM, Carlos O'Donell codon...@redhat.com wrote:

...


[...]

+++ b/sysdeps/unix/sysv/linux/powerpc/bits/rseq.h

[...]

+/* Signature required before each abort handler code.  */
+#define RSEQ_SIG 0x53053053


Why isn't this an opcode specific to power?


On powerpc 32/64, the abort is placed in a __rseq_failure executable section:

#define RSEQ_ASM_DEFINE_ABORT(label, abort_label)   
\
 ".pushsection __rseq_failure, \"ax\"\n\t"  
 \
 ".long " __rseq_str(RSEQ_SIG) "\n\t"   
 \
 __rseq_str(label) ":\n\t"  
 \
 "b %l[" __rseq_str(abort_label) "]\n\t"
 \
 ".popsection\n\t"

That section only contains snippets of those trampolines. Arguably, it would be
good if disassemblers could find valid instructions there. Boqun Feng could 
perhaps
shed some light on this signature choice ? Now would be a good time to decide
once and for all whether a valid instruction would be a better choice.


I'm a bit vague on what we're trying to do here.

But it seems like you want some sort of "eye catcher" prior to the branch?

That value is a valid instruction on current CPUs (rlwimi.
r5,r24,6,1,9), and even if it wasn't it could become one in future.

If you change it to 0x8053530 that is both a valid instruction and is a
nop (conditional trap immediate but with no conditions set).


The s390 IBM team needs to respond to this and I want to make sure they ACK
the NOP being used here because it impacts them directly.

I'd like to see Martin's opinion on this.

--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-03-27 Thread Carlos O'Donell


On 3/27/19 5:16 AM, Martin Schwidefsky wrote:

On Mon, 25 Mar 2019 11:54:32 -0400 (EDT)
Mathieu Desnoyers  wrote:


+++ b/sysdeps/unix/sysv/linux/s390/bits/rseq.h

[...]

+
+/* Signature required before each abort handler code.  */
+#define RSEQ_SIG 0x53053053


Why not a s390 specific value here?


s390 also has the abort handler in a __rseq_failure section:

#define RSEQ_ASM_DEFINE_ABORT(label, teardown, abort_label) \
 ".pushsection __rseq_failure, \"ax\"\n\t"   \
 ".long " __rseq_str(RSEQ_SIG) "\n\t"\
 __rseq_str(label) ":\n\t"   \
 teardown\
 "j %l[" __rseq_str(abort_label) "]\n\t" \
 ".popsection\n\t"

Same question applies as powerpc: since disassemblers will try to decode
that instruction, would it be better to define it as a valid one ?

[...]


A 4-byte sequence starting with 0x53 is decoded as a "diebr" instruction.
And please replace that "j %l[...]" with a "jg %l[...]", the branch target
range of the "j" instruction is 64K, not enough for the general case.


Why was this particular operated selected?
 
So on s390 the RSEQ_SIG will show up as an unexpected "divide to integer"

instruction that can't be reached by any control flow?

Can we use a NOP with a unique value in an immediate operand?

The goal being to have something that won't confuse during a debug
session, or that the debugger can ignore (like constant pools on Arm)

--
Cheers,
Carlos.

Re: [PATCH 2/4] glibc: sched_getcpu(): use rseq cpu_id TLS on Linux

2019-03-22 Thread Carlos O'Donell


On 2/12/19 2:42 PM, Mathieu Desnoyers wrote:

When available, use the cpu_id field from __rseq_abi on Linux to
implement sched_getcpu(). Fall-back on the vgetcpu vDSO if unavailable.

Benchmarks:

x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading


This patch looks good to me for master, but is blocked on patch 1/4
being reworked.

Reviewed-by: Carlos O'Donell 


glibc sched_getcpu(): 13.7 ns (baseline)
glibc sched_getcpu() using rseq:   2.5 ns (speedup:  5.5x)
inline load cpuid from __rseq_abi TLS: 0.8 ns (speedup: 17.1x)

Signed-off-by: Mathieu Desnoyers 
CC: Carlos O'Donell 
CC: Florian Weimer 
CC: Joseph Myers 
CC: Szabolcs Nagy 
CC: Thomas Gleixner 
CC: Ben Maurer 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: Boqun Feng 
CC: Will Deacon 
CC: Dave Watson 
CC: Paul Turner 
CC: libc-al...@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-...@vger.kernel.org
---
  sysdeps/unix/sysv/linux/sched_getcpu.c | 25 +++--
  1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/sysdeps/unix/sysv/linux/sched_getcpu.c 
b/sysdeps/unix/sysv/linux/sched_getcpu.c
index fb0d317f83..8bfb03778b 100644
--- a/sysdeps/unix/sysv/linux/sched_getcpu.c
+++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
@@ -24,8 +24,8 @@
  #endif
  #include 
  
-int

-sched_getcpu (void)
+static int
+vsyscall_sched_getcpu (void)


OK.


  {
  #ifdef __NR_getcpu
unsigned int cpu;
@@ -37,3 +37,24 @@ sched_getcpu (void)
return -1;
  #endif
  }
+
+#ifdef __NR_rseq
+#include 
+
+extern __attribute__ ((tls_model ("initial-exec")))
+__thread volatile struct rseq __rseq_abi;


OK.


+
+int
+sched_getcpu (void)
+{
+  int cpu_id = __rseq_abi.cpu_id;
+
+  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();


OK. Impressive :-)


+}
+#else
+int
+sched_getcpu (void)
+{
+  return vsyscall_sched_getcpu ();


OK.


+}
+#endif




--
Cheers,
Carlos.

Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)

2019-03-22 Thread Carlos O'Donell


On 2/12/19 2:42 PM, Mathieu Desnoyers wrote:

Register rseq(2) TLS for each thread (including main), and unregister
for each thread (excluding main). "rseq" stands for Restartable
Sequences.


Thanks, the implementation is looking good, before this goes in it needs
some procedural cleanup and the following:

(a) I would like to see a final position from Florian, either accept,
object (objection recorded), or sustained objection (blocks this patch),
on the issue of the registration protocol.

I suggested a possible way forward to remove the registration protocol
down the line, leaving only the __rseq_lib_abi as the residual artifact
once we deprecate coordination with other libraries. Applications would
have to be written with this in mind, and coordination still remains
with libc for the time being. Perhaps Mathieu has another suggestion.

(b) I would like to see a final position from Rich Felker, either accept,
object (objection recorded), or sustained objection (blocks this patch),
on the patch set as a whole. I trust Rich's opinion as an alternative
libc maintainer in this matter and as a independent implementor of
similar standards. Rich will have to implement this for musl, and it
would be good to have consensus from him.

So while I reviewed this patch, and provided nit-picky feedback below,
I think we need a whole new thread, to finalize the way forward for the
registration process.

Next steps:

- Start a new thread to talk *only* about options for removing the
  refcounted registration.


See the rseq(2) man page proposed here:
   https://lkml.org/lkml/2018/9/19/647


Thank you for having a man page, that makes for a very easy to follow
description of the structures, syscall and use cases.



This patch is based on glibc-2.29. The rseq(2) system call was merged
into Linux 4.18.

Signed-off-by: Mathieu Desnoyers 
CC: Carlos O'Donell 
CC: Florian Weimer 
CC: Joseph Myers 
CC: Szabolcs Nagy 
CC: Thomas Gleixner 
CC: Ben Maurer 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: Boqun Feng 
CC: Will Deacon 
CC: Dave Watson 
CC: Paul Turner 
CC: Rich Felker 
CC: libc-al...@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-...@vger.kernel.org
---
Changes since v1:
- Move __rseq_refcount to an extra field at the end of __rseq_abi to
   eliminate one symbol.

   All libraries/programs which try to register rseq (glibc,
   early-adopter applications, early-adopter libraries) should use the
   rseq refcount. It becomes part of the ABI within a user-space
   process, but it's not part of the ABI shared with the kernel per se.

- Restructure how this code is organized so glibc keeps building on
   non-Linux targets.

- Use non-weak symbol for __rseq_abi.

- Move rseq registration/unregistration implementation into its own
   nptl/rseq.c compile unit.

- Move __rseq_abi symbol under GLIBC_2.29.

Changes since v2:
- Move __rseq_refcount to its own symbol, which is less ugly than
   trying to play tricks with the rseq uapi.
- Move __rseq_abi from nptl to csu (C start up), so it can be used
   across glibc, including memory allocator and sched_getcpu(). The
   __rseq_refcount symbol is kept in nptl, because there is no reason
   to use it elsewhere in glibc.

Changes since v3:
- Set __rseq_refcount TLS to 1 on register/set to 0 on unregister
   because glibc is the first/last user.
- Unconditionally register/unregister rseq at thread start/exit, because
   glibc is the first/last user.
- Add missing abilist items.
- Rebase on glibc master commit a502c5294.
- Add NEWS entry.

Changes since v4:
- Do not use "weak" symbols for __rseq_abi and __rseq_refcount. Based on
   "System V Application Binary Interface", weak only affects the link
   editor, not the dynamic linker.
- Install a new sys/rseq.h system header on Linux, which contains the
   RSEQ_SIG definition, __rseq_abi declaration and __rseq_refcount
   declaration. Move those definition/declarations from rseq-internal.h
   to the installed sys/rseq.h header.
- Considering that rseq is only available on Linux, move csu/rseq.c to
   sysdeps/unix/sysv/linux/rseq-sym.c.
- Move __rseq_refcount from nptl/rseq.c to
   sysdeps/unix/sysv/linux/rseq-sym.c, so it is only defined on Linux.
- Move both ABI definitions for __rseq_abi and __rseq_refcount to
   sysdeps/unix/sysv/linux/Versions, so they only appear on Linux.
- Document __rseq_abi and __rseq_refcount volatile.
- Document the RSEQ_SIG signature define.
- Move registration functions from rseq.c to rseq-internal.h static
   inline functions. Introduce empty stubs in misc/rseq-internal.h,
   which can be overridden by architecture code in
   sysdeps/unix/sysv/linux/rseq-internal.h.
- Rename __rseq_register_current_thread and __rseq_unregister_current_thread
   to rseq_register_current_thread and rseq_unregister_current_thread,
   now that those are only visible as internal static inline functions.
- Invoke rseq_reg

Re: Official Linux system wrapper library?

2018-12-10 Thread Carlos O'Donell

On 12/10/18 11:27 AM, Jonathan Corbet wrote:
> On Sat, 8 Dec 2018 20:38:56 -0800
> Randy Dunlap  wrote:
> 
>> On 11/12/18 8:08 AM, Jonathan Corbet wrote:
>>> On Sun, 11 Nov 2018 18:36:30 -0800
>>> Greg KH  wrote:
>>>   
 We should have a checklist.  That's a great idea.  Now to find someone
 to write it... :)  
>>>
>>> Do we think the LPC session might have the right people to create such a
>>> thing?  If so, I can try to put together a coherent presentation of the
>>> result.  
>>
>> Hi,
>> Did anything ever happen with this syscall checklist suggestion?
> 
> No, we really didn't have the right people around to do that,
> unfortunately.  

We already have Documentation/process/adding-syscalls.rst.

The documentation there is quite thorough.

It lists things that people commonly forget e.g. email 
linux-...@vger.kernel.org.

Would it be acceptable to attempt to collate per-libc information
into the adding-syscalls.rst under a new section called:

"Integration with libc"

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-15 Thread Carlos O'Donell

On 11/15/18 12:08 PM, Theodore Y. Ts'o wrote:
> On Thu, Nov 15, 2018 at 04:29:43PM +, Joseph Myers wrote:
>> On Thu, 15 Nov 2018, Theodore Y. Ts'o wrote:
>>
>>> That's great.  But is it or is it not true (either de jure or de
>>> facto) that "a single active glibc developer" can block a system call
>>> from being supported by glibc by objecting?  And if not, under what is
>>> the process by resolving a conflict?
>>
>> We use a consensus-building process as described at 
>> .
> 
> So can a single glibc developer can block Consensus?

Yes.

I think the comparison to the "liberum veto" is not a fair
comparison to the way the glibc community works :-)

(1) Community consensus.

Consensus need not imply unanimity.

Consensus is only from the set of important and concerned
interests. The community gets to decide that you're a troll
that does no real work, and can therefore ignore you.

Consensus is blocked only by sustained objection (not just
normal objections, which are recorded as part of the 
development process e.g. "I don't like it, but I leave it
up to you to decide").

Therefore an involved glibc developer can lodge a sustained
objection, and block consensus.

(2) The GNU package maintainers for glibc.

There are 8 GNU package maintainers for glibc.

The package maintainers created the consensus process to
empower the community, but they can act as a final 
review committee to move issues where there are two
reasonable but competing view points.

As Joseph points out we haven't ever used the GNU pakcage
maintainers to vote on a stuck issue, but I will arrange
it when the need arises. If you think we're at that point
with wrapper functions, just say so, but it doesn't seem
like it to me.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-15 Thread Carlos O'Donell

On 11/15/18 12:08 PM, Theodore Y. Ts'o wrote:
> On Thu, Nov 15, 2018 at 04:29:43PM +, Joseph Myers wrote:
>> On Thu, 15 Nov 2018, Theodore Y. Ts'o wrote:
>>
>>> That's great.  But is it or is it not true (either de jure or de
>>> facto) that "a single active glibc developer" can block a system call
>>> from being supported by glibc by objecting?  And if not, under what is
>>> the process by resolving a conflict?
>>
>> We use a consensus-building process as described at 
>> .
> 
> So can a single glibc developer can block Consensus?

Yes.

I think the comparison to the "liberum veto" is not a fair
comparison to the way the glibc community works :-)

(1) Community consensus.

Consensus need not imply unanimity.

Consensus is only from the set of important and concerned
interests. The community gets to decide that you're a troll
that does no real work, and can therefore ignore you.

Consensus is blocked only by sustained objection (not just
normal objections, which are recorded as part of the 
development process e.g. "I don't like it, but I leave it
up to you to decide").

Therefore an involved glibc developer can lodge a sustained
objection, and block consensus.

(2) The GNU package maintainers for glibc.

There are 8 GNU package maintainers for glibc.

The package maintainers created the consensus process to
empower the community, but they can act as a final 
review committee to move issues where there are two
reasonable but competing view points.

As Joseph points out we haven't ever used the GNU pakcage
maintainers to vote on a stuck issue, but I will arrange
it when the need arises. If you think we're at that point
with wrapper functions, just say so, but it doesn't seem
like it to me.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-15 Thread Carlos O'Donell

On 11/14/18 1:47 PM, Joseph Myers wrote:
> On Wed, 14 Nov 2018, Daniel Colascione wrote:
> 
>> A good demonstration of a new commitment to pragmatism would be
>> merging the trivial wrappers for gettid(2).
> 
> I support the addition of gettid (for use with those syscalls that take 
> tids, and with appropriate documentation explaining the properties of 
> tids) - and, generally, wrappers for all non-obsolescent 
> architecture-independent Linux kernel syscalls, including ones that are 
> very Linux-specific, except maybe for a few interfaces fundamentally 
> inconsistent with glibc managing TLS etc. - they are, at least, no worse 
> as a source of APIs than all the old BSD / SVID interfaces we have from 
> when those were used as sources of APIs.
 
I agree. Documentation is important.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-15 Thread Carlos O'Donell

On 11/14/18 1:47 PM, Joseph Myers wrote:
> On Wed, 14 Nov 2018, Daniel Colascione wrote:
> 
>> A good demonstration of a new commitment to pragmatism would be
>> merging the trivial wrappers for gettid(2).
> 
> I support the addition of gettid (for use with those syscalls that take 
> tids, and with appropriate documentation explaining the properties of 
> tids) - and, generally, wrappers for all non-obsolescent 
> architecture-independent Linux kernel syscalls, including ones that are 
> very Linux-specific, except maybe for a few interfaces fundamentally 
> inconsistent with glibc managing TLS etc. - they are, at least, no worse 
> as a source of APIs than all the old BSD / SVID interfaces we have from 
> when those were used as sources of APIs.
 
I agree. Documentation is important.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-14 Thread Carlos O'Donell

On 11/14/18 6:58 AM, Szabolcs Nagy wrote:
> an actual proposal in the thread that i think is
> worth considering is to make the linux syscall
> design process involve libc devs so the c api is
> designed together with the syscall abi.

Right, I see at least 2 actionable items:

* "The Checklist" which everyone making a syscall should
  follow and we create the checklist with input from both
  sides and it becomes the thing you reference e.g.
  "Did you follow the checklist? Where is X?"

* Programmatic / Machine readable description of syscalls.
  This way the kernel gives users the ability to autogenerate
  all the wrappers *if they want to* in a consistent way that
  matches this syscall description format.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-14 Thread Carlos O'Donell

On 11/14/18 6:58 AM, Szabolcs Nagy wrote:
> an actual proposal in the thread that i think is
> worth considering is to make the linux syscall
> design process involve libc devs so the c api is
> designed together with the syscall abi.

Right, I see at least 2 actionable items:

* "The Checklist" which everyone making a syscall should
  follow and we create the checklist with input from both
  sides and it becomes the thing you reference e.g.
  "Did you follow the checklist? Where is X?"

* Programmatic / Machine readable description of syscalls.
  This way the kernel gives users the ability to autogenerate
  all the wrappers *if they want to* in a consistent way that
  matches this syscall description format.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-13 Thread Carlos O'Donell

On 11/12/18 11:43 AM, Joseph Myers wrote:
> On Sun, 11 Nov 2018, Florian Weimer wrote:
> 
>> People may have disappeared from glibc development who have objected to
>> gettid.  I thought this was the case with strlcpy/strlcat, but it was
>> not.
> 
> Well, I know of two main people who were objecting to the notion of adding 
> bindings for all non-obsolescent syscalls, Linux-specific if not suitable 
> for adding to the OS-independent GNU API, and neither seems to have posted 
> in the past year.
> 
>> At present, it takes one semi-active glibc contributor to block addition
>> of a system call.  The process to override a sustained objection has
>> never been used successfully, and it is a lot of work to get it even
>> started.
> 
> We don't have such a process.  (I've suggested, e.g. in conversation with 
> Carlos at the Cauldron, that we should have something involving a 
> supermajority vote of the GNU maintainers for glibc in cases where we're 
> unable to reach a consensus in the community as a whole.)
 
... and I need a good excuse to propose such a process :-)

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-13 Thread Carlos O'Donell

On 11/12/18 11:43 AM, Joseph Myers wrote:
> On Sun, 11 Nov 2018, Florian Weimer wrote:
> 
>> People may have disappeared from glibc development who have objected to
>> gettid.  I thought this was the case with strlcpy/strlcat, but it was
>> not.
> 
> Well, I know of two main people who were objecting to the notion of adding 
> bindings for all non-obsolescent syscalls, Linux-specific if not suitable 
> for adding to the OS-independent GNU API, and neither seems to have posted 
> in the past year.
> 
>> At present, it takes one semi-active glibc contributor to block addition
>> of a system call.  The process to override a sustained objection has
>> never been used successfully, and it is a lot of work to get it even
>> started.
> 
> We don't have such a process.  (I've suggested, e.g. in conversation with 
> Carlos at the Cauldron, that we should have something involving a 
> supermajority vote of the GNU maintainers for glibc in cases where we're 
> unable to reach a consensus in the community as a whole.)
 
... and I need a good excuse to propose such a process :-)

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-11 Thread Carlos O'Donell

On 11/10/18 2:20 PM, Greg KH wrote:
> Also, what about the basic work of making sure our uapi header files can
> actually be used untouched by a libc?  That isn't the case these days as
> the bionic maintainers like to keep reminding me.  That might be a good
> thing to do _before_ trying to add new things like syscall wrappers.
I agree completely. There are many steps in the checklist to writing
a new syscall, heck we should probably have a checklist!

Socially the issue is difficult because the various communities only
marginally share the same network of developers, care about different
features, or the same features with different priorities.

That doesn't mean we shouldn't try to integrate better. As was pointed
out, various people from the userspace and toolchain communities are
going to LPC to do just this.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-11 Thread Carlos O'Donell

On 11/10/18 2:20 PM, Greg KH wrote:
> Also, what about the basic work of making sure our uapi header files can
> actually be used untouched by a libc?  That isn't the case these days as
> the bionic maintainers like to keep reminding me.  That might be a good
> thing to do _before_ trying to add new things like syscall wrappers.
I agree completely. There are many steps in the checklist to writing
a new syscall, heck we should probably have a checklist!

Socially the issue is difficult because the various communities only
marginally share the same network of developers, care about different
features, or the same features with different priorities.

That doesn't mean we shouldn't try to integrate better. As was pointed
out, various people from the userspace and toolchain communities are
going to LPC to do just this.

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-11 Thread Carlos O'Donell

On 11/10/18 2:58 PM, Vlastimil Babka wrote:
> On 11/10/18 8:20 PM, Greg KH wrote:
>> On Sat, Nov 10, 2018 at 10:52:06AM -0800, Daniel Colascione wrote:
>>> Now that glibc is basically not adding any new system call wrappers,
>>
>> Why are they not doing that anymore?
> 
> FYI just noticed there's a topic relevant to this in LPC Toolchain MC:
> 
> https://linuxplumbersconf.org/event/2/contributions/149/

Yes, and Adhemerval put it there on purpose to continue the discussion
between glibc developers and kernel developers. Florian Weimer and I have
both provided input to that talk, so if something comes out of the talk
and you want to talk more, please just reach out.

I hope that kernel developers interested in this topic will attend
and discuss the various ways forward on certain interesting topics :-)

-- 
Cheers,
Carlos.

Re: Official Linux system wrapper library?

2018-11-11 Thread Carlos O'Donell

On 11/10/18 2:58 PM, Vlastimil Babka wrote:
> On 11/10/18 8:20 PM, Greg KH wrote:
>> On Sat, Nov 10, 2018 at 10:52:06AM -0800, Daniel Colascione wrote:
>>> Now that glibc is basically not adding any new system call wrappers,
>>
>> Why are they not doing that anymore?
> 
> FYI just noticed there's a topic relevant to this in LPC Toolchain MC:
> 
> https://linuxplumbersconf.org/event/2/contributions/149/

Yes, and Adhemerval put it there on purpose to continue the discussion
between glibc developers and kernel developers. Florian Weimer and I have
both provided input to that talk, so if something comes out of the talk
and you want to talk more, please just reach out.

I hope that kernel developers interested in this topic will attend
and discuss the various ways forward on certain interesting topics :-)

-- 
Cheers,
Carlos.

Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-16 Thread Carlos O'Donell

On 10/13/2017 02:36 PM, Mathieu Desnoyers wrote:
> I also spoke to Carlos O'Donell from glibc about it, and he was very
> excited about the possible use of rseq for malloc speedup/memory usage
> improvement. But again, I don't see a project like glibc starting to
> use a system call for which the number will have to be bumped every
> now and then.
> 
> I would *not* want this merged before we gather significant user feedback.
> The question is: how can we best gather that feedback ?
> 
> Perhaps one approach could be to reserve system call numbers for
> sys_rseq and sys_cpu_opv, but leave them unimplemented for now
> (ENOSYS). This would lessen the amount of pain user-space would have
> to go through to adapt to system call number changes, and we could
> provide the implementation of those system calls in a -rseq tree, which
> I'd be happy to maintain in order to gather feedback. If it ends up that
> it's not the right approach after all, all we would have lost is two
> unwired system call numbers per architecture.
> 
> Thoughts ?

We have similar problems in glibc with API/ABI issues, and there 
isn't really any way around this except to present a reviewer with
an overwhelming amount of evidence that use cases exist and work.

How you collect, summarize, and analyze that overwhelming evidence
is up to you, specific to each change, and difficult to do accurately
and with any large measure of statistical confidence. The reviewer
has to basically trust you to some degree :-)

We should probably be working together to present the case to Linus
that glibc is immediately ready to use restartable sequences and
provide the use cases we have in mind with a public branch showing
the work and the results. This would at least convince people that
if we turned this on, every application would get benefit from a
GNU system running glibc (which is less than the number of people
running Linux on phones these days so YMMV).

As always, glibc can use any new kernel features immediately,
and only needs to detect presence at startup.

My only concrete suggestion would be to add a level of indirection,
some way to fetch the new syscalls dynamically at program startup,
then I could construct a way to call them, mark it RO, and use that
e.g. a userspace syscall table populated dynamically for experimental
syscalls (semantic changes would require changes in the name used for
lookup). It's just an expansion of the number of bits used to identify
the syscall. Obviously such a patch is only for downstream testing
in order to gather consensus for upstream patches.

-- 
Cheers,
Carlos.

Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call

2017-10-16 Thread Carlos O'Donell

On 10/13/2017 02:36 PM, Mathieu Desnoyers wrote:
> I also spoke to Carlos O'Donell from glibc about it, and he was very
> excited about the possible use of rseq for malloc speedup/memory usage
> improvement. But again, I don't see a project like glibc starting to
> use a system call for which the number will have to be bumped every
> now and then.
> 
> I would *not* want this merged before we gather significant user feedback.
> The question is: how can we best gather that feedback ?
> 
> Perhaps one approach could be to reserve system call numbers for
> sys_rseq and sys_cpu_opv, but leave them unimplemented for now
> (ENOSYS). This would lessen the amount of pain user-space would have
> to go through to adapt to system call number changes, and we could
> provide the implementation of those system calls in a -rseq tree, which
> I'd be happy to maintain in order to gather feedback. If it ends up that
> it's not the right approach after all, all we would have lost is two
> unwired system call numbers per architecture.
> 
> Thoughts ?

We have similar problems in glibc with API/ABI issues, and there 
isn't really any way around this except to present a reviewer with
an overwhelming amount of evidence that use cases exist and work.

How you collect, summarize, and analyze that overwhelming evidence
is up to you, specific to each change, and difficult to do accurately
and with any large measure of statistical confidence. The reviewer
has to basically trust you to some degree :-)

We should probably be working together to present the case to Linus
that glibc is immediately ready to use restartable sequences and
provide the use cases we have in mind with a public branch showing
the work and the results. This would at least convince people that
if we turned this on, every application would get benefit from a
GNU system running glibc (which is less than the number of people
running Linux on phones these days so YMMV).

As always, glibc can use any new kernel features immediately,
and only needs to detect presence at startup.

My only concrete suggestion would be to add a level of indirection,
some way to fetch the new syscalls dynamically at program startup,
then I could construct a way to call them, mark it RO, and use that
e.g. a userspace syscall table populated dynamically for experimental
syscalls (semantic changes would require changes in the name used for
lookup). It's just an expansion of the number of bits used to identify
the syscall. Obviously such a patch is only for downstream testing
in order to gather consensus for upstream patches.

-- 
Cheers,
Carlos.

Re: RFC: Audit Kernel Container IDs

2017-09-13 Thread Carlos O'Donell

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> Containers are a userspace concept.  The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view and the unshare syscall and the semantics that
arise from the use of it.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
container's if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?

> The Linux audit system needs a way to be able to track the container
> provenance of events and actions.  Audit needs the kernel's help to do
> this.

* Why does the Linux audit system need to tracker container provenance?

  - How does it help to provide better audit messages?

  - Is it be enough to list the namespace that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.

> Since the concept of a container is entirely a userspace concept, a
> trigger signal from the userspace container orchestration system
> initiates this.  This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.

Please don't use the word 'signal', I suggest 'register' since you are
writing to a filesystem.

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.
> A bind mount of nsfs may be necessary in the container orchestrator's
> mntNS.
> 
> Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> filesystem to have this action permitted.  At that time, record the
> child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
your continued operation as we have today?

> the child container's first process (which may become the container's
> "init" process) process ID (referenced from the initial PID namespace),
> all namespace IDs (in the form of a nsfs device number and inode number
> tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> op=$action field.

What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration progress,
registering all tid/pids in the same grouping? Is that true?

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
> 
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the came container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you forsee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable.  Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> 
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.

Re: RFC: Audit Kernel Container IDs

2017-09-13 Thread Carlos O'Donell

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> Containers are a userspace concept.  The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view and the unshare syscall and the semantics that
arise from the use of it.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
container's if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?

> The Linux audit system needs a way to be able to track the container
> provenance of events and actions.  Audit needs the kernel's help to do
> this.

* Why does the Linux audit system need to tracker container provenance?

  - How does it help to provide better audit messages?

  - Is it be enough to list the namespace that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.

> Since the concept of a container is entirely a userspace concept, a
> trigger signal from the userspace container orchestration system
> initiates this.  This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.

Please don't use the word 'signal', I suggest 'register' since you are
writing to a filesystem.

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.
> A bind mount of nsfs may be necessary in the container orchestrator's
> mntNS.
> 
> Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> filesystem to have this action permitted.  At that time, record the
> child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
your continued operation as we have today?

> the child container's first process (which may become the container's
> "init" process) process ID (referenced from the initial PID namespace),
> all namespace IDs (in the form of a nsfs device number and inode number
> tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> op=$action field.

What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration progress,
registering all tid/pids in the same grouping? Is that true?

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
> 
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the came container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you forsee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable.  Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> 
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.

Re: new ELF marking

2017-08-09 Thread Carlos O'Donell

On 08/09/2017 05:24 PM, H.J. Lu wrote:
> On Wed, Aug 9, 2017 at 1:32 PM, Kostya Serebryany  wrote:

 I believe this would only be an output bit, but I'm not sure how it
 would be wired into binutils. Kostya, do you know any details about
 how AddressSanitizer might be able to create this ELF note?
>>
>> I don't, hopefully H.J's suggestion works.
>> Will it be backward compatible?
>> (i.e. will the binaries built in the new way work on the old kernels?)
>>
>>
> 
> Yes,  it is backward compatible by design.

... only as long as the semantics implied by the flags are hints that
can be ignored.

-- 
Cheers,
Carlos.

Re: new ELF marking

2017-08-09 Thread Carlos O'Donell

On 08/09/2017 05:24 PM, H.J. Lu wrote:
> On Wed, Aug 9, 2017 at 1:32 PM, Kostya Serebryany  wrote:

 I believe this would only be an output bit, but I'm not sure how it
 would be wired into binutils. Kostya, do you know any details about
 how AddressSanitizer might be able to create this ELF note?
>>
>> I don't, hopefully H.J's suggestion works.
>> Will it be backward compatible?
>> (i.e. will the binaries built in the new way work on the old kernels?)
>>
>>
> 
> Yes,  it is backward compatible by design.

... only as long as the semantics implied by the flags are hints that
can be ignored.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-04-25 Thread Carlos O'Donell

On 04/25/2017 02:45 AM, Hauke Mehrtens wrote:
> On 03/08/2017 05:39 PM, Carlos O'Donell wrote:
>> Any header needing compat with a libc includes libc-compat.h (per the 
>> documented way the model works). With this patch any included linux kernel
>> header that also includes libc-compat.h would immediately define all 
>> the __UAPI_DEF_* constants to 1 as-if it had defined those structures, 
>> but it has not.
>>
>> For example, with these two patches applied, the inclusion of linux/if.h
>> would define __UAPI_DEF_XATTR to 1, but linux/if.h has not defined
>> XATTR_CREATE or other constants, so a subsequent inclusion sys/xattrs.h
>> from userspace would _not_ define XATTR_CREATE because __UAPI_DEF_XATTR set
>> to 1 indicates the kernel has.
>>
>> I don't want to read into the model you are proposing and would rather you
>> document the semantics clearly so we can all see what you mean.
> 
> What about moving the code from libc-comapt.h into the specific header
> files? This way including linux/if.h would not have an impact on
> __UAPI_DEF_XATTR, because this is defined in linux/xattr.h. We would
> still have a problem when the specific Linux header file gets included
> before the libc header file, but at least musl does not support this anyway.

The point of libc-compat.h is to contain the libc-related logic to a single 
header
where it can be reviewed easily as a whole for each libc.

Headers that include libc-compat.h need not have any libc-related logic, they 
need
only guard their structures with the appropriate __UAPI_DEF* macros per the 
rules
described in libc-compat.h.

This way we minimize any changes to the header files and keep the complex
logic in one place where the libc authors can discuss it.

In glibc right now we support including linux or glibc header files first,
and this has always been a requirement from the start. This requirement dictates
that the kernel know which libc it's being used with so it can tailor 
coordination.

If musl only needs header coordination in one direction, then support only that
direction, but please do not presume to simplify the code by deleting a bunch of
things that were worked into the kernel to ensure header inclusion ordering 
works
in both ways.

I have lots of customers writing code that includes kernel headers and letting 
them
include headers safely in any order where both headers provide the same define 
is
the simplest thing.

If you have a suggestion, please describe your proposed model in detail, and 
provide
a patch that explains and show how it works.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-04-25 Thread Carlos O'Donell

On 04/25/2017 02:45 AM, Hauke Mehrtens wrote:
> On 03/08/2017 05:39 PM, Carlos O'Donell wrote:
>> Any header needing compat with a libc includes libc-compat.h (per the 
>> documented way the model works). With this patch any included linux kernel
>> header that also includes libc-compat.h would immediately define all 
>> the __UAPI_DEF_* constants to 1 as-if it had defined those structures, 
>> but it has not.
>>
>> For example, with these two patches applied, the inclusion of linux/if.h
>> would define __UAPI_DEF_XATTR to 1, but linux/if.h has not defined
>> XATTR_CREATE or other constants, so a subsequent inclusion sys/xattrs.h
>> from userspace would _not_ define XATTR_CREATE because __UAPI_DEF_XATTR set
>> to 1 indicates the kernel has.
>>
>> I don't want to read into the model you are proposing and would rather you
>> document the semantics clearly so we can all see what you mean.
> 
> What about moving the code from libc-comapt.h into the specific header
> files? This way including linux/if.h would not have an impact on
> __UAPI_DEF_XATTR, because this is defined in linux/xattr.h. We would
> still have a problem when the specific Linux header file gets included
> before the libc header file, but at least musl does not support this anyway.

The point of libc-compat.h is to contain the libc-related logic to a single 
header
where it can be reviewed easily as a whole for each libc.

Headers that include libc-compat.h need not have any libc-related logic, they 
need
only guard their structures with the appropriate __UAPI_DEF* macros per the 
rules
described in libc-compat.h.

This way we minimize any changes to the header files and keep the complex
logic in one place where the libc authors can discuss it.

In glibc right now we support including linux or glibc header files first,
and this has always been a requirement from the start. This requirement dictates
that the kernel know which libc it's being used with so it can tailor 
coordination.

If musl only needs header coordination in one direction, then support only that
direction, but please do not presume to simplify the code by deleting a bunch of
things that were worked into the kernel to ensure header inclusion ordering 
works
in both ways.

I have lots of customers writing code that includes kernel headers and letting 
them
include headers safely in any order where both headers provide the same define 
is
the simplest thing.

If you have a suggestion, please describe your proposed model in detail, and 
provide
a patch that explains and show how it works.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-04-25 Thread Carlos O'Donell

On 04/25/2017 02:37 AM, Hauke Mehrtens wrote:
> 
> 
> On 03/08/2017 01:46 PM, David Woodhouse wrote:
>> On Fri, 2016-11-11 at 07:08 -0500, Felix Janda wrote:
>>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>>> uapi headers to prevent definition of conflicting structures/constants.
>>> There is no such detection for other c libraries, for them the
>>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>>> conflicting definitions are suppressed.
>>>
>>> This patch enables non-glibc c libraries to request the suppression of
>>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>>> as 0.
>>
>> Ick. It's fairly horrid for kernel headers to be reacting to __GLIBC__
>> in any way. That's just wrong.
>>
>> It makes more sense for C libraries to define the __UAPI_DEF_xxx for
>> themselves as and when they add their own support for certain things,
>> and for the kernel not to have incestuous knowledge of them.
>>
>> The part you add here in the #else /* !__GLIBC__ */ part is what we
>> should do at *all* times.
>>
>> I understand that we'll want to grandfather in the glibc horridness,
>> but let's make it clear that that's what it is, by letting it set the
>> appropriate __UAPI_DEF_xxx macros to zero, and then continue through to
>> your new part. Something like this (incremental to yours):
> 
> Felix's and this change and are looking better than my patches here:
> https://lkml.org/lkml/2017/3/12/235
> 
> Is someone working on brining this into the mainline kernel?
> 
> Is it correct that only the comments should be improved?
> musl only supports including the musl header files before the kernel
> anyway, I assume that it is not needed to make the kernel uapi code
> support also the other order.

Please repost to linux-api so other relevant C library authors can review
the changes and comment on how they might be harmonized.

Whether or not you  support both inclusion orders, kernel first, or libc first,
is a property of your libc implementation.

Today glibc aims to support both inclusion orders since we see applications
with either order, and the ordering should not matter in this case. You either
get one or the other without needing to know any special rules about header
ordering.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-04-25 Thread Carlos O'Donell

On 04/25/2017 02:37 AM, Hauke Mehrtens wrote:
> 
> 
> On 03/08/2017 01:46 PM, David Woodhouse wrote:
>> On Fri, 2016-11-11 at 07:08 -0500, Felix Janda wrote:
>>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>>> uapi headers to prevent definition of conflicting structures/constants.
>>> There is no such detection for other c libraries, for them the
>>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>>> conflicting definitions are suppressed.
>>>
>>> This patch enables non-glibc c libraries to request the suppression of
>>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>>> as 0.
>>
>> Ick. It's fairly horrid for kernel headers to be reacting to __GLIBC__
>> in any way. That's just wrong.
>>
>> It makes more sense for C libraries to define the __UAPI_DEF_xxx for
>> themselves as and when they add their own support for certain things,
>> and for the kernel not to have incestuous knowledge of them.
>>
>> The part you add here in the #else /* !__GLIBC__ */ part is what we
>> should do at *all* times.
>>
>> I understand that we'll want to grandfather in the glibc horridness,
>> but let's make it clear that that's what it is, by letting it set the
>> appropriate __UAPI_DEF_xxx macros to zero, and then continue through to
>> your new part. Something like this (incremental to yours):
> 
> Felix's and this change and are looking better than my patches here:
> https://lkml.org/lkml/2017/3/12/235
> 
> Is someone working on brining this into the mainline kernel?
> 
> Is it correct that only the comments should be improved?
> musl only supports including the musl header files before the kernel
> anyway, I assume that it is not needed to make the kernel uapi code
> support also the other order.

Please repost to linux-api so other relevant C library authors can review
the changes and comment on how they might be harmonized.

Whether or not you  support both inclusion orders, kernel first, or libc first,
is a property of your libc implementation.

Today glibc aims to support both inclusion orders since we see applications
with either order, and the ordering should not matter in this case. You either
get one or the other without needing to know any special rules about header
ordering.

-- 
Cheers,
Carlos.

Re: Formal description of system call interface

2017-04-21 Thread Carlos O'Donell

On 11/06/2016 05:39 PM, Dmitry Vyukov wrote:
> Hello,
> 
> This is notes from the discussion we had at Linux Plumbers this week
> regarding providing a formal description of system calls (user API).
> 
> The idea come up in the context of syzkaller, syscall fuzzer, which
> has descriptions for 1000+ syscalls mostly concentrating on types of
> arguments and return values. However, problems are that a small group
> of people can't write descriptions for all syscalls; can't keep them
> up-to-date and doesn't have necessary domain expertise to do correct
> descriptions in some cases.
> 
> We identified a surprisingly large number of potential users for such
> descriptions:
>  - fuzzers (syzkaller, trinity, iknowthis)
>  - strace/syscall tracepoints (capturing indirect arguments and
>printing human-readable info)
>  - generation of entry points for C libraries (glibc, liblinux
>(raw syscalls), Go runtime, clang/gcc sanitizers)

To add another:

Auto-generation of SYS_* macros (sys/syscalls.h) in glibc which are
required for syscall().

It would mean we could copy the list directly from the most recently
released kernel instead of relying on distro kernel UAPI headers package.

We need this information in the released kernel.

>  - valgrind/sanitizers checking of input/output values of syscalls
>  - seccomp filters (minijail, libseccomp) need to know interfaces
>to generate wrappers
>  - safety certification (requires syscall specifications)
>  - man pages (could provide actual syscall interface rather than
>glibc wrapper interface, it was noted that possible errno values
>is an important part here)
>  - generation of syscall argument validation in kernel (fast version
>is enabled all the time, extended is optional)
> 
> It's worth noting that number of these users already have some
> descriptions that suffer from the same problems of being
> incomplete/outdated. See also linux-api mailing list description
> which lists an overlapping set of cases:
> https://www.kernel.org/doc/man-pages/linux-api-ml.html
> 
> We discussed several implementation approaches:
>  - Extracting the interface from kernel code either by parsing
>sources or using dwarf. However, current source doesn't have
>enough info: fd are specified as int, while we need to know exact
>fd type (e.g. fd_epoll_t); not possible to extract flag set for
>'int flags'; don't know what is 'char*'.
>  - Making the formal description the master copy and generating
>kernel code from it (structs, flags, syscall entry points).
>This is quite pervasive, but otherwise should work.
>  - Doing what syzkaller currently does: providing the description
>on side. Verifying that description and implementation match
>is an important piece here. We can do dynamic checking in syscall
>entry points (print warnings on anything that does not match
>descriptions); or static checking (but again kernel code doesn't
>have enough info for checking).
> 
> We decided to pursue the last option as the least pervasive for now.
> Several locations for the descriptions were proposed: with source code,
> include/uapi, Documentation.
> 
> Action points:
>  - polish DSL for description (must be extensible)
>  - write a parser for DSL
>  - provide definition for mm syscalls (mm is reasonably simple
>and self-contained)
>  - see if we can do validation of mm arguments

Have we made any progress on these points?

> It was acknowledged that whatever we do now it will probably
> significantly change and evolve over time as we better understand
> what we need and what works.
> 
> For the reference, current syzkaller descriptions are in txt files here:
> https://github.com/google/syzkaller/tree/master/sys
> The most generic syscalls are here:
> https://github.com/google/syzkaller/blob/master/sys/sys.txt
> Specific subsystems are described in separate files, e.g.:
> https://github.com/google/syzkaller/blob/master/sys/bpf.txt
> https://github.com/google/syzkaller/blob/master/sys/tty.txt
> https://github.com/google/syzkaller/blob/master/sys/sndseq.txt
> The descriptions should be self-explanatory, but just in case there
> is also a semi-formal DSL specification here:
> https://github.com/google/syzkaller/blob/master/sys/README.md
> 
> Taking the opportunity, if you see that something is missing/wrong
> in the descriptions of the subsystem you care about, or if it is not
> described at all, fixes are welcome.
> 
> Thanks

-- 
Cheers,
Carlos.

Re: Formal description of system call interface

2017-04-21 Thread Carlos O'Donell

On 11/06/2016 05:39 PM, Dmitry Vyukov wrote:
> Hello,
> 
> This is notes from the discussion we had at Linux Plumbers this week
> regarding providing a formal description of system calls (user API).
> 
> The idea come up in the context of syzkaller, syscall fuzzer, which
> has descriptions for 1000+ syscalls mostly concentrating on types of
> arguments and return values. However, problems are that a small group
> of people can't write descriptions for all syscalls; can't keep them
> up-to-date and doesn't have necessary domain expertise to do correct
> descriptions in some cases.
> 
> We identified a surprisingly large number of potential users for such
> descriptions:
>  - fuzzers (syzkaller, trinity, iknowthis)
>  - strace/syscall tracepoints (capturing indirect arguments and
>printing human-readable info)
>  - generation of entry points for C libraries (glibc, liblinux
>(raw syscalls), Go runtime, clang/gcc sanitizers)

To add another:

Auto-generation of SYS_* macros (sys/syscalls.h) in glibc which are
required for syscall().

It would mean we could copy the list directly from the most recently
released kernel instead of relying on distro kernel UAPI headers package.

We need this information in the released kernel.

>  - valgrind/sanitizers checking of input/output values of syscalls
>  - seccomp filters (minijail, libseccomp) need to know interfaces
>to generate wrappers
>  - safety certification (requires syscall specifications)
>  - man pages (could provide actual syscall interface rather than
>glibc wrapper interface, it was noted that possible errno values
>is an important part here)
>  - generation of syscall argument validation in kernel (fast version
>is enabled all the time, extended is optional)
> 
> It's worth noting that number of these users already have some
> descriptions that suffer from the same problems of being
> incomplete/outdated. See also linux-api mailing list description
> which lists an overlapping set of cases:
> https://www.kernel.org/doc/man-pages/linux-api-ml.html
> 
> We discussed several implementation approaches:
>  - Extracting the interface from kernel code either by parsing
>sources or using dwarf. However, current source doesn't have
>enough info: fd are specified as int, while we need to know exact
>fd type (e.g. fd_epoll_t); not possible to extract flag set for
>'int flags'; don't know what is 'char*'.
>  - Making the formal description the master copy and generating
>kernel code from it (structs, flags, syscall entry points).
>This is quite pervasive, but otherwise should work.
>  - Doing what syzkaller currently does: providing the description
>on side. Verifying that description and implementation match
>is an important piece here. We can do dynamic checking in syscall
>entry points (print warnings on anything that does not match
>descriptions); or static checking (but again kernel code doesn't
>have enough info for checking).
> 
> We decided to pursue the last option as the least pervasive for now.
> Several locations for the descriptions were proposed: with source code,
> include/uapi, Documentation.
> 
> Action points:
>  - polish DSL for description (must be extensible)
>  - write a parser for DSL
>  - provide definition for mm syscalls (mm is reasonably simple
>and self-contained)
>  - see if we can do validation of mm arguments

Have we made any progress on these points?

> It was acknowledged that whatever we do now it will probably
> significantly change and evolve over time as we better understand
> what we need and what works.
> 
> For the reference, current syzkaller descriptions are in txt files here:
> https://github.com/google/syzkaller/tree/master/sys
> The most generic syscalls are here:
> https://github.com/google/syzkaller/blob/master/sys/sys.txt
> Specific subsystems are described in separate files, e.g.:
> https://github.com/google/syzkaller/blob/master/sys/bpf.txt
> https://github.com/google/syzkaller/blob/master/sys/tty.txt
> https://github.com/google/syzkaller/blob/master/sys/sndseq.txt
> The descriptions should be self-explanatory, but just in case there
> is also a semi-formal DSL specification here:
> https://github.com/google/syzkaller/blob/master/sys/README.md
> 
> Taking the opportunity, if you see that something is missing/wrong
> in the descriptions of the subsystem you care about, or if it is not
> described at all, fixes are welcome.
> 
> Thanks

-- 
Cheers,
Carlos.

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 07:14 PM, Szabolcs Nagy wrote:
> * Carlos O'Donell <car...@redhat.com> [2017-03-08 10:53:00 -0500]:
>> On 11/11/2016 07:08 AM, Felix Janda wrote:
>>> fixes the following compiler errors when  is included
>>> after musl :
>>>
>>> ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
>>> ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
>>> ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'
>>
>> Do you have plans for fixing the error when the inclusion order is the other 
>> way?
> 
> the other way (linux header included first) is
> problematic because linux headers don't follow
> all the standards the libc follows, they violate
> namespace rules in their struct definitions, so
> the libc definitions are necessarily incompatible
> with them and thus different translation units can
> end up refering to the same object through
> incompatible types which is undefined.
> (even if the abi matches and thus works across
> the syscall interface, a sufficiently smart
> toolchain can break such code at link time,
> and since the libc itself uses its own definitons
> that's what user code should use too).
> 
> there should be a way to include standard conform
> libc headers and linux headers into the same tu,
> at least the case when all conflicting definitions
> come from the libc should work and i think that
> should be the scope of these libc-compat.h changes.
> (of course if glibc tries to support arbitrary
> interleavings then the changes should not break that)

You can get non-standard defines even when including the
linux headers _after_ libc headers because linux headers
should rightly continue to define things that are required
for linux-specific applications.

IMO the fact that the UAPI headers may cause problems with
standards conformance is orthogonal to the discussion of 
_how_ we fix inclusion order issues.

Some of the network headers can be used in relative safety
and need to be used for some applications. It is those cases
where I'd like to see an inclusion guard design that works
for both inclusion orders.

-- 
Cheers,
Carlos.

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 07:14 PM, Szabolcs Nagy wrote:
> * Carlos O'Donell  [2017-03-08 10:53:00 -0500]:
>> On 11/11/2016 07:08 AM, Felix Janda wrote:
>>> fixes the following compiler errors when  is included
>>> after musl :
>>>
>>> ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
>>> ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
>>> ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'
>>
>> Do you have plans for fixing the error when the inclusion order is the other 
>> way?
> 
> the other way (linux header included first) is
> problematic because linux headers don't follow
> all the standards the libc follows, they violate
> namespace rules in their struct definitions, so
> the libc definitions are necessarily incompatible
> with them and thus different translation units can
> end up refering to the same object through
> incompatible types which is undefined.
> (even if the abi matches and thus works across
> the syscall interface, a sufficiently smart
> toolchain can break such code at link time,
> and since the libc itself uses its own definitons
> that's what user code should use too).
> 
> there should be a way to include standard conform
> libc headers and linux headers into the same tu,
> at least the case when all conflicting definitions
> come from the libc should work and i think that
> should be the scope of these libc-compat.h changes.
> (of course if glibc tries to support arbitrary
> interleavings then the changes should not break that)

You can get non-standard defines even when including the
linux headers _after_ libc headers because linux headers
should rightly continue to define things that are required
for linux-specific applications.

IMO the fact that the UAPI headers may cause problems with
standards conformance is orthogonal to the discussion of 
_how_ we fix inclusion order issues.

Some of the network headers can be used in relative safety
and need to be used for some applications. It is those cases
where I'd like to see an inclusion guard design that works
for both inclusion orders.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 11:25 AM, Rich Felker wrote:
> On Wed, Mar 08, 2017 at 10:53:00AM -0500, Carlos O'Donell wrote:
>> On 11/11/2016 07:08 AM, Felix Janda wrote:
>>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>>> uapi headers to prevent definition of conflicting structures/constants.
>>> There is no such detection for other c libraries, for them the
>>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>>> conflicting definitions are suppressed.
>>>
>>> This patch enables non-glibc c libraries to request the suppression of
>>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>>> as 0.
>>>
>>> This patch together with the recent musl libc commit
>>>
>>> http://git.musl-libc.org/cgit/musl/commit/?id=04983f2272382af92eb8f8838964ff944fbb8258
>>
>> Would it be possible to amend the musl patch to define the macros to 1.
> 
> I don't follow. They're defined to 0 explicitly to tell the kernel
> headers not to define their own versions of these structs, etc. since
> they would clash. Defining to 1 would have the opposite meaning.

My apologies, I must have misread the original musl patch.

Defining them to a known value is exactly what I was looking for.

The other outstanding questions remain.

-- 
Cheers,
Carlos.

Re: [musl] Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 11:25 AM, Rich Felker wrote:
> On Wed, Mar 08, 2017 at 10:53:00AM -0500, Carlos O'Donell wrote:
>> On 11/11/2016 07:08 AM, Felix Janda wrote:
>>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>>> uapi headers to prevent definition of conflicting structures/constants.
>>> There is no such detection for other c libraries, for them the
>>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>>> conflicting definitions are suppressed.
>>>
>>> This patch enables non-glibc c libraries to request the suppression of
>>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>>> as 0.
>>>
>>> This patch together with the recent musl libc commit
>>>
>>> http://git.musl-libc.org/cgit/musl/commit/?id=04983f2272382af92eb8f8838964ff944fbb8258
>>
>> Would it be possible to amend the musl patch to define the macros to 1.
> 
> I don't follow. They're defined to 0 explicitly to tell the kernel
> headers not to define their own versions of these structs, etc. since
> they would clash. Defining to 1 would have the opposite meaning.

My apologies, I must have misread the original musl patch.

Defining them to a known value is exactly what I was looking for.

The other outstanding questions remain.

-- 
Cheers,
Carlos.

Re: opaque types instead of union epoll_data

2017-03-08 Thread Carlos O'Donell

On 03/07/2017 12:59 PM, Greg KH wrote:
> On Tue, Mar 07, 2017 at 09:33:57AM -0500, Carlos O'Donell wrote:
>> I don't know anyone else who is working on this problem. Though I
>> have a vested interested in it as a glibc maintainer, since it would
>> be nice to avoid duplicate headers where possible between the kernel
>> and userspace.
> 
> I've been working with the bionic developers to try to help clean up
> some of these things.  Yes, epoll is messy here as you all have pointed
> out, and I'll be glad to review any patches that people submit for this.
> I need to get some spare time (hah!) to work through some of the issues
> that the bionic developers have already pointed out to me...

Sounds awesome. Please keep linux-api in the loop so the other libc
maintainers can follow along. From a glibc perspective I'm interested,
but also busy (see the other thread about libc-compat.h which has
immediate impact on header coordination).

-- 
Cheers,
Carlos.

Re: opaque types instead of union epoll_data

2017-03-08 Thread Carlos O'Donell

On 03/07/2017 12:59 PM, Greg KH wrote:
> On Tue, Mar 07, 2017 at 09:33:57AM -0500, Carlos O'Donell wrote:
>> I don't know anyone else who is working on this problem. Though I
>> have a vested interested in it as a glibc maintainer, since it would
>> be nice to avoid duplicate headers where possible between the kernel
>> and userspace.
> 
> I've been working with the bionic developers to try to help clean up
> some of these things.  Yes, epoll is messy here as you all have pointed
> out, and I'll be glad to review any patches that people submit for this.
> I need to get some spare time (hah!) to work through some of the issues
> that the bionic developers have already pointed out to me...

Sounds awesome. Please keep linux-api in the loop so the other libc
maintainers can follow along. From a glibc perspective I'm interested,
but also busy (see the other thread about libc-compat.h which has
immediate impact on header coordination).

-- 
Cheers,
Carlos.

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 07:46 AM, David Woodhouse wrote:
> On Fri, 2016-11-11 at 07:08 -0500, Felix Janda wrote:
>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>> uapi headers to prevent definition of conflicting structures/constants.
>> There is no such detection for other c libraries, for them the
>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>> conflicting definitions are suppressed.
>>
>> This patch enables non-glibc c libraries to request the suppression of
>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>> as 0.
> 
> Ick. It's fairly horrid for kernel headers to be reacting to __GLIBC__
> in any way. That's just wrong.
> 
> It makes more sense for C libraries to define the __UAPI_DEF_xxx for
> themselves as and when they add their own support for certain things,
> and for the kernel not to have incestuous knowledge of them.
> 
> The part you add here in the #else /* !__GLIBC__ */ part is what we
> should do at *all* times.
> 
> I understand that we'll want to grandfather in the glibc horridness,
> but let's make it clear that that's what it is, by letting it set the
> appropriate __UAPI_DEF_xxx macros to zero, and then continue through to
> your new part. Something like this (incremental to yours):

Any model we propose should be documented in the header of libc-compat.h
and explain how it works to solve header inclusion order in _both_ directions.
User use cases include header inclusion in _both_ directions and we should look
to support that.

> diff --git a/include/uapi/linux/libc-compat.h 
> b/include/uapi/linux/libc-compat.h
> index c316725..7673158 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -53,41 +53,18 @@
>  
>  /* Coordinate with glibc net/if.h header. */
>  #if defined(_NET_IF_H) && defined(__USE_MISC)
> -
>  /* GLIBC headers included first so don't define anything
>   * that would already be defined. */
> -
>  #define __UAPI_DEF_IF_IFCONF 0
>  #define __UAPI_DEF_IF_IFMAP 0
>  #define __UAPI_DEF_IF_IFNAMSIZ 0
>  #define __UAPI_DEF_IF_IFREQ 0
>  /* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS 0
> -/* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> -#ifndef __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> -#endif /* __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO */
> -
> -#else /* _NET_IF_H */
> -
> -/* Linux headers included first, and we must define everything
> - * we need. The expectation is that glibc will check the
> - * __UAPI_DEF_* defines and adjust appropriately. */
> -
> -#define __UAPI_DEF_IF_IFCONF 1
> -#define __UAPI_DEF_IF_IFMAP 1
> -#define __UAPI_DEF_IF_IFNAMSIZ 1
> -#define __UAPI_DEF_IF_IFREQ 1
> -/* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS 1
> -/* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> -

Any header needing compat with a libc includes libc-compat.h (per the 
documented way the model works). With this patch any included linux kernel
header that also includes libc-compat.h would immediately define all 
the __UAPI_DEF_* constants to 1 as-if it had defined those structures, 
but it has not.

For example, with these two patches applied, the inclusion of linux/if.h
would define __UAPI_DEF_XATTR to 1, but linux/if.h has not defined
XATTR_CREATE or other constants, so a subsequent inclusion sys/xattrs.h
from userspace would _not_ define XATTR_CREATE because __UAPI_DEF_XATTR set
to 1 indicates the kernel has.

I don't want to read into the model you are proposing and would rather you
document the semantics clearly so we can all see what you mean.

>  #endif /* _NET_IF_H */
>  
>  /* Coordinate with glibc netinet/in.h header. */
>  #if defined(_NETINET_IN_H)
> -
>  /* GLIBC headers included first so don't define anything
>   * that would already be defined. */
>  #define __UAPI_DEF_IN_ADDR   0
> @@ -104,8 +81,6 @@
>   * additional in6_addr macros e.g. s6_addr16, and s6_addr32. */
>  #if defined(__USE_MISC) || defined (__USE_GNU)
>  #define __UAPI_DEF_IN6_ADDR_ALT  0
> -#else
> -#define __UAPI_DEF_IN6_ADDR_ALT  1
>  #endif
>  #define __UAPI_DEF_SOCKADDR_IN6  0
>  #define __UAPI_DEF_IPV6_MREQ 0
> @@ -113,62 +88,23 @@
>  #define __UAPI_DEF_IPV6_OPTIONS  0
>  #define __UAPI_DEF_IN6_PKTINFO   0
>  #define __UAPI_DEF_IP6_MTUINFO   0
> -
> -#else
> -
> -/* Linux headers included first, and we must define everything
> - * we need. The expectation is that glibc will check the
> - * __UAPI_DEF_* defines and adjust appropriately. */
> -#define __UAPI_DEF_IN_ADDR   1
> -#define

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 03/08/2017 07:46 AM, David Woodhouse wrote:
> On Fri, 2016-11-11 at 07:08 -0500, Felix Janda wrote:
>> Currently, libc-compat.h detects inclusion of specific glibc headers,
>> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
>> uapi headers to prevent definition of conflicting structures/constants.
>> There is no such detection for other c libraries, for them the
>> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
>> conflicting definitions are suppressed.
>>
>> This patch enables non-glibc c libraries to request the suppression of
>> any specific interface by defining the corresponding _UAPI_DEF_* macro
>> as 0.
> 
> Ick. It's fairly horrid for kernel headers to be reacting to __GLIBC__
> in any way. That's just wrong.
> 
> It makes more sense for C libraries to define the __UAPI_DEF_xxx for
> themselves as and when they add their own support for certain things,
> and for the kernel not to have incestuous knowledge of them.
> 
> The part you add here in the #else /* !__GLIBC__ */ part is what we
> should do at *all* times.
> 
> I understand that we'll want to grandfather in the glibc horridness,
> but let's make it clear that that's what it is, by letting it set the
> appropriate __UAPI_DEF_xxx macros to zero, and then continue through to
> your new part. Something like this (incremental to yours):

Any model we propose should be documented in the header of libc-compat.h
and explain how it works to solve header inclusion order in _both_ directions.
User use cases include header inclusion in _both_ directions and we should look
to support that.

> diff --git a/include/uapi/linux/libc-compat.h 
> b/include/uapi/linux/libc-compat.h
> index c316725..7673158 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -53,41 +53,18 @@
>  
>  /* Coordinate with glibc net/if.h header. */
>  #if defined(_NET_IF_H) && defined(__USE_MISC)
> -
>  /* GLIBC headers included first so don't define anything
>   * that would already be defined. */
> -
>  #define __UAPI_DEF_IF_IFCONF 0
>  #define __UAPI_DEF_IF_IFMAP 0
>  #define __UAPI_DEF_IF_IFNAMSIZ 0
>  #define __UAPI_DEF_IF_IFREQ 0
>  /* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS 0
> -/* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> -#ifndef __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> -#endif /* __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO */
> -
> -#else /* _NET_IF_H */
> -
> -/* Linux headers included first, and we must define everything
> - * we need. The expectation is that glibc will check the
> - * __UAPI_DEF_* defines and adjust appropriately. */
> -
> -#define __UAPI_DEF_IF_IFCONF 1
> -#define __UAPI_DEF_IF_IFMAP 1
> -#define __UAPI_DEF_IF_IFNAMSIZ 1
> -#define __UAPI_DEF_IF_IFREQ 1
> -/* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS 1
> -/* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> -#define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> -

Any header needing compat with a libc includes libc-compat.h (per the 
documented way the model works). With this patch any included linux kernel
header that also includes libc-compat.h would immediately define all 
the __UAPI_DEF_* constants to 1 as-if it had defined those structures, 
but it has not.

For example, with these two patches applied, the inclusion of linux/if.h
would define __UAPI_DEF_XATTR to 1, but linux/if.h has not defined
XATTR_CREATE or other constants, so a subsequent inclusion sys/xattrs.h
from userspace would _not_ define XATTR_CREATE because __UAPI_DEF_XATTR set
to 1 indicates the kernel has.

I don't want to read into the model you are proposing and would rather you
document the semantics clearly so we can all see what you mean.

>  #endif /* _NET_IF_H */
>  
>  /* Coordinate with glibc netinet/in.h header. */
>  #if defined(_NETINET_IN_H)
> -
>  /* GLIBC headers included first so don't define anything
>   * that would already be defined. */
>  #define __UAPI_DEF_IN_ADDR   0
> @@ -104,8 +81,6 @@
>   * additional in6_addr macros e.g. s6_addr16, and s6_addr32. */
>  #if defined(__USE_MISC) || defined (__USE_GNU)
>  #define __UAPI_DEF_IN6_ADDR_ALT  0
> -#else
> -#define __UAPI_DEF_IN6_ADDR_ALT  1
>  #endif
>  #define __UAPI_DEF_SOCKADDR_IN6  0
>  #define __UAPI_DEF_IPV6_MREQ 0
> @@ -113,62 +88,23 @@
>  #define __UAPI_DEF_IPV6_OPTIONS  0
>  #define __UAPI_DEF_IN6_PKTINFO   0
>  #define __UAPI_DEF_IP6_MTUINFO   0
> -
> -#else
> -
> -/* Linux headers included first, and we must define everything
> - * we need. The expectation is that glibc will check the
> - * __UAPI_DEF_* defines and adjust appropriately. */
> -#define __UAPI_DEF_IN_ADDR   1
> -#define

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 11/11/2016 07:08 AM, Felix Janda wrote:
> Currently, libc-compat.h detects inclusion of specific glibc headers,
> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
> uapi headers to prevent definition of conflicting structures/constants.
> There is no such detection for other c libraries, for them the
> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
> conflicting definitions are suppressed.
> 
> This patch enables non-glibc c libraries to request the suppression of
> any specific interface by defining the corresponding _UAPI_DEF_* macro
> as 0.
> 
> This patch together with the recent musl libc commit
> 
> http://git.musl-libc.org/cgit/musl/commit/?id=04983f2272382af92eb8f8838964ff944fbb8258

Would it be possible to amend the musl patch to define the macros to 1.

A defined or undefined macro is typo prone.

Please see this wiki for examples of the problems it causes:
https://sourceware.org/glibc/wiki/Wundef

Having the UAPI macros always defined and be 0 or 1 allows you to create
tooling to check for problems, while an undefined macro is just that,
either a mistake or the intended value.

> fixes the following compiler errors when  is included
> after musl :
> 
> ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
> ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
> ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'

Do you have plans for fixing the error when the inclusion order is the other 
way?

It will require kernel header changes to have each individual kernel header 
define
the requisite __UAPI_* values to 1 e.g. moving libc-compat.h into the header.

Do you plan to do that in a next step?

> Signed-off-by: Felix Janda 
> ---
> The previous mail misspelled the kernel mailing list. I am sorry for
> this resulting spam.
> 
> There has already been one reply, which is available at
> 
> http://www.openwall.com/lists/musl/2016/11/11/2
> ---
>  include/uapi/linux/libc-compat.h | 52 
> 
>  1 file changed, 52 insertions(+)
> 
> diff --git a/include/uapi/linux/libc-compat.h 
> b/include/uapi/linux/libc-compat.h
> index 44b8a6b..c316725 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -171,42 +171,94 @@
>  #else /* !defined(__GLIBC__) */

Would you please update the lead comment in libc-compat.h explaining this usage
model so glibc and other libc's can follow best practice. Any new usage model
should express how to fix header inclusion ordering in both directions.

>  /* Definitions for if.h */
> +#if !defined(__UAPI_DEF_IF_IFCONF)

Typo prone. Please use #if __UAPI_DEF_IF_IFCONF for all of this.

>  #define __UAPI_DEF_IF_IFCONF 1
> +#endif
> +#if !defined(__UAPI_DEF_IF_IFMAP)
>  #define __UAPI_DEF_IF_IFMAP 1
> +#endif
> +#if !defined(__UAPI_DEF_IFNAMSIZ)
>  #define __UAPI_DEF_IF_IFNAMSIZ 1
> +#endif
> +#if !defined(__UAPI_DEF_IFREQ)
>  #define __UAPI_DEF_IF_IFREQ 1
> +#endif
>  /* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
> +#if !defined(__UAPI_DEF_IF_NET_DEVICE_FLAGS)
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS 1
> +#endif
>  /* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> +#if !defined(__UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO)
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> +#endif
>  
>  /* Definitions for in.h */
> +#if !defined(__UAPI_DEF_IN_ADDR)
>  #define __UAPI_DEF_IN_ADDR   1
> +#endif
> +#if !defined(__UAPI_DEF_IN_IPPROTO)
>  #define __UAPI_DEF_IN_IPPROTO1
> +#endif
> +#if !defined(__UAPI_DEF_IN_PKTINFO)
>  #define __UAPI_DEF_IN_PKTINFO1
> +#endif
> +#if !defined(__UAPI_DEF_IP_MREQ)
>  #define __UAPI_DEF_IP_MREQ   1
> +#endif
> +#if !defined(__UAPI_DEF_SOCKADDR_IN)
>  #define __UAPI_DEF_SOCKADDR_IN   1
> +#endif
> +#if !defined(__UAPI_DEF_IN_CLASS)
>  #define __UAPI_DEF_IN_CLASS  1
> +#endif
>  
>  /* Definitions for in6.h */
> +#if !defined(__UAPI_DEF_IN6_ADDR)
>  #define __UAPI_DEF_IN6_ADDR  1
> +#endif
> +#if !defined(__UAPI_DEF_IN6_ADDR_ALT)
>  #define __UAPI_DEF_IN6_ADDR_ALT  1
> +#endif
> +#if !defined(__UAPI_DEF_SOCKADDR_IN6)
>  #define __UAPI_DEF_SOCKADDR_IN6  1
> +#endif
> +#if !defined(__UAPI_DEF_IPV6_MREQ)
>  #define __UAPI_DEF_IPV6_MREQ 1
> +#endif
> +#if !defined(__UAPI_DEF_IPPROTO_V6)
>  #define __UAPI_DEF_IPPROTO_V61
> +#endif
> +#if !defined(__UAPI_DEF_IPV6_OPTIONS)
>  #define __UAPI_DEF_IPV6_OPTIONS  1
> +#endif
> +#if !defined(__UAPI_DEF_IN6_PKTINFO)
>  #define __UAPI_DEF_IN6_PKTINFO   1
> +#endif
> +#if !defined(__UAPI_DEF_IP6_MTUINFO)
>  #define __UAPI_DEF_IP6_MTUINFO   1
> +#endif
>  
>  /* Definitions for ipx.h */
> +#if !defined(__UAPI_DEF_SOCKADDR_IPX)
>  #define __UAPI_DEF_SOCKADDR_IPX  1
> +#endif
> +#if

Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Carlos O'Donell

On 11/11/2016 07:08 AM, Felix Janda wrote:
> Currently, libc-compat.h detects inclusion of specific glibc headers,
> and defines corresponding _UAPI_DEF_* macros, which in turn are used in
> uapi headers to prevent definition of conflicting structures/constants.
> There is no such detection for other c libraries, for them the
> _UAPI_DEF_* macros are always defined as 1, and so none of the possibly
> conflicting definitions are suppressed.
> 
> This patch enables non-glibc c libraries to request the suppression of
> any specific interface by defining the corresponding _UAPI_DEF_* macro
> as 0.
> 
> This patch together with the recent musl libc commit
> 
> http://git.musl-libc.org/cgit/musl/commit/?id=04983f2272382af92eb8f8838964ff944fbb8258

Would it be possible to amend the musl patch to define the macros to 1.

A defined or undefined macro is typo prone.

Please see this wiki for examples of the problems it causes:
https://sourceware.org/glibc/wiki/Wundef

Having the UAPI macros always defined and be 0 or 1 allows you to create
tooling to check for problems, while an undefined macro is just that,
either a mistake or the intended value.

> fixes the following compiler errors when  is included
> after musl :
> 
> ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
> ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
> ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'

Do you have plans for fixing the error when the inclusion order is the other 
way?

It will require kernel header changes to have each individual kernel header 
define
the requisite __UAPI_* values to 1 e.g. moving libc-compat.h into the header.

Do you plan to do that in a next step?

> Signed-off-by: Felix Janda 
> ---
> The previous mail misspelled the kernel mailing list. I am sorry for
> this resulting spam.
> 
> There has already been one reply, which is available at
> 
> http://www.openwall.com/lists/musl/2016/11/11/2
> ---
>  include/uapi/linux/libc-compat.h | 52 
> 
>  1 file changed, 52 insertions(+)
> 
> diff --git a/include/uapi/linux/libc-compat.h 
> b/include/uapi/linux/libc-compat.h
> index 44b8a6b..c316725 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -171,42 +171,94 @@
>  #else /* !defined(__GLIBC__) */

Would you please update the lead comment in libc-compat.h explaining this usage
model so glibc and other libc's can follow best practice. Any new usage model
should express how to fix header inclusion ordering in both directions.

>  /* Definitions for if.h */
> +#if !defined(__UAPI_DEF_IF_IFCONF)

Typo prone. Please use #if __UAPI_DEF_IF_IFCONF for all of this.

>  #define __UAPI_DEF_IF_IFCONF 1
> +#endif
> +#if !defined(__UAPI_DEF_IF_IFMAP)
>  #define __UAPI_DEF_IF_IFMAP 1
> +#endif
> +#if !defined(__UAPI_DEF_IFNAMSIZ)
>  #define __UAPI_DEF_IF_IFNAMSIZ 1
> +#endif
> +#if !defined(__UAPI_DEF_IFREQ)
>  #define __UAPI_DEF_IF_IFREQ 1
> +#endif
>  /* Everything up to IFF_DYNAMIC, matches net/if.h until glibc 2.23 */
> +#if !defined(__UAPI_DEF_IF_NET_DEVICE_FLAGS)
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS 1
> +#endif
>  /* For the future if glibc adds IFF_LOWER_UP, IFF_DORMANT and IFF_ECHO */
> +#if !defined(__UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO)
>  #define __UAPI_DEF_IF_NET_DEVICE_FLAGS_LOWER_UP_DORMANT_ECHO 1
> +#endif
>  
>  /* Definitions for in.h */
> +#if !defined(__UAPI_DEF_IN_ADDR)
>  #define __UAPI_DEF_IN_ADDR   1
> +#endif
> +#if !defined(__UAPI_DEF_IN_IPPROTO)
>  #define __UAPI_DEF_IN_IPPROTO1
> +#endif
> +#if !defined(__UAPI_DEF_IN_PKTINFO)
>  #define __UAPI_DEF_IN_PKTINFO1
> +#endif
> +#if !defined(__UAPI_DEF_IP_MREQ)
>  #define __UAPI_DEF_IP_MREQ   1
> +#endif
> +#if !defined(__UAPI_DEF_SOCKADDR_IN)
>  #define __UAPI_DEF_SOCKADDR_IN   1
> +#endif
> +#if !defined(__UAPI_DEF_IN_CLASS)
>  #define __UAPI_DEF_IN_CLASS  1
> +#endif
>  
>  /* Definitions for in6.h */
> +#if !defined(__UAPI_DEF_IN6_ADDR)
>  #define __UAPI_DEF_IN6_ADDR  1
> +#endif
> +#if !defined(__UAPI_DEF_IN6_ADDR_ALT)
>  #define __UAPI_DEF_IN6_ADDR_ALT  1
> +#endif
> +#if !defined(__UAPI_DEF_SOCKADDR_IN6)
>  #define __UAPI_DEF_SOCKADDR_IN6  1
> +#endif
> +#if !defined(__UAPI_DEF_IPV6_MREQ)
>  #define __UAPI_DEF_IPV6_MREQ 1
> +#endif
> +#if !defined(__UAPI_DEF_IPPROTO_V6)
>  #define __UAPI_DEF_IPPROTO_V61
> +#endif
> +#if !defined(__UAPI_DEF_IPV6_OPTIONS)
>  #define __UAPI_DEF_IPV6_OPTIONS  1
> +#endif
> +#if !defined(__UAPI_DEF_IN6_PKTINFO)
>  #define __UAPI_DEF_IN6_PKTINFO   1
> +#endif
> +#if !defined(__UAPI_DEF_IP6_MTUINFO)
>  #define __UAPI_DEF_IP6_MTUINFO   1
> +#endif
>  
>  /* Definitions for ipx.h */
> +#if !defined(__UAPI_DEF_SOCKADDR_IPX)
>  #define __UAPI_DEF_SOCKADDR_IPX  1
> +#endif
> +#if

Re: opaque types instead of union epoll_data

2017-03-07 Thread Carlos O'Donell

On 03/07/2017 07:31 AM, Sodagudi Prasad wrote:
> uapi structs epoll_data are more opaque than user space expects.
> kernel have defined as __u64 instead of the union epoll_data. Because
> of this libc have redefined struct epoll_event with union data
> member.

We do the same in glibc.

> https://android.googlesource.com/platform/bionic.git/+/master/libc/include/sys/epoll.h
> typedef union epoll_data {
>   void* ptr;
>   int fd;
>   uint32_t u32;
>   uint64_t u64;
> } epoll_data_t;
> struct epoll_event {
>   uint32_t events;
>   epoll_data_t data;
> }
> 
> Kernel UAPI header defined as "include/uapi/linux/eventpoll.h"
> struct epoll_event {
> __u32 events;
> __u64 data;  =>opaque type instead of union epoll_data
> } EPOLL_PACKED;
> 
> 
> Because of this we are landing into some issues as we copy kernel
> headers. Will it be going to be addressed?

What issues are you having? 

Exactly what problems are you running in to? 

Using all of the UAPI headers directly is not a workable
solution, there is a lot of definition and cleanup work to do there.
Some of the headers are immediately useful, but others, as you note
require quite a bit of work to become useful.

The underlying issue of a mismatch between userspace expectations
and UAPI definitions will get addressed when someone, maybe you, 
works diligently to enhance the kernel UAPI headers to be generally 
useful to userspace.

I don't know anyone else who is working on this problem. Though I
have a vested interested in it as a glibc maintainer, since it would
be nice to avoid duplicate headers where possible between the kernel
and userspace.

-- 
Cheers,
Carlos.

Re: opaque types instead of union epoll_data

2017-03-07 Thread Carlos O'Donell

On 03/07/2017 07:31 AM, Sodagudi Prasad wrote:
> uapi structs epoll_data are more opaque than user space expects.
> kernel have defined as __u64 instead of the union epoll_data. Because
> of this libc have redefined struct epoll_event with union data
> member.

We do the same in glibc.

> https://android.googlesource.com/platform/bionic.git/+/master/libc/include/sys/epoll.h
> typedef union epoll_data {
>   void* ptr;
>   int fd;
>   uint32_t u32;
>   uint64_t u64;
> } epoll_data_t;
> struct epoll_event {
>   uint32_t events;
>   epoll_data_t data;
> }
> 
> Kernel UAPI header defined as "include/uapi/linux/eventpoll.h"
> struct epoll_event {
> __u32 events;
> __u64 data;  =>opaque type instead of union epoll_data
> } EPOLL_PACKED;
> 
> 
> Because of this we are landing into some issues as we copy kernel
> headers. Will it be going to be addressed?

What issues are you having? 

Exactly what problems are you running in to? 

Using all of the UAPI headers directly is not a workable
solution, there is a lot of definition and cleanup work to do there.
Some of the headers are immediately useful, but others, as you note
require quite a bit of work to become useful.

The underlying issue of a mismatch between userspace expectations
and UAPI definitions will get addressed when someone, maybe you, 
works diligently to enhance the kernel UAPI headers to be generally 
useful to userspace.

I don't know anyone else who is working on this problem. Though I
have a vested interested in it as a glibc maintainer, since it would
be nice to avoid duplicate headers where possible between the kernel
and userspace.

-- 
Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-06 Thread Carlos O'Donell

On Fri, Mar 3, 2017 at 8:23 PM, Carlos O'Donell <car...@systemhalted.org> wrote:
> On Thu, Mar 2, 2017 at 10:48 AM, Dmitry V. Levin <l...@altlinux.org> wrote:
>> On Thu, Mar 02, 2017 at 10:22:18AM -0500, Carlos O'Donell wrote:
>>> On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann <a...@arndb.de> wrote:
>>> > On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin <l...@altlinux.org> 
>>> > wrote:
>>> >> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>>> >> userspace compilation errors like this:
>>> >>
>>> >> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>>> >>   size_t ss_size;
>>> >>
>>> >> As no uapi header provides a definition of size_t, inclusion
>>> >> of  seems to be the most conservative fix available.
>> [...]
>>> > I'm not sure if this is the best fix. We generally should not include one
>>> > standard header from another standard header. Would it be possible
>>> > to use __kernel_size_t instead of size_t?
>>>
>>> In glibc we handle this with special use of __need_size_t with GCC's
>>> provided stddef.h.
>>>
>>> For example glibc's signal.h does this:
>>>
>>> # define __need_size_t
>>> # include 
>>
>> Just to make it clear, do you suggest this approach for asm/signal.h as well?

The best practice from the glibc community looks like this:

(a) Create a bits/types/*.h header for the type you need.

e.g.
./time/bits/types/struct_timeval.h
./time/bits/types/struct_itimerspec.h
./time/bits/types/time_t.h
./time/bits/types/struct_timespec.h
./time/bits/types/struct_tm.h
./time/bits/types/clockid_t.h
./time/bits/types/clock_t.h
./time/bits/types/timer_t.h

(b) If neccessary the bits/types/*.h header is a wrapper:
~~~
#define __need_size_t
#include 
~~~
to get access to the compiler provided type.

This way all of the code you need simplifies to includes for the types you need.

e.g.

time/sys/time.h:
...
#include 
#include 
#include 
...

This is what we've been doing in glibc starting last September as we
cleaned up all the convoluted conditional logic to get the types we
needed in the headers that needed them.

Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-06 Thread Carlos O'Donell

On Fri, Mar 3, 2017 at 8:23 PM, Carlos O'Donell  wrote:
> On Thu, Mar 2, 2017 at 10:48 AM, Dmitry V. Levin  wrote:
>> On Thu, Mar 02, 2017 at 10:22:18AM -0500, Carlos O'Donell wrote:
>>> On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann  wrote:
>>> > On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin  
>>> > wrote:
>>> >> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>>> >> userspace compilation errors like this:
>>> >>
>>> >> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>>> >>   size_t ss_size;
>>> >>
>>> >> As no uapi header provides a definition of size_t, inclusion
>>> >> of  seems to be the most conservative fix available.
>> [...]
>>> > I'm not sure if this is the best fix. We generally should not include one
>>> > standard header from another standard header. Would it be possible
>>> > to use __kernel_size_t instead of size_t?
>>>
>>> In glibc we handle this with special use of __need_size_t with GCC's
>>> provided stddef.h.
>>>
>>> For example glibc's signal.h does this:
>>>
>>> # define __need_size_t
>>> # include 
>>
>> Just to make it clear, do you suggest this approach for asm/signal.h as well?

The best practice from the glibc community looks like this:

(a) Create a bits/types/*.h header for the type you need.

e.g.
./time/bits/types/struct_timeval.h
./time/bits/types/struct_itimerspec.h
./time/bits/types/time_t.h
./time/bits/types/struct_timespec.h
./time/bits/types/struct_tm.h
./time/bits/types/clockid_t.h
./time/bits/types/clock_t.h
./time/bits/types/timer_t.h

(b) If neccessary the bits/types/*.h header is a wrapper:
~~~
#define __need_size_t
#include 
~~~
to get access to the compiler provided type.

This way all of the code you need simplifies to includes for the types you need.

e.g.

time/sys/time.h:
...
#include 
#include 
#include 
...

This is what we've been doing in glibc starting last September as we
cleaned up all the convoluted conditional logic to get the types we
needed in the headers that needed them.

Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-03 Thread Carlos O'Donell

On Thu, Mar 2, 2017 at 10:48 AM, Dmitry V. Levin <l...@altlinux.org> wrote:
> On Thu, Mar 02, 2017 at 10:22:18AM -0500, Carlos O'Donell wrote:
>> On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann <a...@arndb.de> wrote:
>> > On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin <l...@altlinux.org> wrote:
>> >> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>> >> userspace compilation errors like this:
>> >>
>> >> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>> >>   size_t ss_size;
>> >>
>> >> As no uapi header provides a definition of size_t, inclusion
>> >> of  seems to be the most conservative fix available.
> [...]
>> > I'm not sure if this is the best fix. We generally should not include one
>> > standard header from another standard header. Would it be possible
>> > to use __kernel_size_t instead of size_t?
>>
>> In glibc we handle this with special use of __need_size_t with GCC's
>> provided stddef.h.
>>
>> For example glibc's signal.h does this:
>>
>> # define __need_size_t
>> # include 
>
> Just to make it clear, do you suggest this approach for asm/signal.h as well?

The kernel is duplicating userspace headers in the UAPI implementation
and running into exactly the same problems we have already solved in
userspace. We currently have no better solution than the "__need_*"
interface for avoiding the duplication of the type definitions and the
problems that come with that.

I am taking this up with senior glibc developers on libc-alpha to see
if we have a better suggestion. If you give me 72 hours I'll either
have a better suggestion or the acknowledgement that my suggestion is
the best practical solution we have.

Note that in a GNU userspace stddef.h here comes from gcc, which
completes the implementation of the C development environment that
provides the types you need. The use of "__need_size_t" is a collusion
between the components of the implementation and use by the Linux
kernel would mean expecting something specific to a GNU
implementation.

I might suggest you use include/uapi/linux/libc-compat.h in an attempt
to abstract away the GNU-specific magic for getting just size_t from
stddef.h. That way you have documented the places that other runtime
authors need to fill out for things to work.

Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-03 Thread Carlos O'Donell

On Thu, Mar 2, 2017 at 10:48 AM, Dmitry V. Levin  wrote:
> On Thu, Mar 02, 2017 at 10:22:18AM -0500, Carlos O'Donell wrote:
>> On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann  wrote:
>> > On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin  wrote:
>> >> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>> >> userspace compilation errors like this:
>> >>
>> >> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>> >>   size_t ss_size;
>> >>
>> >> As no uapi header provides a definition of size_t, inclusion
>> >> of  seems to be the most conservative fix available.
> [...]
>> > I'm not sure if this is the best fix. We generally should not include one
>> > standard header from another standard header. Would it be possible
>> > to use __kernel_size_t instead of size_t?
>>
>> In glibc we handle this with special use of __need_size_t with GCC's
>> provided stddef.h.
>>
>> For example glibc's signal.h does this:
>>
>> # define __need_size_t
>> # include 
>
> Just to make it clear, do you suggest this approach for asm/signal.h as well?

The kernel is duplicating userspace headers in the UAPI implementation
and running into exactly the same problems we have already solved in
userspace. We currently have no better solution than the "__need_*"
interface for avoiding the duplication of the type definitions and the
problems that come with that.

I am taking this up with senior glibc developers on libc-alpha to see
if we have a better suggestion. If you give me 72 hours I'll either
have a better suggestion or the acknowledgement that my suggestion is
the best practical solution we have.

Note that in a GNU userspace stddef.h here comes from gcc, which
completes the implementation of the C development environment that
provides the types you need. The use of "__need_size_t" is a collusion
between the components of the implementation and use by the Linux
kernel would mean expecting something specific to a GNU
implementation.

I might suggest you use include/uapi/linux/libc-compat.h in an attempt
to abstract away the GNU-specific magic for getting just size_t from
stddef.h. That way you have documented the places that other runtime
authors need to fill out for things to work.

Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-02 Thread Carlos O'Donell

On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann  wrote:
> On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin  wrote:
>> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>> userspace compilation errors like this:
>>
>> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>>   size_t ss_size;
>>
>> As no uapi header provides a definition of size_t, inclusion
>> of  seems to be the most conservative fix available.
>>
>> On the kernel side size_t is typedef'ed to __kernel_size_t, so
>> an alternative fix would be to change the type of sigaltstack.ss_size
>> from size_t to __kernel_size_t for all architectures except those where
>> sizeof(size_t) < sizeof(__kernel_size_t), namely, x32 and mips n32.
>>
>> On x32 and mips n32, however, #include  seems to be the most
>> straightforward way to obtain the definition for sigaltstack.ss_size's
>> type.
>>
>> Signed-off-by: Dmitry V. Levin 
>
> I'm not sure if this is the best fix. We generally should not include one
> standard header from another standard header. Would it be possible
> to use __kernel_size_t instead of size_t?

In glibc we handle this with special use of __need_size_t with GCC's
provided stddef.h.

For example glibc's signal.h does this:

# define __need_size_t
# include 

And...

/* Any one of these symbols __need_* means that GNU libc
   wants us just to define one data type.  So don't define
   the symbols that indicate this file's entire job has been done.  */
#if (!defined(__need_wchar_t) && !defined(__need_size_t)\
 && !defined(__need_ptrdiff_t) && !defined(__need_NULL) \
 && !defined(__need_wint_t))

The idea being that the type you want is really defined by stddef.h,
but you just one the one type.

Changing the fundamental type causes the issues you see in patch v2
where sizeof(size_t) < sizeof(__kernel_size_t).

It will only lead to problem substituting the wrong type.

Cheers,
Carlos.

Re: [PATCH] uapi: fix asm/signal.h userspace compilation errors

2017-03-02 Thread Carlos O'Donell

On Wed, Mar 1, 2017 at 11:20 AM, Arnd Bergmann  wrote:
> On Sun, Feb 26, 2017 at 2:01 AM, Dmitry V. Levin  wrote:
>> Include  (guarded by #ifndef __KERNEL__) to fix asm/signal.h
>> userspace compilation errors like this:
>>
>> /usr/include/asm/signal.h:126:2: error: unknown type name 'size_t'
>>   size_t ss_size;
>>
>> As no uapi header provides a definition of size_t, inclusion
>> of  seems to be the most conservative fix available.
>>
>> On the kernel side size_t is typedef'ed to __kernel_size_t, so
>> an alternative fix would be to change the type of sigaltstack.ss_size
>> from size_t to __kernel_size_t for all architectures except those where
>> sizeof(size_t) < sizeof(__kernel_size_t), namely, x32 and mips n32.
>>
>> On x32 and mips n32, however, #include  seems to be the most
>> straightforward way to obtain the definition for sigaltstack.ss_size's
>> type.
>>
>> Signed-off-by: Dmitry V. Levin 
>
> I'm not sure if this is the best fix. We generally should not include one
> standard header from another standard header. Would it be possible
> to use __kernel_size_t instead of size_t?

In glibc we handle this with special use of __need_size_t with GCC's
provided stddef.h.

For example glibc's signal.h does this:

# define __need_size_t
# include 

And...

/* Any one of these symbols __need_* means that GNU libc
   wants us just to define one data type.  So don't define
   the symbols that indicate this file's entire job has been done.  */
#if (!defined(__need_wchar_t) && !defined(__need_size_t)\
 && !defined(__need_ptrdiff_t) && !defined(__need_NULL) \
 && !defined(__need_wint_t))

The idea being that the type you want is really defined by stddef.h,
but you just one the one type.

Changing the fundamental type causes the issues you see in patch v2
where sizeof(size_t) < sizeof(__kernel_size_t).

It will only lead to problem substituting the wrong type.

Cheers,
Carlos.

Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-28 Thread Carlos O'Donell

On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>  wrote:
>>> This patch introduces new rlimit resource to manage maximum virtual
>>> address available to userspace to map.
>>>
>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>> Not all user space is ready to handle wide addresses. It's known that
>>> at least some JIT compilers use high bit in pointers to encode their
>>> information. It collides with valid pointers with 5-level paging and
>>> leads to crashes.
>>>
>>> The patch aims to address this compatibility issue.
>>>
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>>
>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>> TASK_SIZE limits available address space.
>>>
>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>> with 5-level paging enabled. In this case, soft limit would be
>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>> paging which known to be safe
>>>
>>> New rlimit resource would follow usual semantics with regards to
>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>> break application if limits set too wide or too narrow, but this is not
>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>
>>> As with other resources you can set the limit lower than current usage.
>>> It would affect only future virtual address space allocations.
>>>
>>> Use-cases for new rlimit:
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>> its children to use addresses above 47-bits.
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>> exec(2) allows the child to use addresses above 47-bits.
>>>
>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>> its children to use addresses above 47-bits, unless a process has
>>> CAP_SYS_RESOURCES.
>>>
>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>> address. User-mode emulation in QEMU may lower the limit to 32-bit
>>> to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
> 
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.

Could you clarify what you mean by an "ELF flag?"

-- 
Cheers,
Carlos.

Re: [RFC, PATCHv2 29/29] mm, x86: introduce RLIMIT_VADDR

2016-12-28 Thread Carlos O'Donell

On 12/26/2016 09:24 PM, Kirill A. Shutemov wrote:
> On Mon, Dec 26, 2016 at 06:06:01PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 26, 2016 at 5:54 PM, Kirill A. Shutemov
>>  wrote:
>>> This patch introduces new rlimit resource to manage maximum virtual
>>> address available to userspace to map.
>>>
>>> On x86, 5-level paging enables 56-bit userspace virtual address space.
>>> Not all user space is ready to handle wide addresses. It's known that
>>> at least some JIT compilers use high bit in pointers to encode their
>>> information. It collides with valid pointers with 5-level paging and
>>> leads to crashes.
>>>
>>> The patch aims to address this compatibility issue.
>>>
>>> MM would use min(RLIMIT_VADDR, TASK_SIZE) as upper limit of virtual
>>> address available to map by userspace.
>>>
>>> The default hard limit will be RLIM_INFINITY, which basically means that
>>> TASK_SIZE limits available address space.
>>>
>>> The soft limit will also be RLIM_INFINITY everywhere, but the machine
>>> with 5-level paging enabled. In this case, soft limit would be
>>> (1UL << 47) - PAGE_SIZE. It’s current x86-64 TASK_SIZE_MAX with 4-level
>>> paging which known to be safe
>>>
>>> New rlimit resource would follow usual semantics with regards to
>>> inheritance: preserved on fork(2) and exec(2). This has potential to
>>> break application if limits set too wide or too narrow, but this is not
>>> uncommon for other resources (consider RLIMIT_DATA or RLIMIT_AS).
>>>
>>> As with other resources you can set the limit lower than current usage.
>>> It would affect only future virtual address space allocations.
>>>
>>> Use-cases for new rlimit:
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY, allows current process all
>>> its children to use addresses above 47-bits.
>>>
>>>   - Bumping the soft limit to RLIM_INFINITY after fork(2), but before
>>> exec(2) allows the child to use addresses above 47-bits.
>>>
>>>   - Lowering the hard limit to 47-bits would prevent current process all
>>> its children to use addresses above 47-bits, unless a process has
>>> CAP_SYS_RESOURCES.
>>>
>>>   - It’s also can be handy to lower hard or soft limit to arbitrary
>>> address. User-mode emulation in QEMU may lower the limit to 32-bit
>>> to emulate 32-bit machine on 64-bit host.
>>
>> I tend to think that this should be a personality or an ELF flag, not
>> an rlimit.
> 
> My plan was to implement ELF flag on top. Basically, ELF flag would mean
> that we bump soft limit to hard limit on exec.

Could you clarify what you mean by an "ELF flag?"

-- 
Cheers,
Carlos.

Re: [PATCH] binfmt_misc: allow selecting the interpreter based on xattr keywords

2016-08-26 Thread Carlos O'Donell

On 08/26/2016 10:55 AM, Florian Weimer wrote:
> On 08/25/2016 06:15 PM, James Bottomley wrote:
>> On Sun, 2016-08-21 at 21:01 -0700, Josh Max wrote:
>>> This patch allows binfmt_misc to select the interpeter for
>>> arbitrary binaries by comparing a specified registered keyword
>>> with the value of a specified binary's extended attribute
>>> (user.binfmt.interp), and then launching the program with the
>>> registered interpreter.
>>> 
>>> This is useful when wanting to launch a collection of binaries
>>> under the same interpreter, even when they do not necessarily
>>> share a common extension or magic bits, or when their magic
>>> conflics with the operation of binfmt_elf. Some examples of its
>>> use would be to launch some executables of various different
>>> architectures in a directory, or for running some native binaries
>>> under a sandbox (like firejail) automatically during their
>>> launch.
>> 
>> Could you expand on the use cases?

Likewise, I would also like to hear about the intended use cases.

> A non-security use case would be to run the binary (without
> modification) with a different ELF interpreter (assuming this allows
> to override binfmt_elf, but self-sandboxing would need that as well).
> This would make it easier to use older or newer libcs for select
> binaries on the system.  Right now, one has to write wrappers for
> that, and the explicit dynamic linker invocation is not completely
> transparent to the application.

I think this is a slightly contrived use case. Normally you would just
use patchelf, or build the binary with a specific dynamic loader e.g.
-Wl,--dynamic-linker=file. What is restricting the use case from modifying
the binary or running under the new loader? Or to put it another way,
what requires the interpreter selection to be fully dynamic?

Why isn't the answer: Use ELF everywhere and specify the interpreter
you actually want? Likewise if you know you want to one-off run a native
binary in a sandbox or alternate loader then use a userspace tool to do
that?

Does the convenience gained justify the complexity we now have to explain
to our users "Oh, and watch out for the xattr keywords that change how
your application is started?" I'm doubtful. I haven't seen enough use
cases here to justify a fully dynamic interpreter selection.

It feels like a better use of our energy would be to make exec by the
kernel and exec by the dynamic loader look as similar as possible so you
can use one or the other without the application knowing the difference.

My worry is that this is a new knob for something we have previously
all expected to be a part of the binaries static definition.
What impact will have on security validation? Is the binary running
with the intended runtime? I assume that setting the xattr keyword 
requires write for the file in question, and that no capabilities
mismatch means we could set the keyword but have less than write
permissions on the file.

There would also be a non-zero performance cost to this and it would
be borne by all ELF/misc binaries at startup for what is a debugging
feature. This assumes we extend binfmt_elf to look for the same xattr
keyword to override the loader.

This ignores the fact that the alternate loader also needs to have
it's own ldconfig cache, implementation-dependent lookup paths etc,
all of which have to be keyed off the xattr keyword specified dynamic
loader. All of that is tractable though and can be done in userspace
keyed from the selected dynamic loader. Buy why? Why not use a mount
namespace and different loader e.g. lxc, docker, etc, or specify a
loader that is a wrapper and does this for you?

I'm not convinced this is a good idea, but I'm open to learning about
more use cases.

-- 
Cheers,
Carlos.

Re: [PATCH] binfmt_misc: allow selecting the interpreter based on xattr keywords

2016-08-26 Thread Carlos O'Donell

On 08/26/2016 10:55 AM, Florian Weimer wrote:
> On 08/25/2016 06:15 PM, James Bottomley wrote:
>> On Sun, 2016-08-21 at 21:01 -0700, Josh Max wrote:
>>> This patch allows binfmt_misc to select the interpeter for
>>> arbitrary binaries by comparing a specified registered keyword
>>> with the value of a specified binary's extended attribute
>>> (user.binfmt.interp), and then launching the program with the
>>> registered interpreter.
>>> 
>>> This is useful when wanting to launch a collection of binaries
>>> under the same interpreter, even when they do not necessarily
>>> share a common extension or magic bits, or when their magic
>>> conflics with the operation of binfmt_elf. Some examples of its
>>> use would be to launch some executables of various different
>>> architectures in a directory, or for running some native binaries
>>> under a sandbox (like firejail) automatically during their
>>> launch.
>> 
>> Could you expand on the use cases?

Likewise, I would also like to hear about the intended use cases.

> A non-security use case would be to run the binary (without
> modification) with a different ELF interpreter (assuming this allows
> to override binfmt_elf, but self-sandboxing would need that as well).
> This would make it easier to use older or newer libcs for select
> binaries on the system.  Right now, one has to write wrappers for
> that, and the explicit dynamic linker invocation is not completely
> transparent to the application.

I think this is a slightly contrived use case. Normally you would just
use patchelf, or build the binary with a specific dynamic loader e.g.
-Wl,--dynamic-linker=file. What is restricting the use case from modifying
the binary or running under the new loader? Or to put it another way,
what requires the interpreter selection to be fully dynamic?

Why isn't the answer: Use ELF everywhere and specify the interpreter
you actually want? Likewise if you know you want to one-off run a native
binary in a sandbox or alternate loader then use a userspace tool to do
that?

Does the convenience gained justify the complexity we now have to explain
to our users "Oh, and watch out for the xattr keywords that change how
your application is started?" I'm doubtful. I haven't seen enough use
cases here to justify a fully dynamic interpreter selection.

It feels like a better use of our energy would be to make exec by the
kernel and exec by the dynamic loader look as similar as possible so you
can use one or the other without the application knowing the difference.

My worry is that this is a new knob for something we have previously
all expected to be a part of the binaries static definition.
What impact will have on security validation? Is the binary running
with the intended runtime? I assume that setting the xattr keyword 
requires write for the file in question, and that no capabilities
mismatch means we could set the keyword but have less than write
permissions on the file.

There would also be a non-zero performance cost to this and it would
be borne by all ELF/misc binaries at startup for what is a debugging
feature. This assumes we extend binfmt_elf to look for the same xattr
keyword to override the loader.

This ignores the fact that the alternate loader also needs to have
it's own ldconfig cache, implementation-dependent lookup paths etc,
all of which have to be keyed off the xattr keyword specified dynamic
loader. All of that is tractable though and can be done in userspace
keyed from the selected dynamic loader. Buy why? Why not use a mount
namespace and different loader e.g. lxc, docker, etc, or specify a
loader that is a wrapper and does this for you?

I'm not convinced this is a good idea, but I'm open to learning about
more use cases.

-- 
Cheers,
Carlos.

Re: [RFC patch 4/7] futex: Add support for attached futexes

2016-04-05 Thread Carlos O'Donell

On 04/03/2016 07:30 AM, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar  wrote:
>>
>> So an ABI distinction and offloading the decision to every single 
>> application that
>> wants to use it and hardcode it into actual application source code via an 
>> ABI is
>> pretty much the _WORST_ way to go about it IMHO...
>>
>> So how about this: don't add any ABI details, but make futexes auto-attached 
>> on
>> NUMA systems (and obviously PREEMPT_RT systems)?
> 
> I agree.
> 
> Do *not* make this a visible new ABI.

Agreed.

We had similar requests in glibc to add APIs to tweak the parameters of
the elision for locks backed by hardware transactional memory.

The person submitting the patches always thinks this is a great API
because it allows them to write tests to verify their own work (which
means it still might be useful for internal testing or developing
auto-tuning).

Users have no clue what to do with the API and worse the state space of
the parameters is immense. You can't possibly do any kind of sensible
optimization without knowing a lot about the hardware.

So no public API was ever added in glibc for pthread_mutex_lock elision
parameters. Either the parameters work by default or you have to post
patches to change the auto-tuning used internally in glibc.

> So automatically using a local hashtable (for private mutexes - I
> think people need to just accept that a shared mutex is more costly)
> according to some heuristic is definitely the way to go. And yes, the
> heuristic may be well be - at least to start - "this is a preempt-RT
> system" (for people who clearly care about having predictable
> latencies) or "this is actually a multi-node NUMA system, and I have
> heaps of memory".

Agreed.

> Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Agreed.

-- 
Cheers,
Carlos.

Re: [RFC patch 4/7] futex: Add support for attached futexes

2016-04-05 Thread Carlos O'Donell

On 04/03/2016 07:30 AM, Linus Torvalds wrote:
> On Sun, Apr 3, 2016 at 6:16 AM, Ingo Molnar  wrote:
>>
>> So an ABI distinction and offloading the decision to every single 
>> application that
>> wants to use it and hardcode it into actual application source code via an 
>> ABI is
>> pretty much the _WORST_ way to go about it IMHO...
>>
>> So how about this: don't add any ABI details, but make futexes auto-attached 
>> on
>> NUMA systems (and obviously PREEMPT_RT systems)?
> 
> I agree.
> 
> Do *not* make this a visible new ABI.

Agreed.

We had similar requests in glibc to add APIs to tweak the parameters of
the elision for locks backed by hardware transactional memory.

The person submitting the patches always thinks this is a great API
because it allows them to write tests to verify their own work (which
means it still might be useful for internal testing or developing
auto-tuning).

Users have no clue what to do with the API and worse the state space of
the parameters is immense. You can't possibly do any kind of sensible
optimization without knowing a lot about the hardware.

So no public API was ever added in glibc for pthread_mutex_lock elision
parameters. Either the parameters work by default or you have to post
patches to change the auto-tuning used internally in glibc.

> So automatically using a local hashtable (for private mutexes - I
> think people need to just accept that a shared mutex is more costly)
> according to some heuristic is definitely the way to go. And yes, the
> heuristic may be well be - at least to start - "this is a preempt-RT
> system" (for people who clearly care about having predictable
> latencies) or "this is actually a multi-node NUMA system, and I have
> heaps of memory".

Agreed.

> Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Agreed.

-- 
Cheers,
Carlos.

Re: [PATCH v2 1/2] posix-timers: Prevents overrun counter overflow

2015-01-24 Thread Carlos O'Donell

On 01/24/2015 12:48 PM, Daniel Church wrote:
> +#define DELAYTIMER_MAX_DEFAULT 100

Why did you chose 1 million?

Have you reviewed what constant userspace, particularly glibc, is using?

Cheers,
Carlos.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 2/2] posix-timers: Exposes DELAYTIMER_MAX constant used to govern overruns

2015-01-24 Thread Carlos O'Donell

On 01/24/2015 02:20 PM, Thomas Gleixner wrote:
> On Sat, 24 Jan 2015, Daniel Church wrote:
> 
>> POSIX.1-2001 specification of timer_getoverrun() supports constant
>> DELAYTIMER_MAX which prevents overflow and caps overrun count.  Exposes
>> delaytimer_max value to userland via /proc/sys/kernel/delaytimer_max such
> 
> I know that you try to match the posix name, but when looking at the
> prctl w/o knowing about DELAYTIMER_MAX, it's non intuitive.
> 
> Something like posixtimer_max_overruns or if you insist on the
> delaytimer_max part, then let us prepend it with posixtimer_ at least.
> 
> New prctls require an update of Documentation/sysctl/kernel.txt.

Agreed, the name should be self-explanatory and point at POSIX.

The documentation should be detailed and explain exactly what this
is for, what are valid values, and why you would ever want to change
them.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/2] posix-timers: Prevents overrun counter overflow

2015-01-24 Thread Carlos O'Donell

On 01/24/2015 12:48 PM, Daniel Church wrote:
 +#define DELAYTIMER_MAX_DEFAULT 100

Why did you chose 1 million?

Have you reviewed what constant userspace, particularly glibc, is using?

Cheers,
Carlos.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 2/2] posix-timers: Exposes DELAYTIMER_MAX constant used to govern overruns

2015-01-24 Thread Carlos O'Donell

On 01/24/2015 02:20 PM, Thomas Gleixner wrote:
 On Sat, 24 Jan 2015, Daniel Church wrote:
 
 POSIX.1-2001 specification of timer_getoverrun() supports constant
 DELAYTIMER_MAX which prevents overflow and caps overrun count.  Exposes
 delaytimer_max value to userland via /proc/sys/kernel/delaytimer_max such
 
 I know that you try to match the posix name, but when looking at the
 prctl w/o knowing about DELAYTIMER_MAX, it's non intuitive.
 
 Something like posixtimer_max_overruns or if you insist on the
 delaytimer_max part, then let us prepend it with posixtimer_ at least.
 
 New prctls require an update of Documentation/sysctl/kernel.txt.

Agreed, the name should be self-explanatory and point at POSIX.

The documentation should be detailed and explain exactly what this
is for, what are valid values, and why you would ever want to change
them.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] arch: uapi: asm: mman.h: Support MADV_FREE for madvise()

2014-12-01 Thread Carlos O'Donell

On Mon, Dec 1, 2014 at 4:35 PM, Geert Uytterhoeven  wrote:
> On Mon, Dec 1, 2014 at 9:52 PM, Max Filippov  wrote:
>> On Mon, Dec 1, 2014 at 11:43 PM, Chen Gang  wrote:
>>> At present, kernel supports madvise(MADV_FREE), so can benefit to all
>>> related architectures (can grep MADV_WILLNEED or MADV_REMOVE in "arch/"
>>> to know about all related architectures).
>>
>> A similar patch has been posted a while ago:
>>
>> http://www.spinics.net/lists/linux-mm/msg81538.html
>
> Would it be possible to use the same number everywhere?

Yes please. It's ridiculous that we still need patches like this.

I proposed unifying all this two years ago, but didn't follow up.

>From glibc's perspective it would be simpler if we started using the
same number everywhere.

http://www.spinics.net/lists/linux-api/msg02064.html

Cheers,
Carlos.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] arch: uapi: asm: mman.h: Support MADV_FREE for madvise()

2014-12-01 Thread Carlos O'Donell

On Mon, Dec 1, 2014 at 4:35 PM, Geert Uytterhoeven ge...@linux-m68k.org wrote:
 On Mon, Dec 1, 2014 at 9:52 PM, Max Filippov jcmvb...@gmail.com wrote:
 On Mon, Dec 1, 2014 at 11:43 PM, Chen Gang gang.chen.5...@gmail.com wrote:
 At present, kernel supports madvise(MADV_FREE), so can benefit to all
 related architectures (can grep MADV_WILLNEED or MADV_REMOVE in arch/
 to know about all related architectures).

 A similar patch has been posted a while ago:

 http://www.spinics.net/lists/linux-mm/msg81538.html

 Would it be possible to use the same number everywhere?

Yes please. It's ridiculous that we still need patches like this.

I proposed unifying all this two years ago, but didn't follow up.

From glibc's perspective it would be simpler if we started using the
same number everywhere.

http://www.spinics.net/lists/linux-api/msg02064.html

Cheers,
Carlos.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-08-04 Thread Carlos O'Donell

On 05/15/2014 04:19 PM, Michael Kerrisk (man-pages) wrote:
> On 05/15/2014 04:14 PM, Thomas Gleixner wrote:
>> On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:
>>> And that universe would love to have your documentation of
>>> FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),
>>
>> I give you almost the full treatment, but I leave REQUEUE_PI to Darren
>> and FUTEX_WAKE_OP to Jakub. :)
> 
> Thanks Thomas--that's fantastic! Hopefully, Darren and Jakub fill in those
> missing pieces...

Michael,

Do you need any help getting these additional futex error codes
into the linux kernel man pages project? Thomas provided the
missing bits and Darren commented... what else do we need?

I'm asking because I want to point other Red Hat engineers at
these pages to say: "these are the canonical error codes." 

We're trying to cleanup the userspace side of things.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-08-04 Thread Carlos O'Donell

On 05/15/2014 04:19 PM, Michael Kerrisk (man-pages) wrote:
 On 05/15/2014 04:14 PM, Thomas Gleixner wrote:
 On Thu, 15 May 2014, Michael Kerrisk (man-pages) wrote:
 And that universe would love to have your documentation of
 FUTEX_WAKE_BITSET and FUTEX_WAIT_BITSET ;-),

 I give you almost the full treatment, but I leave REQUEUE_PI to Darren
 and FUTEX_WAKE_OP to Jakub. :)
 
 Thanks Thomas--that's fantastic! Hopefully, Darren and Jakub fill in those
 missing pieces...

Michael,

Do you need any help getting these additional futex error codes
into the linux kernel man pages project? Thomas provided the
missing bits and Darren commented... what else do we need?

I'm asking because I want to point other Red Hat engineers at
these pages to say: these are the canonical error codes. 

We're trying to cleanup the userspace side of things.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH resend #7] fcntl-linux.h: add new definitions and manual updates for open file description locks

2014-07-29 Thread Carlos O'Donell

On 07/23/2014 02:21 PM, Jeff Layton wrote:
> From: Jeff Layton 

Thanks for resending. Sorry for the delay.

Your use of 3 different emails caused me to miss the recent
resends. That's my fault and tied to the way I'm tracking
everything from patchwork using the first email you used.

I am assuming that this work is under Red Hat's copyright
status since you submitted it originally from @redhat.com.

I've committed this to trunk with small tweaks and the following
NEWS entry:

* Support for file description locks is added to systems running the
  Linux kernel. The standard file locking interfaces are extended to
  operate on file descriptions, not file descriptors, via the use of
  F_OFD_GETLK, F_OFD_SETLK, and F_OFD_SETLKW. File description locks
  are associated with an open file instead of a process.

This will be in 2.20 when we cut the branch.
 
> Open file description locks have been merged into the Linux kernel for
> v3.15.  Add the appropriate command-value definitions and an update to
> the manual that describes their usage.
> 
> ChangeLog:
> 
> 2014-04-24  Jeff Layton  
> 
>   [BZ#16839]
>   * manual/llio.texi: add section about open file description locks
> 
>   * manual/examples/ofdlocks.c:
> example of open file description lock usage
> 
>   * sysdeps/unix/sysv/linux/bits/fcntl-linux.h:
> (F_OFD_GETLK, F_OFD_SETLK, F_OFD_SETLKW): New macros.

I simplified it a bit.

A space in the ChangeLog indicates a distinct commit, which this is not,
so I lump them together.

This is what went in:

2014-07-29  Jeff Layton  

[BZ #16839]
* manual/llio.texi: Add section about open file description locks.
* manual/examples/ofdlocks.c: Example of open file description
lock usage.
* sysdeps/unix/sysv/linux/bits/fcntl-linux.h: Define F_OFD_GETLK,
F_OFD_SETLK, and F_OFD_SETLKW.

As the committer I added #16839 to the fixed bug list following:

https://sourceware.org/glibc/wiki/Committer%20checklist

> ---
>  manual/examples/ofdlocks.c |  77 +
>  manual/llio.texi   | 241 
> -
>  sysdeps/unix/sysv/linux/bits/fcntl-linux.h |  17 ++
>  3 files changed, 332 insertions(+), 3 deletions(-)
>  create mode 100644 manual/examples/ofdlocks.c
> 
> diff --git a/manual/examples/ofdlocks.c b/manual/examples/ofdlocks.c
> new file mode 100644
> index ..85e193cdabe6
> --- /dev/null
> +++ b/manual/examples/ofdlocks.c
> @@ -0,0 +1,77 @@
> +/* Open File Description Locks Usage Example
> +   Copyright (C) 1991-2014 Free Software Foundation, Inc.
> +
> +   This program is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU General Public License
> +   as published by the Free Software Foundation; either version 2
> +   of the License, or (at your option) any later version.
> +
> +   This program is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +   GNU General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with this program; if not, see .
> +*/
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define FILENAME "/tmp/foo"
> +#define NUM_THREADS  3
> +#define ITERATIONS   5
> +
> +void *
> +thread_start (void *arg)
> +{
> +  int i, fd, len;
> +  long tid = (long) arg;
> +  char buf[256];
> +  struct flock lck = {
> +.l_whence = SEEK_SET,
> +.l_start = 0,
> +.l_len = 1,
> +  };
> +
> +  fd = open ("/tmp/foo", O_RDWR | O_CREAT, 0666);
> +
> +  for (i = 0; i < ITERATIONS; i++)
> +{
> +  lck.l_type = F_WRLCK;
> +  fcntl (fd, F_OFD_SETLKW, );
> +
> +  len = sprintf (buf, "%d: tid=%ld fd=%d\n", i, tid, fd);
> +
> +  lseek (fd, 0, SEEK_END);
> +  write (fd, buf, len);
> +  fsync (fd);
> +
> +  lck.l_type = F_UNLCK;
> +  fcntl (fd, F_OFD_SETLK, );
> +
> +  /* sleep to ensure lock is yielded to another thread */
> +  usleep (1);
> +}
> +  pthread_exit (NULL);
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  long i;
> +  pthread_t threads[NUM_THREADS];
> +
> +  truncate (FILENAME, 0);
> +
> +  for (i = 0; i < NUM_THREADS; i++)
> +pthread_create ([i], NULL, thread_start, (void *) i);
> +
> +  pthread_exit (NULL);
> +  return 0;
> +}
> diff --git a/manual/llio.texi b/manual/llio.texi
> index 6f8adfc607d7..864060dc7140 100644
> --- a/manual/llio.texi
> +++ b/manual/llio.texi
> @@ -57,6 +57,10 @@ directly.)
>   flags associated with open files.
>  * File Locks::  Fcntl commands for implementing
>   file locking.
> +* Open File Description Locks:: Fcntl commands for

Re: [PATCH resend #7] fcntl-linux.h: add new definitions and manual updates for open file description locks

2014-07-29 Thread Carlos O'Donell

On 07/23/2014 02:21 PM, Jeff Layton wrote:
 From: Jeff Layton jlay...@poochiereds.net

Thanks for resending. Sorry for the delay.

Your use of 3 different emails caused me to miss the recent
resends. That's my fault and tied to the way I'm tracking
everything from patchwork using the first email you used.

I am assuming that this work is under Red Hat's copyright
status since you submitted it originally from @redhat.com.

I've committed this to trunk with small tweaks and the following
NEWS entry:

* Support for file description locks is added to systems running the
  Linux kernel. The standard file locking interfaces are extended to
  operate on file descriptions, not file descriptors, via the use of
  F_OFD_GETLK, F_OFD_SETLK, and F_OFD_SETLKW. File description locks
  are associated with an open file instead of a process.

This will be in 2.20 when we cut the branch.
 
 Open file description locks have been merged into the Linux kernel for
 v3.15.  Add the appropriate command-value definitions and an update to
 the manual that describes their usage.
 
 ChangeLog:
 
 2014-04-24  Jeff Layton  jlay...@poochiereds.net
 
   [BZ#16839]
   * manual/llio.texi: add section about open file description locks
 
   * manual/examples/ofdlocks.c:
 example of open file description lock usage
 
   * sysdeps/unix/sysv/linux/bits/fcntl-linux.h:
 (F_OFD_GETLK, F_OFD_SETLK, F_OFD_SETLKW): New macros.

I simplified it a bit.

A space in the ChangeLog indicates a distinct commit, which this is not,
so I lump them together.

This is what went in:

2014-07-29  Jeff Layton  jlay...@poochiereds.net

[BZ #16839]
* manual/llio.texi: Add section about open file description locks.
* manual/examples/ofdlocks.c: Example of open file description
lock usage.
* sysdeps/unix/sysv/linux/bits/fcntl-linux.h: Define F_OFD_GETLK,
F_OFD_SETLK, and F_OFD_SETLKW.

As the committer I added #16839 to the fixed bug list following:

https://sourceware.org/glibc/wiki/Committer%20checklist

 ---
  manual/examples/ofdlocks.c |  77 +
  manual/llio.texi   | 241 
 -
  sysdeps/unix/sysv/linux/bits/fcntl-linux.h |  17 ++
  3 files changed, 332 insertions(+), 3 deletions(-)
  create mode 100644 manual/examples/ofdlocks.c
 
 diff --git a/manual/examples/ofdlocks.c b/manual/examples/ofdlocks.c
 new file mode 100644
 index ..85e193cdabe6
 --- /dev/null
 +++ b/manual/examples/ofdlocks.c
 @@ -0,0 +1,77 @@
 +/* Open File Description Locks Usage Example
 +   Copyright (C) 1991-2014 Free Software Foundation, Inc.
 +
 +   This program is free software; you can redistribute it and/or
 +   modify it under the terms of the GNU General Public License
 +   as published by the Free Software Foundation; either version 2
 +   of the License, or (at your option) any later version.
 +
 +   This program is distributed in the hope that it will be useful,
 +   but WITHOUT ANY WARRANTY; without even the implied warranty of
 +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +   GNU General Public License for more details.
 +
 +   You should have received a copy of the GNU General Public License
 +   along with this program; if not, see http://www.gnu.org/licenses/.
 +*/
 +
 +#define _GNU_SOURCE
 +#include stdio.h
 +#include sys/types.h
 +#include sys/stat.h
 +#include unistd.h
 +#include fcntl.h
 +#include pthread.h
 +
 +#define FILENAME /tmp/foo
 +#define NUM_THREADS  3
 +#define ITERATIONS   5
 +
 +void *
 +thread_start (void *arg)
 +{
 +  int i, fd, len;
 +  long tid = (long) arg;
 +  char buf[256];
 +  struct flock lck = {
 +.l_whence = SEEK_SET,
 +.l_start = 0,
 +.l_len = 1,
 +  };
 +
 +  fd = open (/tmp/foo, O_RDWR | O_CREAT, 0666);
 +
 +  for (i = 0; i  ITERATIONS; i++)
 +{
 +  lck.l_type = F_WRLCK;
 +  fcntl (fd, F_OFD_SETLKW, lck);
 +
 +  len = sprintf (buf, %d: tid=%ld fd=%d\n, i, tid, fd);
 +
 +  lseek (fd, 0, SEEK_END);
 +  write (fd, buf, len);
 +  fsync (fd);
 +
 +  lck.l_type = F_UNLCK;
 +  fcntl (fd, F_OFD_SETLK, lck);
 +
 +  /* sleep to ensure lock is yielded to another thread */
 +  usleep (1);
 +}
 +  pthread_exit (NULL);
 +}
 +
 +int
 +main (int argc, char **argv)
 +{
 +  long i;
 +  pthread_t threads[NUM_THREADS];
 +
 +  truncate (FILENAME, 0);
 +
 +  for (i = 0; i  NUM_THREADS; i++)
 +pthread_create (threads[i], NULL, thread_start, (void *) i);
 +
 +  pthread_exit (NULL);
 +  return 0;
 +}
 diff --git a/manual/llio.texi b/manual/llio.texi
 index 6f8adfc607d7..864060dc7140 100644
 --- a/manual/llio.texi
 +++ b/manual/llio.texi
 @@ -57,6 +57,10 @@ directly.)
   flags associated with open files.
  * File Locks::  Fcntl commands for implementing
   file locking.
 +* Open File Description Locks:: Fcntl

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/15/2014 04:25 AM, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 04:59:58PM -0400, Carlos O'Donell wrote:
>> I will make my personal opinion clear:
>>
>> - Internal defects should raise immediate assertions.
>>
>> - Real problems like resource availability, deadlocks, and
>>   other recoverable errors should result in the API returning
>>   an appropriate error code that must not diverge from the POSIX
>>   definitions for those codes (when such a definition exists).
>>
>> I'm not a believer in "only the hot path matters", there are such
>> things as robustness and error detection, and they matter.
> 
> Awesome. In case of doubt though, I would prefer a return to an assert,
> just in case userspace actually does know wtf its doing ;-)

No. In that case the person who knows attaches a debugger to determine
why the internal state is inconsistent. That may require kernel or glibc
debugging and asserting as close to the point of corruption is the only
useful behaviour. I know it's painful, but the number of people who know
what they are doing is vanishingly small compared to the other set.

> Granted, that seems to be very rare, but still, its entirely annoying
> for those few people who do care to get dead programs.
> 
> Alternatively, we could have something like you have for the allocator
> (which is, afaik, also considered a hot path) these env variables like
> MALLOC_CHECK_ to influence this edge behaviour.

We are considering a runtime tunnables framework to unify all of these
kinds of tweaks into a stable API. Given that asserting or not asserting
does not impact the standards conformance we could make that a tunnable
with the default being to assert. The tunnables framework is still pie
in the sky because we need a low-overhead framework to check the global
tunnables. However, we need them, as I've mentioned before as an example
we have an ancient 40MB stack cache in glibc for thread stack reuse that
nobody remembers why it was tuned to that value. Magic.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/15/2014 04:07 AM, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 05:17:35PM -0400, Carlos O'Donell wrote:
>>> No, its perfectly fine to have a lock sequence abort with -EDEADLK.
>>> Userspace should release its locks and re-attempt.
>>
>> I agree. If I can prove that it's actually a deadlock, and
>> that unlock/relock will work to fix it, then we can arrange for glibc
>> to return EDEADLK.
> 
> The only reason the kernel would return EDEADLK is because its walked
> the lock graph and determined its well, a deadlock.

Perfect. No further comments from me then.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/14/2014 06:28 AM, Thomas Gleixner wrote:
> On Wed, 14 May 2014, Peter Zijlstra wrote:
>> On Wed, May 14, 2014 at 11:53:44AM +0200, Thomas Gleixner wrote:
 What error would we return?

 This particular case is a serious error for which we have no good error 
 code
 to return to userspace. It's an implementation defect, a bug, we should 
 probably
 assert instead of pausing.
>>>
>>> Errm.
>>>
>>> http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html
>>>
>>>  The pthread_mutex_lock() function may fail if:
>>>
>>>   [EDEADLK]
>>> The current thread already owns the mutex. 
>>>
>>> That's a exactly the error code, which the kernel returns when it
>>> detects a deadlock. 
>>>
>>> And glibc returns EDEADLK at a lot of places already. So in that case
>>> it's not a serious error? Because it's detected by glibc. You can't be
>>> serious about that.
>>>
>>> So why is a kernel detected deadlock different? Because it detects not
>>> only AA, it detects ABBA and more. But it's still a dead lock. And
>>> while posix spec only talks about AA, it's the very same issue.
>>>
>>> So why not propagate this to the caller so he gets an alert right away
>>> instead of letting him attach a debugger, and scratch his head and
>>> lookup glibc source to find out why the hell glibc called pause.
>>
>>   
>> http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
> 
> Yuck. I should not have used the first link Gurgle brought up.

For the record the correct link is for POSIX Issue 7 (Issue 8 under 
development).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html

The Issue 7 version has a nice table :}

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/14/2014 07:11 PM, Thomas Gleixner wrote:
> On Wed, 14 May 2014, Carlos O'Donell wrote:
>> On 05/14/2014 05:22 AM, Peter Zijlstra wrote:
>>>>> I believe the thinking goes that if we get to here, then the lock is in an
>>>>> inconsistent state (between kernel and userspace). I don't have an answer 
>>>>> for
>>>>> why pausing forever would be preferable to returning an error however...
>>>>
>>>> What error would we return?
>>>
>>> EDEADLK is a valid user return for pthread_mutex_lock() as per:
>>>
>>>   
>>> http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
>>
>> How is that correct? It isn't a deadlock we've detected but inconsistent
>> state between glibc and the kernel. In this case glibc should assert.
>> Delaying indefinitely with pause() never seems correct (despite that being
>> what we do today).
> 
> If there is inconsistent state detected then the kernel will return
> -EPERM or -EINVAL. So lets put inconsistent state aside.

OK.
 
> In glibc you only can detect the simple AA dead lock, i.e lock owner
> tries to lock the lock it owns again. Trivial, right ?

Agreed.

> But glibc has no idea which lock chains are involved and might lead to
> a dead lock caused by nested locking, simplest and most popular being
> ABBA.

OK.

> The kernel can (if the implementation is fixed, patch is available
> already) very well detect ABBA and even more complex nested lock
> deadlocks. So it rightfully returns -EDEADLK and that is completely
> correct versus the spec and the call site can do something about it.

OK.

> And that's not different from the glibc detected AA deadlock at
> all. It's just detected by a different mechanism.

OK.

> On kernel side we currently provide this service only for the PI
> futexes because we have a kernel side state representation as long as
> the user space state is not corrupted.

OK.

> Back then when it was implemented the dead lock detection actually
> worked and was agreed on by both sides - kernel and glibc - to be
> usefull and essential to the whole endavour.

I agree that ignoring the situation of corrupted or inconsistent
state we should be returning EDEADLK to userspace.

We'll cleanup glibc.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/14/2014 07:11 PM, Thomas Gleixner wrote:
 On Wed, 14 May 2014, Carlos O'Donell wrote:
 On 05/14/2014 05:22 AM, Peter Zijlstra wrote:
 I believe the thinking goes that if we get to here, then the lock is in an
 inconsistent state (between kernel and userspace). I don't have an answer 
 for
 why pausing forever would be preferable to returning an error however...

 What error would we return?

 EDEADLK is a valid user return for pthread_mutex_lock() as per:

   
 http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html

 How is that correct? It isn't a deadlock we've detected but inconsistent
 state between glibc and the kernel. In this case glibc should assert.
 Delaying indefinitely with pause() never seems correct (despite that being
 what we do today).
 
 If there is inconsistent state detected then the kernel will return
 -EPERM or -EINVAL. So lets put inconsistent state aside.

OK.
 
 In glibc you only can detect the simple AA dead lock, i.e lock owner
 tries to lock the lock it owns again. Trivial, right ?

Agreed.

 But glibc has no idea which lock chains are involved and might lead to
 a dead lock caused by nested locking, simplest and most popular being
 ABBA.

OK.

 The kernel can (if the implementation is fixed, patch is available
 already) very well detect ABBA and even more complex nested lock
 deadlocks. So it rightfully returns -EDEADLK and that is completely
 correct versus the spec and the call site can do something about it.

OK.

 And that's not different from the glibc detected AA deadlock at
 all. It's just detected by a different mechanism.

OK.

 On kernel side we currently provide this service only for the PI
 futexes because we have a kernel side state representation as long as
 the user space state is not corrupted.

OK.

 Back then when it was implemented the dead lock detection actually
 worked and was agreed on by both sides - kernel and glibc - to be
 usefull and essential to the whole endavour.

I agree that ignoring the situation of corrupted or inconsistent
state we should be returning EDEADLK to userspace.

We'll cleanup glibc.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/14/2014 06:28 AM, Thomas Gleixner wrote:
 On Wed, 14 May 2014, Peter Zijlstra wrote:
 On Wed, May 14, 2014 at 11:53:44AM +0200, Thomas Gleixner wrote:
 What error would we return?

 This particular case is a serious error for which we have no good error 
 code
 to return to userspace. It's an implementation defect, a bug, we should 
 probably
 assert instead of pausing.

 Errm.

 http://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_lock.html

  The pthread_mutex_lock() function may fail if:

   [EDEADLK]
 The current thread already owns the mutex. 

 That's a exactly the error code, which the kernel returns when it
 detects a deadlock. 

 And glibc returns EDEADLK at a lot of places already. So in that case
 it's not a serious error? Because it's detected by glibc. You can't be
 serious about that.

 So why is a kernel detected deadlock different? Because it detects not
 only AA, it detects ABBA and more. But it's still a dead lock. And
 while posix spec only talks about AA, it's the very same issue.

 So why not propagate this to the caller so he gets an alert right away
 instead of letting him attach a debugger, and scratch his head and
 lookup glibc source to find out why the hell glibc called pause.

   
 http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
 
 Yuck. I should not have used the first link Gurgle brought up.

For the record the correct link is for POSIX Issue 7 (Issue 8 under 
development).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html

The Issue 7 version has a nice table :}

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/15/2014 04:07 AM, Peter Zijlstra wrote:
 On Wed, May 14, 2014 at 05:17:35PM -0400, Carlos O'Donell wrote:
 No, its perfectly fine to have a lock sequence abort with -EDEADLK.
 Userspace should release its locks and re-attempt.

 I agree. If I can prove that it's actually a deadlock, and
 that unlock/relock will work to fix it, then we can arrange for glibc
 to return EDEADLK.
 
 The only reason the kernel would return EDEADLK is because its walked
 the lock graph and determined its well, a deadlock.

Perfect. No further comments from me then.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 0/3] futex/rtmutex: Fix issues exposed by trinity

2014-05-16 Thread Carlos O'Donell

On 05/15/2014 04:25 AM, Peter Zijlstra wrote:
 On Wed, May 14, 2014 at 04:59:58PM -0400, Carlos O'Donell wrote:
 I will make my personal opinion clear:

 - Internal defects should raise immediate assertions.

 - Real problems like resource availability, deadlocks, and
   other recoverable errors should result in the API returning
   an appropriate error code that must not diverge from the POSIX
   definitions for those codes (when such a definition exists).

 I'm not a believer in only the hot path matters, there are such
 things as robustness and error detection, and they matter.
 
 Awesome. In case of doubt though, I would prefer a return to an assert,
 just in case userspace actually does know wtf its doing ;-)

No. In that case the person who knows attaches a debugger to determine
why the internal state is inconsistent. That may require kernel or glibc
debugging and asserting as close to the point of corruption is the only
useful behaviour. I know it's painful, but the number of people who know
what they are doing is vanishingly small compared to the other set.

 Granted, that seems to be very rare, but still, its entirely annoying
 for those few people who do care to get dead programs.
 
 Alternatively, we could have something like you have for the allocator
 (which is, afaik, also considered a hot path) these env variables like
 MALLOC_CHECK_ to influence this edge behaviour.

We are considering a runtime tunnables framework to unify all of these
kinds of tweaks into a stable API. Given that asserting or not asserting
does not impact the standards conformance we could make that a tunnable
with the default being to assert. The tunnables framework is still pie
in the sky because we need a low-overhead framework to check the global
tunnables. However, we need them, as I've mentioned before as an example
we have an ancient 40MB stack cache in glibc for thread stack reuse that
nobody remembers why it was tuned to that value. Magic.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/14/2014 08:28 PM, H. Peter Anvin wrote:
> On 05/14/2014 01:56 PM, Davidlohr Bueso wrote:
>>>
 However, unless I'm sorely mistaken, the larger problem is that glibc
 removed the futex() call entirely, so these man pages don't describe
>>>
>>> I don't think futex() ever was in glibc--that's by design, and
>>> completely understandable: no user-space application would want to
>>> directly use futex(). 
>>
>> That's actually not quite true. There are plenty of software efforts out
>> there that use futex calls directly to implement userspace serialization
>> mechanisms as an alternative to the bulky sysv semaphores. I worked
>> closely with an in-memory DB project that makes heavy use of them. Not
>> everyone can simply rely on pthreads.
>>
> 
> More fundamentally, futex(2), like clone(2), are things that can be
> legitimately by user space without automatically breaking all of glibc.
>  There are some other things where that is *not* true, because glibc
> relies on being able to mediate all accesses to a kernel facility, but
> not here.

Careful there. There is *some* danger in using clone(2) because of the
coordination required to implement thread-local storage. I'm sure you're
aware of this, but I'd like the record to show that we're going to need
clear documentation of what's considered safe given the known
implementations.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/15/2014 09:49 AM, Michael Kerrisk (man-pages) wrote:
> On Thu, May 15, 2014 at 3:22 PM, Peter Zijlstra  wrote:
>> On Thu, May 15, 2014 at 09:18:22AM -0400, Carlos O'Donell wrote:
>>> On 05/15/2014 04:14 AM, Peter Zijlstra wrote:
>>>> On Wed, May 14, 2014 at 04:23:38PM -0400, Carlos O'Donell wrote:
>>>>> There are other syscalls like gettid() that have a:
>>>>> NOTE: There is no glibc wrapper for this system call; see NOTES.
>>>>
>>>> Yes, can we finally fix that please? It gets tedious having to endlessly
>>>> copy/paste that thing around.
>>>
>>> What exactly would you like fixed?
>>
>> Not having gettid() in glibc.
> 
> Get in the line ;-).
> http://sourceware.org/bugzilla/show_bug.cgi?id=6399

I have no objections to this, but I absolutely object to this without
someone documenting and gathering consensus for consistent terminology
to be used between the kernel and glibc.

The relevant comment is here:
https://sourceware.org/bugzilla/show_bug.cgi?id=6399#c26

I'd like to see a glibc manual patch for the threads.texi file, which
can be completely linux-specific, to document gettid() and nomenclature.
It should talk about the nomenclature used to discuss these interfaces
and explain when it is or isn't valid to use a task id and with what 
functions.

For example does gettid *really* return a pid_t as considered by
userspace? It's not a full out process...

Cheers,
Carlos.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/15/2014 04:14 AM, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 04:23:38PM -0400, Carlos O'Donell wrote:
>> There are other syscalls like gettid() that have a:
>> NOTE: There is no glibc wrapper for this system call; see NOTES.
> 
> Yes, can we finally fix that please? It gets tedious having to endlessly
> copy/paste that thing around.

What exactly would you like fixed?

Cheers,
Carlos.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/15/2014 04:14 AM, Peter Zijlstra wrote:
 On Wed, May 14, 2014 at 04:23:38PM -0400, Carlos O'Donell wrote:
 There are other syscalls like gettid() that have a:
 NOTE: There is no glibc wrapper for this system call; see NOTES.
 
 Yes, can we finally fix that please? It gets tedious having to endlessly
 copy/paste that thing around.

What exactly would you like fixed?

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/15/2014 09:49 AM, Michael Kerrisk (man-pages) wrote:
 On Thu, May 15, 2014 at 3:22 PM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, May 15, 2014 at 09:18:22AM -0400, Carlos O'Donell wrote:
 On 05/15/2014 04:14 AM, Peter Zijlstra wrote:
 On Wed, May 14, 2014 at 04:23:38PM -0400, Carlos O'Donell wrote:
 There are other syscalls like gettid() that have a:
 NOTE: There is no glibc wrapper for this system call; see NOTES.

 Yes, can we finally fix that please? It gets tedious having to endlessly
 copy/paste that thing around.

 What exactly would you like fixed?

 Not having gettid() in glibc.
 
 Get in the line ;-).
 http://sourceware.org/bugzilla/show_bug.cgi?id=6399

I have no objections to this, but I absolutely object to this without
someone documenting and gathering consensus for consistent terminology
to be used between the kernel and glibc.

The relevant comment is here:
https://sourceware.org/bugzilla/show_bug.cgi?id=6399#c26

I'd like to see a glibc manual patch for the threads.texi file, which
can be completely linux-specific, to document gettid() and nomenclature.
It should talk about the nomenclature used to discuss these interfaces
and explain when it is or isn't valid to use a task id and with what 
functions.

For example does gettid *really* return a pid_t as considered by
userspace? It's not a full out process...

Cheers,
Carlos.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: futex(2) man page update help request

2014-05-15 Thread Carlos O'Donell

On 05/14/2014 08:28 PM, H. Peter Anvin wrote:
 On 05/14/2014 01:56 PM, Davidlohr Bueso wrote:

 However, unless I'm sorely mistaken, the larger problem is that glibc
 removed the futex() call entirely, so these man pages don't describe

 I don't think futex() ever was in glibc--that's by design, and
 completely understandable: no user-space application would want to
 directly use futex(). 

 That's actually not quite true. There are plenty of software efforts out
 there that use futex calls directly to implement userspace serialization
 mechanisms as an alternative to the bulky sysv semaphores. I worked
 closely with an in-memory DB project that makes heavy use of them. Not
 everyone can simply rely on pthreads.

 
 More fundamentally, futex(2), like clone(2), are things that can be
 legitimately by user space without automatically breaking all of glibc.
  There are some other things where that is *not* true, because glibc
 relies on being able to mediate all accesses to a kernel facility, but
 not here.

Careful there. There is *some* danger in using clone(2) because of the
coordination required to implement thread-local storage. I'm sure you're
aware of this, but I'd like the record to show that we're going to need
clear documentation of what's considered safe given the known
implementations.

Cheers,
Carlos.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 130 matches

Mail list logo