Re: [PATCH v2 1/6] aarch64: Fix missing BTI protection from dependencies [BZ #26926]

2020-12-11 Thread Szabolcs Nagy
The 12/10/2020 14:51, Adhemerval Zanella wrote:
> On 27/11/2020 10:19, Szabolcs Nagy via Libc-alpha wrote:
> > The _dl_open_check and _rtld_main_check hooks are not called on the
> > dependencies of a loaded module, so BTI protection was missed on
> > every module other than the main executable and directly dlopened
> > libraries.
> > 
> > The fix just iterates over dependencies to enable BTI.
> > 
> > Fixes bug 26926.
> 
> LGTM, modulo the argument name change.
> 
> I also think it would be better to add a testcase, for both DT_NEEDED
> and dlopen case.

thanks, i committed this with fixed argument name as attached.

i plan to do tests later once i understand the BTI semantics
(right now it's not clear if in the presence of some W^X
policy BTI may be silently ignored or not).
From 72739c79f61989a76b7dd719f34fcfb7b8eadde9 Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy 
Date: Fri, 20 Nov 2020 15:27:06 +
Subject: [PATCH] aarch64: Fix missing BTI protection from dependencies [BZ #26926]

The _dl_open_check and _rtld_main_check hooks are not called on the
dependencies of a loaded module, so BTI protection was missed on
every module other than the main executable and directly dlopened
libraries.

The fix just iterates over dependencies to enable BTI.

Fixes bug 26926.
---
 sysdeps/aarch64/dl-bti.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 196e462520..56c097210a 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -51,11 +51,24 @@ enable_bti (struct link_map *map, const char *program)
   return 0;
 }
 
-/* Enable BTI for L if required.  */
+/* Enable BTI for L and its dependencies.  */
 
 void
 _dl_bti_check (struct link_map *l, const char *program)
 {
-  if (GLRO(dl_aarch64_cpu_features).bti && l->l_mach.bti)
+  if (!GLRO(dl_aarch64_cpu_features).bti)
+    return;
+
+  if (l->l_mach.bti)
     enable_bti (l, program);
+
+  unsigned int i = l->l_searchlist.r_nlist;
+  while (i-- > 0)
+    {
+      struct link_map *dep = l->l_initfini[i];
+      if (dep->l_init_called)
+	continue;
+      if (dep->l_mach.bti)
+	enable_bti (dep, program);
+    }
 }
-- 
2.17.1



Re: [PATCH v2 3/6] elf: Fix failure handling in _dl_map_object_from_fd

2020-12-11 Thread Szabolcs Nagy
The 12/10/2020 15:25, Adhemerval Zanella wrote:
> On 27/11/2020 10:20, Szabolcs Nagy via Libc-alpha wrote:
> > There are many failure paths that call lose to do local cleanups
> > in _dl_map_object_from_fd, but it did not clean everything.
> > 
> > Handle l_phdr, l_libname and mapped segments in the common failure
> > handling code.
> > 
> > There are various bits that may not be cleaned properly on failure
> > (e.g. executable stack, tlsid, incomplete dl_map_segments).
> > ---
> >  elf/dl-load.c | 24 +++-
> >  1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/elf/dl-load.c b/elf/dl-load.c
> > index 21e55deb19..9c71b7562c 100644
> > --- a/elf/dl-load.c
> > +++ b/elf/dl-load.c
> > @@ -914,8 +914,15 @@ lose (int code, int fd, const char *name, char *realname, struct link_map *l,
> >/* The file might already be closed.  */
> >if (fd != -1)
> >  (void) __close_nocancel (fd);
> > +  if (l != NULL && l->l_map_start != 0)
> > +_dl_unmap_segments (l);
> >if (l != NULL && l->l_origin != (char *) -1l)
> >  free ((char *) l->l_origin);
> > +  if (l != NULL && !l->l_libname->dont_free)
> > +free (l->l_libname);
> > +  if (l != NULL && l->l_phdr_allocated)
> > +free ((void *) l->l_phdr);
> > +
> >free (l);
> >free (realname);
> >  
> > @@ -1256,7 +1263,11 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
> >  errstring = _dl_map_segments (l, fd, header, type, loadcmds, nloadcmds,
> >   maplength, has_holes, loader);
> >  if (__glibc_unlikely (errstring != NULL))
> > -  goto call_lose;
> > +  {
> > +   /* Mappings can be in an inconsistent state: avoid unmap.  */
> > +   l->l_map_start = l->l_map_end = 0;
> > +   goto call_lose;
> > +  }
> >  
> >  /* Process program headers again after load segments are mapped in
> > case processing requires accessing those segments.  Scan program
> 
> In this case I am failing to see who would be responsible to unmap
> l_map_start in the type == ET_DYN case where the first mmap succeeds
> but a later mmap fails in any load command.

failures are either cleaned up locally in this function
via lose or after a clean return via dlclose.

failures that are not cleaned up will leak resources.

_dl_map_segments failure is not cleaned up (the mappings
are in an unknown state). however after a successful
_dl_map_segments later failures can clean the mappings
and that's what i fixed here.

i did not try to fix transitive design bugs (such as
leaks in _dl_map_segments) that would require interface
change or local cleanups in those other functions.

> > @@ -1294,14 +1305,6 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
> >|| (__glibc_unlikely (l->l_flags_1 & DF_1_PIE)
> >   && __glibc_unlikely ((mode & __RTLD_OPENEXEC) == 0)))
> >  {
> > -  /* We are not supposed to load this object.  Free all resources.  */
> > -  _dl_unmap_segments (l);
> > -
> > -  if (!l->l_libname->dont_free)
> > -   free (l->l_libname);
> > -
> > -  if (l->l_phdr_allocated)
> > -   free ((void *) l->l_phdr);
> >  
> >if (l->l_flags_1 & DF_1_PIE)
> > errstring
> > @@ -1392,6 +1395,9 @@ cannot enable executable stack as shared object requires");
> >/* Signal that we closed the file.  */
> >fd = -1;
> >  
> > +  /* Failures before this point are handled locally via lose.
> > + No more failures are allowed in this function until return.  */
> > +
> >/* If this is ET_EXEC, we should have loaded it as lt_executable.  */
> >assert (type != ET_EXEC || l->l_type == lt_executable);
> >  
> > 
> 
> Ok.


Re: [PATCH v2 0/6] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-12-07 Thread Szabolcs Nagy
The 12/03/2020 17:30, Catalin Marinas wrote:
> On Fri, Nov 27, 2020 at 01:19:16PM +0000, Szabolcs Nagy wrote:
> > This is v2 of
> > https://sourceware.org/pipermail/libc-alpha/2020-November/119305.html
> > 
> > To enable BTI support, re-mmap executable segments instead of
> > mprotecting them in case mprotect is seccomp filtered.
> > 
> > I would like linux to change to map the main exe with PROT_BTI when
> > that is marked as BTI compatible. From the linux side i heard the
> > following concerns about this:
> > - it's an ABI change so requires some ABI bump. (this is fine with
> >   me, i think glibc does not care about backward compat as nothing
> >   can reasonably rely on the current behaviour, but if we have a
> >   new bit in auxv or similar then we can save one mprotect call.)
> 
> I'm not concerned about the ABI change but there are workarounds like a
> new auxv bit.
> 
> > - in case we discover compatibility issues with user binaries it's
> >   better if userspace can easily disable BTI (e.g. removing the
> >   mprotect based on some env var, but if kernel adds PROT_BTI and
> >   mprotect is filtered then we have no reliable way to remove that
> >   from executables. this problem already exists for static linked
> >   exes, although admittedly those are less of a compat concern.)
> 
> This is our main concern. For static binaries, the linker could detect,
> in theory, potential issues when linking and not set the corresponding
> ELF information.
> 
> At runtime, a dynamic linker could detect issues and avoid enabling BTI.
> In both cases, it's a (static or dynamic) linker decision that belongs
> in user-space.

note that the marking is tied to an elf module: if the static
linker can be trusted to produce correct marking then both the
static and dynamic linking cases work, otherwise neither works.
(the dynamic linker cannot detect bti issues, just apply user
supplied policy.)

1) if we consider bti part of the semantics of a marked module
then it should be always on if the system supports it and
ideally the loader of the module should deal with PROT_BTI.
(and if the marking is wrong then the binary is wrong.)

2) if we consider the marking to be a compatibility indicator
and let userspace policy to decide what to do with it then the
static exe and vdso cases should be handled by that policy too.
(this makes sense if we expect that there are reasons to turn
bti off for a process independently of markings. this requires
the static linking startup code to do the policy decision and
self-apply PROT_BTI early.)

the current code does not fit either case well, but i was
planning to do (1). and ideally PROT_BTI would be added
reliably, but a best effort only PROT_BTI works too, however
it limits our ability to report real mprotect failures.

> > - ideally PROT_BTI would be added via a new syscall that does not
> >   interfere with PROT_EXEC filtering. (this does not conflict with
> >   the current patches: even with a new syscall we need a fallback.)
> 
> This can be discussed as a long term solution.
> 
> > - solve it in systemd (e.g. turn off the filter, use better filter):
> >   i would prefer not to have aarch64 (or BTI) specific policy in
> >   user code. and there was no satisfying way to do this portably.
> 
> I agree. I think the best for now (as a back-portable glibc fix) is to
> ignore the mprotect(PROT_EXEC|PROT_BTI) error that the dynamic loader
> gets. BTI will be disabled if MDWX is enabled.

ok.

we got back to the original proposal: silently ignore mprotect
failures. i'm still considering the mmap solution for libraries
only: at least then libraries are handled reliably on current
setups, but i will have to think about whether attack targets
are mainly in libraries like libc or in executables.

> 
> In the meantime, we should start (continue) looking at a solution that
> works for both systemd and the kernel and be generic enough for other
> architectures. The stateless nature of the current SECCOMP approach is
> not suitable for this W^X policy. Kees had some suggestions here but the
> thread seems to have died:
> 
> https://lore.kernel.org/kernel-hardening/202010221256.A4F95FD11@keescook/

it sounded like better W^X enforcement won't happen any time soon.


[PATCH v3 2/2] aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]

2020-12-02 Thread Szabolcs Nagy
Re-mmap executable segments if possible instead of using mprotect
to add PROT_BTI. This allows using BTI protection with security
policies that prevent mprotect with PROT_EXEC.

If the fd of the ELF module is not available because it was kernel
mapped then mprotect is used and failures are ignored.  To protect
the main executable even when mprotect is filtered the linux kernel
will have to be changed to add PROT_BTI to it.

The delayed failure reporting is mainly needed because currently
_dl_process_gnu_properties does not propagate failures such that
the required cleanups happen. Using the link_map_machine struct for
error propagation is not ideal, but this seemed to be the least
intrusive solution.

Fixes bug 26831.
---
v3:
- split the last patch in two.
- pushed to nsz/btifix-v3 branch.

 sysdeps/aarch64/dl-bti.c  | 54 ++-
 sysdeps/aarch64/dl-prop.h |  8 +-
 sysdeps/aarch64/linkmap.h |  2 +-
 3 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 67d63c8a73..ff26c98ccf 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -19,9 +19,17 @@
 #include <errno.h>
 #include <libintl.h>
 #include <ldsodefs.h>
+#include <sys/mman.h>
 
-static void
-enable_bti (struct link_map *map, const char *program)
+/* See elf/dl-load.h.  */
+#ifndef MAP_COPY
+# define MAP_COPY (MAP_PRIVATE | MAP_DENYWRITE)
+#endif
+
+/* Enable BTI protection for MAP.  */
+
+void
+_dl_bti_protect (struct link_map *map, int fd)
 {
   const size_t pagesz = GLRO(dl_pagesize);
   const ElfW(Phdr) *phdr;
@@ -41,19 +49,31 @@ enable_bti (struct link_map *map, const char *program)
 	if (phdr->p_flags & PF_W)
 	  prot |= PROT_WRITE;
 
-	if (__mprotect (start, len, prot) < 0)
-	  {
-	    if (program)
-	      _dl_fatal_printf ("%s: mprotect failed to turn on BTI\n",
-				map->l_name);
-	    else
-	      _dl_signal_error (errno, map->l_name, "dlopen",
-				N_("mprotect failed to turn on BTI"));
-	  }
+	if (fd == -1)
+	  /* Ignore failures for kernel mapped binaries.  */
+	  __mprotect (start, len, prot);
+	else
+	  map->l_mach.bti_fail = __mmap (start, len, prot,
+					 MAP_FIXED|MAP_COPY|MAP_FILE,
+					 fd, off) == MAP_FAILED;
       }
 }
 
-/* Enable BTI for MAP and its dependencies.  */
+
+static void
+bti_failed (struct link_map *l, const char *program)
+{
+  if (program)
+    _dl_fatal_printf ("%s: %s: failed to turn on BTI protection\n",
+		      program, l->l_name);
+  else
+    /* Note: the errno value is not available any more.  */
+    _dl_signal_error (0, l->l_name, "dlopen",
+		      N_("failed to turn on BTI protection"));
+}
+
+
+/* Report BTI protection failures for MAP and its dependencies.  */
 
 void
 _dl_bti_check (struct link_map *map, const char *program)
@@ -61,16 +81,14 @@ _dl_bti_check (struct link_map *map, const char *program)
   if (!GLRO(dl_aarch64_cpu_features).bti)
     return;
 
-  if (map->l_mach.bti)
-    enable_bti (map, program);
+  if (map->l_mach.bti_fail)
+    bti_failed (map, program);
 
   unsigned int i = map->l_searchlist.r_nlist;
   while (i-- > 0)
     {
       struct link_map *l = map->l_initfini[i];
-      if (l->l_init_called)
-	continue;
-      if (l->l_mach.bti)
-	enable_bti (l, program);
+      if (l->l_mach.bti_fail)
+	bti_failed (l, program);
     }
 }
diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index 2016d1472e..e926e54984 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -19,6 +19,8 @@
 #ifndef _DL_PROP_H
 #define _DL_PROP_H
 
+extern void _dl_bti_protect (struct link_map *, int) attribute_hidden;
+
 extern void _dl_bti_check (struct link_map *, const char *)
 attribute_hidden;
 
@@ -43,6 +45,10 @@ static inline int
 _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
  uint32_t datasz, void *data)
 {
+  if (!GLRO(dl_aarch64_cpu_features).bti)
+    /* Skip note processing.  */
+    return 0;
+
   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND)
 {
   /* Stop if the property note is ill-formed.  */
@@ -51,7 +57,7 @@ _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
 
   unsigned int feature_1 = *(unsigned int *) data;
   if (feature_1 & GNU_PROPERTY_AARCH64_FEATURE_1_BTI)
-	l->l_mach.bti = true;
+	_dl_bti_protect (l, fd);
 
   /* Stop if we processed the property note.  */
   return 0;
diff --git a/sysdeps/aarch64/linkmap.h b/sysdeps/aarch64/linkmap.h
index 847a03ace2..b3f7663b07 100644
--- a/sysdeps/aarch64/linkmap.h
+++ b/sysdeps/aarch64/linkmap.h
@@ -22,5 +22,5 @@ struct link_map_machine
 {
   ElfW(Addr) plt;/* Address of .plt */
   void *tlsdesc_table;   /* Address of TLS descriptor hash table.  */
-  bool bti;
+  bool bti_fail;
 };

[PATCH v3 1/2] aarch64: align address for BTI protection [BZ #26988]

2020-12-02 Thread Szabolcs Nagy
Handle unaligned executable load segments (the bfd linker is not
expected to produce such binaries, but other linkers may).

Computing the mapping bounds follows _dl_map_object_from_fd more
closely now.

Fixes bug 26988.
---
v3:
- split the last patch in two so this bug is fixed separately.
- pushed to nsz/btifix-v3 branch.

 sysdeps/aarch64/dl-bti.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 8f4728adce..67d63c8a73 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -20,19 +20,22 @@
 #include <libintl.h>
 #include <ldsodefs.h>
 
-static int
+static void
 enable_bti (struct link_map *map, const char *program)
 {
+  const size_t pagesz = GLRO(dl_pagesize);
   const ElfW(Phdr) *phdr;
-  unsigned prot;
 
   for (phdr = map->l_phdr; phdr < &map->l_phdr[map->l_phnum]; ++phdr)
     if (phdr->p_type == PT_LOAD && (phdr->p_flags & PF_X))
       {
-	void *start = (void *) (phdr->p_vaddr + map->l_addr);
-	size_t len = phdr->p_memsz;
+	size_t vstart = ALIGN_DOWN (phdr->p_vaddr, pagesz);
+	size_t vend = ALIGN_UP (phdr->p_vaddr + phdr->p_filesz, pagesz);
+	off_t off = ALIGN_DOWN (phdr->p_offset, pagesz);
+	void *start = (void *) (vstart + map->l_addr);
+	size_t len = vend - vstart;
 
-	prot = PROT_EXEC | PROT_BTI;
+	unsigned prot = PROT_EXEC | PROT_BTI;
 	if (phdr->p_flags & PF_R)
 	  prot |= PROT_READ;
 	if (phdr->p_flags & PF_W)
@@ -48,7 +51,6 @@ enable_bti (struct link_map *map, const char *program)
 				N_("mprotect failed to turn on BTI"));
 	  }
       }
-  return 0;
 }
 
 /* Enable BTI for MAP and its dependencies.  */
-- 
2.17.1



Re: [PATCH v2 0/6] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-30 Thread Szabolcs Nagy
The 11/27/2020 13:19, Szabolcs Nagy via Libc-alpha wrote:
> This is v2 of
> https://sourceware.org/pipermail/libc-alpha/2020-November/119305.html
> 
> To enable BTI support, re-mmap executable segments instead of
> mprotecting them in case mprotect is seccomp filtered.
> 
> I would like linux to change to map the main exe with PROT_BTI when
> that is marked as BTI compatible. From the linux side i heard the
> following concerns about this:
> - it's an ABI change so requires some ABI bump. (this is fine with
>   me, i think glibc does not care about backward compat as nothing
>   can reasonably rely on the current behaviour, but if we have a
>   new bit in auxv or similar then we can save one mprotect call.)
> - in case we discover compatibility issues with user binaries it's
>   better if userspace can easily disable BTI (e.g. removing the
>   mprotect based on some env var, but if kernel adds PROT_BTI and
>   mprotect is filtered then we have no reliable way to remove that
>   from executables. this problem already exists for static linked
>   exes, although admittedly those are less of a compat concern.)
> - ideally PROT_BTI would be added via a new syscall that does not
>   interfere with PROT_EXEC filtering. (this does not conflict with
>   the current patches: even with a new syscall we need a fallback.)
> - solve it in systemd (e.g. turn off the filter, use better filter):
>   i would prefer not to have aarch64 (or BTI) specific policy in
>   user code. and there was no satisfying way to do this portably.
> 
> Other concerns about the approach:
> - mmap is more expensive than mprotect: in my measurements using
>   mmap instead of mprotect is 3-8x slower (and after mmap pages
>   have to be faulted in again), but e.g. the exec time of a program
>   with 4 deps only increases by < 8% due to the 4 new mmaps. (the
>   kernel side resource usage may increase too, i didn't look at that.)

i tested glibc build time with mprotect vs mmap
which should be exec heavy.

the real time overhead was < 0.2% on a particular
4 core system with linux 5.3 ubuntu kernel, which
i consider to be small.

(used PROT_EXEC without PROT_BTI for the measurement).


> - _dl_signal_error is not valid from the _dl_process_gnu_property
>   hook. The v2 set addresses this problem: i could either propagate
>   the errors up until they can be handled or solve it in the aarch64
>   backend by first recording failures and then dealing with them in
>   _dl_open_check. I choose the latter, but did some refactorings in
>   _dl_map_object_from_fd that makes the former possible too.
> 
> v2:
> - [1/6]: New patch that fixes a missed BTI bug found during v2.
> - [2-3/6]: New, _dl_map_object_from_fd failure handling improvements,
>   these are independent of the rest of the series.
> - [4/6]: Move the note handling to a different place (after l_phdr
>   setup, but before fd is closed).
> - [5/6]: Rebased.
> - [6/6]: First record errors and only report them later. (this fixes
>   various failure handling issues.)
> 
> Szabolcs Nagy (6):
>   aarch64: Fix missing BTI protection from dependencies [BZ #26926]
>   elf: lose is closely tied to _dl_map_object_from_fd
>   elf: Fix failure handling in _dl_map_object_from_fd
>   elf: Move note processing after l_phdr is updated
>   elf: Pass the fd to note processing
>   aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]
> 
>  elf/dl-load.c  | 110 -
>  elf/rtld.c |   4 +-
>  sysdeps/aarch64/dl-bti.c   |  74 ++---
>  sysdeps/aarch64/dl-prop.h  |  14 +++--
>  sysdeps/aarch64/linkmap.h  |   2 +-
>  sysdeps/generic/dl-prop.h  |   6 +-
>  sysdeps/generic/ldsodefs.h |   5 +-
>  sysdeps/x86/dl-prop.h  |   6 +-
>  8 files changed, 135 insertions(+), 86 deletions(-)
> 
> -- 
> 2.17.1
> 


[PATCH v2 6/6] aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]

2020-11-27 Thread Szabolcs Nagy
Re-mmap executable segments if possible instead of using mprotect
to add PROT_BTI. This allows using BTI protection with security
policies that prevent mprotect with PROT_EXEC.

If the fd of the ELF module is not available because it was kernel
mapped then mprotect is used and failures are ignored.  To protect
the main executable even when mprotect is filtered the linux kernel
will have to be changed to add PROT_BTI to it.

Computing the mapping bounds follows _dl_map_object_from_fd more
closely now.

The delayed failure reporting is mainly needed because currently
_dl_process_gnu_properties does not propagate failures such that
the required cleanups happen. Using the link_map_machine struct for
error propagation is not ideal, but this seemed to be the least
intrusive solution.

Fixes bug 26831.
---
 sysdeps/aarch64/dl-bti.c  | 67 +--
 sysdeps/aarch64/dl-prop.h |  8 -
 sysdeps/aarch64/linkmap.h |  2 +-
 3 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 8f4728adce..34b5294f92 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -19,39 +19,62 @@
 #include <errno.h>
 #include <libintl.h>
 #include <ldsodefs.h>
+#include <sys/mman.h>
 
-static int
-enable_bti (struct link_map *map, const char *program)
+/* See elf/dl-load.h.  */
+#ifndef MAP_COPY
+# define MAP_COPY (MAP_PRIVATE | MAP_DENYWRITE)
+#endif
+
+/* Enable BTI protection for MAP.  */
+
+void
+_dl_bti_protect (struct link_map *map, int fd)
 {
+  const size_t pagesz = GLRO(dl_pagesize);
   const ElfW(Phdr) *phdr;
-  unsigned prot;
 
   for (phdr = map->l_phdr; phdr < &map->l_phdr[map->l_phnum]; ++phdr)
     if (phdr->p_type == PT_LOAD && (phdr->p_flags & PF_X))
       {
-	void *start = (void *) (phdr->p_vaddr + map->l_addr);
-	size_t len = phdr->p_memsz;
+	size_t vstart = ALIGN_DOWN (phdr->p_vaddr, pagesz);
+	size_t vend = ALIGN_UP (phdr->p_vaddr + phdr->p_filesz, pagesz);
+	off_t off = ALIGN_DOWN (phdr->p_offset, pagesz);
+	void *start = (void *) (vstart + map->l_addr);
+	size_t len = vend - vstart;
 
-	prot = PROT_EXEC | PROT_BTI;
+	/* Add PROT_BTI.  */
+	unsigned prot = PROT_EXEC | PROT_BTI;
 	if (phdr->p_flags & PF_R)
 	  prot |= PROT_READ;
 	if (phdr->p_flags & PF_W)
 	  prot |= PROT_WRITE;
 
-	if (__mprotect (start, len, prot) < 0)
-	  {
-	    if (program)
-	      _dl_fatal_printf ("%s: mprotect failed to turn on BTI\n",
-				map->l_name);
-	    else
-	      _dl_signal_error (errno, map->l_name, "dlopen",
-				N_("mprotect failed to turn on BTI"));
-	  }
+	if (fd == -1)
+	  /* Ignore failures for kernel mapped binaries.  */
+	  __mprotect (start, len, prot);
+	else
+	  map->l_mach.bti_fail = __mmap (start, len, prot,
+					 MAP_FIXED|MAP_COPY|MAP_FILE,
+					 fd, off) == MAP_FAILED;
       }
-  return 0;
 }
 
-/* Enable BTI for MAP and its dependencies.  */
+
+static void
+bti_failed (struct link_map *l, const char *program)
+{
+  if (program)
+    _dl_fatal_printf ("%s: %s: failed to turn on BTI protection\n",
+		      program, l->l_name);
+  else
+    /* Note: the errno value is not available any more.  */
+    _dl_signal_error (0, l->l_name, "dlopen",
+		      N_("failed to turn on BTI protection"));
+}
+
+
+/* Report BTI protection failures for MAP and its dependencies.  */
 
 void
 _dl_bti_check (struct link_map *map, const char *program)
@@ -59,16 +82,14 @@ _dl_bti_check (struct link_map *map, const char *program)
   if (!GLRO(dl_aarch64_cpu_features).bti)
     return;
 
-  if (map->l_mach.bti)
-    enable_bti (map, program);
+  if (map->l_mach.bti_fail)
+    bti_failed (map, program);
 
   unsigned int i = map->l_searchlist.r_nlist;
   while (i-- > 0)
     {
       struct link_map *l = map->l_initfini[i];
-      if (l->l_init_called)
-	continue;
-      if (l->l_mach.bti)
-	enable_bti (l, program);
+      if (l->l_mach.bti_fail)
+	bti_failed (l, program);
     }
 }
diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index 2016d1472e..e926e54984 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -19,6 +19,8 @@
 #ifndef _DL_PROP_H
 #define _DL_PROP_H
 
+extern void _dl_bti_protect (struct link_map *, int) attribute_hidden;
+
 extern void _dl_bti_check (struct link_map *, const char *)
 attribute_hidden;
 
@@ -43,6 +45,10 @@ static inline int
 _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
  uint32_t datasz, void *data)
 {
+  if (!GLRO(dl_aarch64_cpu_features).bti)
+    /* Skip note processing.  */
+    return 0;
+
   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND)
 {
   /* Stop if the property note is ill-formed.  */
@@ -51,7 +57,7 @@ 

[PATCH v2 5/6] elf: Pass the fd to note processing

2020-11-27 Thread Szabolcs Nagy
To handle GNU property notes on aarch64 some segments need to
be mmaped again, so the fd of the loaded ELF module is needed.

When the fd is not available (kernel loaded modules), then -1
is passed.

The fd is passed to both _dl_process_pt_gnu_property and
_dl_process_pt_note for consistency. Target specific note
processing functions are updated accordingly.
---
 elf/dl-load.c  | 12 +++-
 elf/rtld.c |  4 ++--
 sysdeps/aarch64/dl-prop.h  |  6 +++---
 sysdeps/generic/dl-prop.h  |  6 +++---
 sysdeps/generic/ldsodefs.h |  5 +++--
 sysdeps/x86/dl-prop.h  |  6 +++---
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index b0d65f32cc..74039f22a6 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -837,10 +837,12 @@ _dl_init_paths (const char *llp, const char *source)
 
 /* Process PT_GNU_PROPERTY program header PH in module L after
    PT_LOAD segments are mapped.  Only one NT_GNU_PROPERTY_TYPE_0
-   note is handled which contains processor specific properties.  */
+   note is handled which contains processor specific properties.
+   FD is -1 for the kernel mapped main executable otherwise it is
+   the fd used for loading module L.  */
 
 void
-_dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_gnu_property (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
   const ElfW(Nhdr) *note = (const void *) (ph->p_vaddr + l->l_addr);
   const ElfW(Addr) size = ph->p_memsz;
@@ -887,7 +889,7 @@ _dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph)
  last_type = type;
 
  /* Target specific property processing.  */
- if (_dl_process_gnu_property (l, type, datasz, ptr) == 0)
+ if (_dl_process_gnu_property (l, fd, type, datasz, ptr) == 0)
return;
 
  /* Check the next property item.  */
@@ -1379,10 +1381,10 @@ cannot enable executable stack as shared object requires");
 switch (ph[-1].p_type)
   {
   case PT_NOTE:
-	_dl_process_pt_note (l, &ph[-1]);
+	_dl_process_pt_note (l, fd, &ph[-1]);
 	break;
       case PT_GNU_PROPERTY:
-	_dl_process_pt_gnu_property (l, &ph[-1]);
+	_dl_process_pt_gnu_property (l, fd, &ph[-1]);
 	break;
   }
 
diff --git a/elf/rtld.c b/elf/rtld.c
index c4ffc8d4b7..ec62567580 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -1540,10 +1540,10 @@ dl_main (const ElfW(Phdr) *phdr,
 switch (ph[-1].p_type)
   {
   case PT_NOTE:
-	_dl_process_pt_note (main_map, &ph[-1]);
+	_dl_process_pt_note (main_map, -1, &ph[-1]);
 	break;
       case PT_GNU_PROPERTY:
-	_dl_process_pt_gnu_property (main_map, &ph[-1]);
+	_dl_process_pt_gnu_property (main_map, -1, &ph[-1]);
 	break;
   }
 
diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index b0785bda83..2016d1472e 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -35,13 +35,13 @@ _dl_open_check (struct link_map *m)
 }
 
 static inline void __attribute__ ((always_inline))
-_dl_process_pt_note (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_note (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
 }
 
 static inline int
-_dl_process_gnu_property (struct link_map *l, uint32_t type, uint32_t datasz,
-			  void *data)
+_dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
+			  uint32_t datasz, void *data)
 {
   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND)
 {
diff --git a/sysdeps/generic/dl-prop.h b/sysdeps/generic/dl-prop.h
index f1cf576fe3..df27ff8e6a 100644
--- a/sysdeps/generic/dl-prop.h
+++ b/sysdeps/generic/dl-prop.h
@@ -37,15 +37,15 @@ _dl_open_check (struct link_map *m)
 }
 
 static inline void __attribute__ ((always_inline))
-_dl_process_pt_note (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_note (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
 }
 
 /* Called for each property in the NT_GNU_PROPERTY_TYPE_0 note of L,
processing of the properties continues until this returns 0.  */
 static inline int __attribute__ ((always_inline))
-_dl_process_gnu_property (struct link_map *l, uint32_t type, uint32_t datasz,
-			  void *data)
+_dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
+			  uint32_t datasz, void *data)
 {
   return 0;
 }
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index b1da03cafe..89eab4719d 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -933,8 +933,9 @@ extern void _dl_rtld_di_serinfo (struct link_map *loader,
 Dl_serinfo *si, bool counting);
 
 /* Process PT_GNU_PROPERTY program header PH in module L after
-   PT_LOAD segments are mapped.  */
-void _dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph);
+   PT_LOAD segments are mapped from file FD.  */
+void _dl_process_pt_gnu_property (struct link_map *l, int fd,
+				  const ElfW(Phdr) *ph);

[PATCH v2 4/6] elf: Move note processing after l_phdr is updated

2020-11-27 Thread Szabolcs Nagy
Program headers are processed in two passes: after the first pass
load segments are mmapped, so in the second pass target specific
note processing logic can access the notes.

The second pass is moved later so that various link_map fields
useful for note processing, such as l_phdr, are already set up.
The second pass still happens before the fd is closed so that it
remains available.
---
 elf/dl-load.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 9c71b7562c..b0d65f32cc 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -1268,21 +1268,6 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	l->l_map_start = l->l_map_end = 0;
 	goto call_lose;
       }
-
-    /* Process program headers again after load segments are mapped in
-       case processing requires accessing those segments.  Scan program
-       headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
-       exits.  */
-    for (ph = &phdr[l->l_phnum]; ph != phdr; --ph)
-      switch (ph[-1].p_type)
-	{
-	case PT_NOTE:
-	  _dl_process_pt_note (l, &ph[-1]);
-	  break;
-	case PT_GNU_PROPERTY:
-	  _dl_process_pt_gnu_property (l, &ph[-1]);
-	  break;
-	}
   }
 
   if (l->l_ld == 0)
@@ -1386,6 +1371,21 @@ cannot enable executable stack as shared object requires");
   if (l->l_tls_initimage != NULL)
 l->l_tls_initimage = (char *) l->l_tls_initimage + l->l_addr;
 
+  /* Process program headers again after load segments are mapped in
+     case processing requires accessing those segments.  Scan program
+     headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
+     exits.  */
+  for (ph = &l->l_phdr[l->l_phnum]; ph != l->l_phdr; --ph)
+    switch (ph[-1].p_type)
+      {
+      case PT_NOTE:
+	_dl_process_pt_note (l, &ph[-1]);
+	break;
+      case PT_GNU_PROPERTY:
+	_dl_process_pt_gnu_property (l, &ph[-1]);
+	break;
+      }
+
   /* We are done mapping in the file.  We no longer need the descriptor.  */
   if (__glibc_unlikely (__close_nocancel (fd) != 0))
 {
-- 
2.17.1



[PATCH v2 3/6] elf: Fix failure handling in _dl_map_object_from_fd

2020-11-27 Thread Szabolcs Nagy
There are many failure paths that call lose to do local cleanup
in _dl_map_object_from_fd, but lose did not clean up everything.

Handle l_phdr, l_libname and mapped segments in the common failure
handling code.

There are still various bits that may not be cleaned up properly on
failure (e.g. executable stack, tlsid, incomplete dl_map_segments).
---
 elf/dl-load.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 21e55deb19..9c71b7562c 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -914,8 +914,15 @@ lose (int code, int fd, const char *name, char *realname, struct link_map *l,
   /* The file might already be closed.  */
   if (fd != -1)
 (void) __close_nocancel (fd);
+  if (l != NULL && l->l_map_start != 0)
+_dl_unmap_segments (l);
   if (l != NULL && l->l_origin != (char *) -1l)
 free ((char *) l->l_origin);
+  if (l != NULL && !l->l_libname->dont_free)
+free (l->l_libname);
+  if (l != NULL && l->l_phdr_allocated)
+free ((void *) l->l_phdr);
+
   free (l);
   free (realname);
 
@@ -1256,7 +1263,11 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 errstring = _dl_map_segments (l, fd, header, type, loadcmds, nloadcmds,
  maplength, has_holes, loader);
 if (__glibc_unlikely (errstring != NULL))
-  goto call_lose;
+  {
+   /* Mappings can be in an inconsistent state: avoid unmap.  */
+   l->l_map_start = l->l_map_end = 0;
+   goto call_lose;
+  }
 
 /* Process program headers again after load segments are mapped in
case processing requires accessing those segments.  Scan program
@@ -1294,14 +1305,6 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
   || (__glibc_unlikely (l->l_flags_1 & DF_1_PIE)
  && __glibc_unlikely ((mode & __RTLD_OPENEXEC) == 0)))
 {
-  /* We are not supposed to load this object.  Free all resources.  */
-  _dl_unmap_segments (l);
-
-  if (!l->l_libname->dont_free)
-   free (l->l_libname);
-
-  if (l->l_phdr_allocated)
-   free ((void *) l->l_phdr);
 
   if (l->l_flags_1 & DF_1_PIE)
errstring
@@ -1392,6 +1395,9 @@ cannot enable executable stack as shared object requires");
   /* Signal that we closed the file.  */
   fd = -1;
 
+  /* Failures before this point are handled locally via lose.
+ No more failures are allowed in this function until return.  */
+
   /* If this is ET_EXEC, we should have loaded it as lt_executable.  */
   assert (type != ET_EXEC || l->l_type == lt_executable);
 
-- 
2.17.1



[PATCH v2 1/6] aarch64: Fix missing BTI protection from dependencies [BZ #26926]

2020-11-27 Thread Szabolcs Nagy
The _dl_open_check and _rtld_main_check hooks are not called on the
dependencies of a loaded module, so BTI protection was missed on
every module other than the main executable and directly dlopened
libraries.

The fix just iterates over dependencies to enable BTI.

Fixes bug 26926.
---
 sysdeps/aarch64/dl-bti.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 196e462520..8f4728adce 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -51,11 +51,24 @@ enable_bti (struct link_map *map, const char *program)
   return 0;
 }
 
-/* Enable BTI for L if required.  */
+/* Enable BTI for MAP and its dependencies.  */
 
 void
-_dl_bti_check (struct link_map *l, const char *program)
+_dl_bti_check (struct link_map *map, const char *program)
 {
-  if (GLRO(dl_aarch64_cpu_features).bti && l->l_mach.bti)
-enable_bti (l, program);
+  if (!GLRO(dl_aarch64_cpu_features).bti)
+return;
+
+  if (map->l_mach.bti)
+enable_bti (map, program);
+
+  unsigned int i = map->l_searchlist.r_nlist;
+  while (i-- > 0)
+{
+  struct link_map *l = map->l_initfini[i];
+  if (l->l_init_called)
+   continue;
+  if (l->l_mach.bti)
+   enable_bti (l, program);
+}
 }
-- 
2.17.1



[PATCH v2 2/6] elf: lose is closely tied to _dl_map_object_from_fd

2020-11-27 Thread Szabolcs Nagy
Simple refactoring to keep failure handling next to
_dl_map_object_from_fd.
---
 elf/dl-load.c | 48 
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index f3201e7c14..21e55deb19 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -835,30 +835,6 @@ _dl_init_paths (const char *llp, const char *source)
 }
 
 
-static void
-__attribute__ ((noreturn, noinline))
-lose (int code, int fd, const char *name, char *realname, struct link_map *l,
-  const char *msg, struct r_debug *r, Lmid_t nsid)
-{
-  /* The file might already be closed.  */
-  if (fd != -1)
-(void) __close_nocancel (fd);
-  if (l != NULL && l->l_origin != (char *) -1l)
-free ((char *) l->l_origin);
-  free (l);
-  free (realname);
-
-  if (r != NULL)
-{
-  r->r_state = RT_CONSISTENT;
-  _dl_debug_state ();
-  LIBC_PROBE (map_failed, 2, nsid, r);
-}
-
-  _dl_signal_error (code, name, NULL, msg);
-}
-
-
 /* Process PT_GNU_PROPERTY program header PH in module L after
PT_LOAD segments are mapped.  Only one NT_GNU_PROPERTY_TYPE_0
note is handled which contains processor specific properties.  */
@@ -930,6 +906,30 @@ _dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph)
 }
 
 
+static void
+__attribute__ ((noreturn, noinline))
+lose (int code, int fd, const char *name, char *realname, struct link_map *l,
+  const char *msg, struct r_debug *r, Lmid_t nsid)
+{
+  /* The file might already be closed.  */
+  if (fd != -1)
+(void) __close_nocancel (fd);
+  if (l != NULL && l->l_origin != (char *) -1l)
+free ((char *) l->l_origin);
+  free (l);
+  free (realname);
+
+  if (r != NULL)
+{
+  r->r_state = RT_CONSISTENT;
+  _dl_debug_state ();
+  LIBC_PROBE (map_failed, 2, nsid, r);
+}
+
+  _dl_signal_error (code, name, NULL, msg);
+}
+
+
 /* Map in the shared object NAME, actually located in REALNAME, and already
opened on FD.  */
 
-- 
2.17.1



[PATCH v2 0/6] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-27 Thread Szabolcs Nagy
This is v2 of
https://sourceware.org/pipermail/libc-alpha/2020-November/119305.html

To enable BTI support, re-mmap executable segments instead of
mprotecting them in case mprotect is seccomp filtered.

I would like linux to change to map the main exe with PROT_BTI when
that is marked as BTI compatible. From the linux side i heard the
following concerns about this:
- it's an ABI change so requires some ABI bump. (this is fine with
  me, i think glibc does not care about backward compat as nothing
  can reasonably rely on the current behaviour, but if we have a
  new bit in auxv or similar then we can save one mprotect call.)
- in case we discover compatibility issues with user binaries it's
  better if userspace can easily disable BTI (e.g. removing the
  mprotect based on some env var, but if kernel adds PROT_BTI and
  mprotect is filtered then we have no reliable way to remove that
  from executables. this problem already exists for static linked
  exes, although admittedly those are less of a compat concern.)
- ideally PROT_BTI would be added via a new syscall that does not
  interfere with PROT_EXEC filtering. (this does not conflict with
  the current patches: even with a new syscall we need a fallback.)
- solve it in systemd (e.g. turn off the filter, use better filter):
  i would prefer not to have aarch64 (or BTI) specific policy in
  user code. and there was no satisfying way to do this portably.

Other concerns about the approach:
- mmap is more expensive than mprotect: in my measurements using
  mmap instead of mprotect is 3-8x slower (and after mmap pages
  have to be faulted in again), but e.g. the exec time of a program
  with 4 deps only increases by < 8% due to the 4 new mmaps. (the
  kernel side resource usage may increase too, i didnt look at that.)
- _dl_signal_error is not valid from the _dl_process_gnu_property
  hook. The v2 set addresses this problem: i could either propagate
  the errors up until they can be handled or solve it in the aarch64
  backend by first recording failures and then dealing with them in
  _dl_open_check. I choose the latter, but did some refactorings in
  _dl_map_object_from_fd that makes the former possible too.

v2:
- [1/6]: New patch that fixes a missed BTI bug found during v2.
- [2-3/6]: New, _dl_map_object_from_fd failure handling improvements,
  these are independent of the rest of the series.
- [4/6]: Move the note handling to a different place (after l_phdr
  setup, but before fd is closed).
- [5/6]: Rebased.
- [6/6]: First record errors and only report them later. (this fixes
  various failure handling issues.)

Szabolcs Nagy (6):
  aarch64: Fix missing BTI protection from dependencies [BZ #26926]
  elf: lose is closely tied to _dl_map_object_from_fd
  elf: Fix failure handling in _dl_map_object_from_fd
  elf: Move note processing after l_phdr is updated
  elf: Pass the fd to note processing
  aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]

 elf/dl-load.c  | 110 -
 elf/rtld.c |   4 +-
 sysdeps/aarch64/dl-bti.c   |  74 ++---
 sysdeps/aarch64/dl-prop.h  |  14 +++--
 sysdeps/aarch64/linkmap.h  |   2 +-
 sysdeps/generic/dl-prop.h  |   6 +-
 sysdeps/generic/ldsodefs.h |   5 +-
 sysdeps/x86/dl-prop.h  |   6 +-
 8 files changed, 135 insertions(+), 86 deletions(-)

-- 
2.17.1



Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-05 Thread Szabolcs Nagy
The 11/04/2020 09:20, Will Deacon wrote:
> On Tue, Nov 03, 2020 at 05:34:38PM +, Mark Brown wrote:
> > On Tue, Nov 03, 2020 at 10:25:37AM +, Szabolcs Nagy wrote:
> > 
> > > Re-mmap executable segments instead of mprotecting them in
> > > case mprotect is seccomp filtered.
> > 
> > > For the kernel mapped main executable we don't have the fd
> > > for re-mmap so linux needs to be updated to add BTI. (In the
> > > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > > cannot change BTI protection at runtime based on user space
> > > policy so it is better if the kernel maps BTI compatible
> > > binaries with PROT_BTI by default.)
> > 
> > Given that there were still some ongoing discussions on a more robust
> > kernel interface here and there seem to be a few concerns with this
> > series should we perhaps just take a step back and disable this seccomp
> > filter in systemd on arm64, at least for the time being?  That seems
> > safer than rolling out things that set ABI quickly, a big part of the
> > reason we went with having the dynamic linker enable PROT_BTI in the
> > first place was to give us more flexibility to handle any unforseen
> > consequences of enabling BTI that we run into.  We are going to have
> > similar issues with other features like MTE so we need to make sure that
> > whatever we're doing works with them too.
> > 
> > Also updated to Will's current e-mail address - Will, do you have
> > thoughts on what we should do here?
> 
> Changing the kernel to map the main executable with PROT_BTI by default is a
> user-visible change in behaviour and not without risk, so if we're going to
> do that then it needs to be opt-in because the current behaviour has been
> there since 5.8. I suppose we could shoe-horn in a cmdline option for 5.10
> (which will be the first LTS with BTI) but it would be better to put up with
> the current ABI if possible.

it's not clear to me how adding PROT_BTI in
the kernel would be observable in practice.

adding PROT_BTI to marked elf modules should
only have effect in cases when the process does
invalid operations and then there is no compat
requirement. if this is not the case then adding
PROT_BTI on static exe is already problematic.

if there is some issue with bti that makes
users want to turn it off, then they should do
it system wide or may be we can have a no-bti
option in userspace which uses mprotect to turn
it off (but since that has to be an explicit
opt-out i don't mind if the user also has to
disable the seccomp sandbox).

> Is there real value in this seccomp filter if it only looks at mprotect(),
> or was it just implemented because it's easy to do and sounds like a good
> idea?

i'm fine with just using mprotect and telling
users to remove the seccomp filter. but that
makes bti less attractive for deployment.



Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-04 Thread Szabolcs Nagy
The 11/04/2020 17:19, Topi Miettinen wrote:
> On 4.11.2020 16.35, Catalin Marinas wrote:
> > On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> > > On 4.11.2020 11.29, Florian Weimer wrote:
> > > > * Will Deacon:
> > > > > Is there real value in this seccomp filter if it only looks at 
> > > > > mprotect(),
> > > > > or was it just implemented because it's easy to do and sounds like a 
> > > > > good
> > > > > idea?
> > > > 
> > > > It seems bogus to me.  Everyone will just create alias mappings instead,
> > > > just like they did for the similar SELinux feature.  See “Example code
> > > > to avoid execmem violations” in:
> > > > 
> > > > 
> > [...]
> > > > As you can see, this reference implementation creates a PROT_WRITE
> > > > mapping aliased to a PROT_EXEC mapping, so it actually reduces security
> > > > compared to something that generates the code in an anonymous mapping
> > > > and calls mprotect to make it executable.
> > [...]
> > > If a service legitimately needs executable and writable mappings (due to
> > > JIT, trampolines etc), it's easy to disable the filter whenever really
> > > needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
> > > systemd or a TE rule like "allow type_t self:process { execmem };" for
> > > SELinux. But this shouldn't be the default case, since there are many
> > > services which don't need W
> > 
> > I think Drepper's point is that separate X and W mappings, with enough
> > randomisation, would be more secure than allowing W at the same
> > address (but, of course, less secure than not having W at all, though
> > that's not always possible).
> > 
> > > I'd also question what is the value of BTI if it can be easily 
> > > circumvented
> > > by removing PROT_BTI with mprotect()?
> > 
> > Well, BTI is a protection against JOP attacks. The assumption here is
> > that an attacker cannot invoke mprotect() to disable PROT_BTI. If it
> > can, it's probably not worth bothering with a subsequent JOP attack, it
> > can already call functions directly.
> 
> I suppose that the target for the attacker is to eventually perform system
> calls rather than looping forever in JOP/ROP gadgets.
> 
> > I see MDWX not as a way of detecting attacks but rather plugging
> > inadvertent security holes in certain programs. On arm64, such hardening
> > currently gets in the way of another hardening feature, BTI.
> 
> I don't think it has to get in the way at all. Why wouldn't something simple
> like this work:

PROT_BTI is only valid on binaries that are BTI compatible.
to detect that, the load segments must be already mapped.

AT_BTI does not solve this: we want to be able to load legacy
elf modules. (a BTI enforcement setting may be useful where
incompatible modules are rejected, but that cannot be the
default for backward compatibility reasons.)

> 
> diff --git a/elf/dl-load.c b/elf/dl-load.c
> index 646c5dca40..12a74d15e8 100644
> --- a/elf/dl-load.c
> +++ b/elf/dl-load.c
> @@ -1170,8 +1170,13 @@ _dl_map_object_from_fd (const char *name, const char
> *origname, int fd,
> c->prot |= PROT_READ;
>   if (ph->p_flags & PF_W)
> c->prot |= PROT_WRITE;
> - if (ph->p_flags & PF_X)
> + if (ph->p_flags & PF_X) {
> c->prot |= PROT_EXEC;
> +#ifdef PROT_BTI
> +   if (GLRO(dl_bti) & 1)
> + c->prot |= PROT_BTI;
> +#endif
> + }
>  #endif
>   break;
> 
> diff --git a/elf/dl-support.c b/elf/dl-support.c
> index 7704c101c5..22c7cc7b81 100644
> --- a/elf/dl-support.c
> +++ b/elf/dl-support.c
> @@ -222,7 +222,7 @@ __rtld_lock_define_initialized_recursive (,
> _dl_load_write_lock)
> 
> 
>  #ifdef HAVE_AUX_VECTOR
> -int _dl_clktck;
> +int _dl_clktck, _dl_bti;
> 
>  void
>  _dl_aux_init (ElfW(auxv_t) *av)
> @@ -294,6 +294,11 @@ _dl_aux_init (ElfW(auxv_t) *av)
>case AT_RANDOM:
> _dl_random = (void *) av->a_un.a_val;
> break;
> +#ifdef PROT_BTI
> +  case AT_BTI:
> _dl_bti = av->a_un.a_val;
> +   break;
> +#endif
>DL_PLATFORM_AUXV
>}
>if (seen == 0xf)
> 
> Kernel sets the aux vector to indicate that BTI should be enabled for all
> segments and main exe is already protected.
> 
> -Topi


Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-04 Thread Szabolcs Nagy
The 11/03/2020 23:41, Jeremy Linton wrote:
> On 11/3/20 11:34 AM, Mark Brown wrote:
> > On Tue, Nov 03, 2020 at 10:25:37AM +, Szabolcs Nagy wrote:
> > 
> > > Re-mmap executable segments instead of mprotecting them in
> > > case mprotect is seccomp filtered.
> > 
> > > For the kernel mapped main executable we don't have the fd
> > > for re-mmap so linux needs to be updated to add BTI. (In the
> > > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > > cannot change BTI protection at runtime based on user space
> > > policy so it is better if the kernel maps BTI compatible
> > > binaries with PROT_BTI by default.)
> > 
> > Given that there were still some ongoing discussions on a more robust
> > kernel interface here and there seem to be a few concerns with this
> > series should we perhaps just take a step back and disable this seccomp
> > filter in systemd on arm64, at least for the time being?  That seems
> > safer than rolling out things that set ABI quickly, a big part of the
> 
> So, that's a bigger hammer than I think is needed and punishes !BTI
> machines. I'm going to suggest that if we need to carry a temp patch its
> more like the glibc patch I mentioned in the Fedora defect. That patch
> simply logs a message, on the mprotect failures rather than aborting. Its
> fairly non-intrusive.
> 
> That leaves seccomp functional, and BTI generally functional except when
> seccomp is restricting it. I've also been asked that if a patch like that is
> needed, its (temporary?) merged to the glibc trunk, rather than just being
> carried by the distro's.

note that changing mprotect into mmap in glibc works
even if the kernel or systemd decides to do things
differently: currently the only wart is that on the
main exe we have to use mprotect and silently ignore
the failures. (but e.g. some return to libc attacks
are reliably prevented since libc.so guaranteed to
have PROT_BTI this way.)

the patchset is a bit more intrusive than i hoped
for, so if we expect to get a new syscall then
simply ignoring mprotect failures may be a better
temporary solution. (and i still need to do
benchmarks to see if the cost of re-mmap is not
much more than the cost of mprotect.)

changing the seccomp filter in systemd does not
help if there are other seccomp filters elsewhere
doing the same. i think we will have to avoid using
mprotect(PROT_EXEC) or at least ignore the failure.

> > reason we went with having the dynamic linker enable PROT_BTI in the
> > first place was to give us more flexibility to handle any unforseen
> > consequences of enabling BTI that we run into.  We are going to have
> > similar issues with other features like MTE so we need to make sure that
> > whatever we're doing works with them too.
> > 
> > Also updated to Will's current e-mail address - Will, do you have
> > thoughts on what we should do here?
> > 


Re: [PATCH 2/4] elf: Move note processing after l_phdr is updated [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
The 11/03/2020 04:36, H.J. Lu wrote:
> On Tue, Nov 3, 2020 at 2:38 AM Florian Weimer  wrote:
> > * Szabolcs Nagy:
> >
> > > Program headers are processed in two pass: after the first pass
> > > load segments are mmapped so in the second pass target specific
> > > note processing logic can access the notes.
> > >
> > > The second pass is moved later so various link_map fields are
> > > set up that may be useful for note processing such as l_phdr.
> > > ---
> > >  elf/dl-load.c | 30 +++---
> > >  1 file changed, 15 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/elf/dl-load.c b/elf/dl-load.c
> > > index ceaab7f18e..673cf960a0 100644
> > > --- a/elf/dl-load.c
> > > +++ b/elf/dl-load.c
> > > @@ -1259,21 +1259,6 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
> > > maplength, has_holes, loader);
> > >  if (__glibc_unlikely (errstring != NULL))
> > >goto call_lose;
> > > -
> > > -/* Process program headers again after load segments are mapped in
> > > -   case processing requires accessing those segments.  Scan program
> > > -   headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
> > > -   exits.  */
> > > -for (ph = &phdr[l->l_phnum]; ph != phdr; --ph)
> > > -  switch (ph[-1].p_type)
> > > - {
> > > - case PT_NOTE:
> > > -   _dl_process_pt_note (l, fd, &ph[-1]);
> > > -   break;
> > > - case PT_GNU_PROPERTY:
> > > -   _dl_process_pt_gnu_property (l, fd, &ph[-1]);
> > > -   break;
> > > - }
> > >}
> > >
> > >if (l->l_ld == 0)
> > > @@ -1481,6 +1466,21 @@ cannot enable executable stack as shared object requires");
> > >  /* Assign the next available module ID.  */
> > >  l->l_tls_modid = _dl_next_tls_modid ();
> > >
> > > +  /* Process program headers again after load segments are mapped in
> > > + case processing requires accessing those segments.  Scan program
> > > + headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
> > > + exits.  */
> > > +  for (ph = &l->l_phdr[l->l_phnum]; ph != l->l_phdr; --ph)
> > > +switch (ph[-1].p_type)
> > > +  {
> > > +  case PT_NOTE:
> > > + _dl_process_pt_note (l, fd, &ph[-1]);
> > > + break;
> > > +  case PT_GNU_PROPERTY:
> > > + _dl_process_pt_gnu_property (l, fd, &ph[-1]);
> > > + break;
> > > +  }
> > > +
> > >  #ifdef DL_AFTER_LOAD
> > >DL_AFTER_LOAD (l);
> > >  #endif
> >
> > Is this still compatible with the CET requirements?
> >
> > I hope it is because the CET magic happens in _dl_open_check, so after
> > the the code in elf/dl-load.c has run.

i believe the note processing and later cet magic
are not affected by this code move.

but i did not test this with cet.

> 
> _dl_process_pt_note and _dl_process_pt_gnu_property may call
> _dl_signal_error.  Are we prepared to clean more things up when it
> happens?  I am investigating:

yeah, this is difficult to reason about.

it seems to me that after _dl_map_object returns there
may be _dl_map_object_deps which can fail in a way that
all of dlopen has to be rolled back, so if i move things
around in _dl_map_object that should not introduce new
issues.

but it is not clear to me how robust the dlopen code is
against arbitrary failure in dl_open_worker.

> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=26825
> 
> I don't think cleanup of _dl_process_pt_gnu_property failure is done
> properly.
> 
> -- 
> H.J.

-- 


[PATCH 4/4] aarch64: Remove the bti link_map field [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
The bti link_map field is no longer necessary because PROT_BTI
is applied at note processing time immediately instead of in
_dl_open_check based on the bti field.

This is a separate patch that is not expected to be backported
to avoid changing the link_map layout that is libc internal ABI.
---
 sysdeps/aarch64/dl-prop.h | 5 +
 sysdeps/aarch64/linkmap.h | 1 -
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index 762bc93733..cf14381e4a 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -52,10 +52,7 @@ _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
 
   unsigned int feature_1 = *(unsigned int *) data;
   if (feature_1 & GNU_PROPERTY_AARCH64_FEATURE_1_BTI)
-   {
- l->l_mach.bti = true;  /* No longer needed.  */
- _dl_bti_protect (l, fd);
-   }
+   _dl_bti_protect (l, fd);
 
   /* Stop if we processed the property note.  */
   return 0;
diff --git a/sysdeps/aarch64/linkmap.h b/sysdeps/aarch64/linkmap.h
index 847a03ace2..e921e77495 100644
--- a/sysdeps/aarch64/linkmap.h
+++ b/sysdeps/aarch64/linkmap.h
@@ -22,5 +22,4 @@ struct link_map_machine
 {
   ElfW(Addr) plt;/* Address of .plt */
   void *tlsdesc_table;   /* Address of TLS descriptor hash table.  */
-  bool bti;  /* Branch Target Identification is enabled.  */
 };
-- 
2.17.1



[PATCH 3/4] aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
Re-mmap executable segments if possible instead of using mprotect
to add PROT_BTI. This allows using BTI protection with security
policies that prevent mprotect with PROT_EXEC.

If the fd of the ELF module is not available because it was kernel
mapped then mprotect is used and failures are ignored.  It is
expected that linux kernel will add PROT_BTI when mapping a module
(current linux as of version 5.9 does not do this).

Computing the mapping parameters follows the logic of
_dl_map_object_from_fd more closely now.

Fixes bug 26831.
---
 sysdeps/aarch64/dl-bti.c  | 46 ---
 sysdeps/aarch64/dl-prop.h | 14 +++-
 2 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/sysdeps/aarch64/dl-bti.c b/sysdeps/aarch64/dl-bti.c
index 196e462520..385f1731ca 100644
--- a/sysdeps/aarch64/dl-bti.c
+++ b/sysdeps/aarch64/dl-bti.c
@@ -19,43 +19,45 @@
 #include <errno.h>
 #include <libintl.h>
 #include <link.h>
+#include <sys/mman.h>  /* For MAP_COPY.  */
 
-static int
-enable_bti (struct link_map *map, const char *program)
+/* Enable BTI protection for MAP.  */
+
+void
+_dl_bti_protect (struct link_map *map, int fd)
 {
+  const size_t pagesz = GLRO(dl_pagesize);
   const ElfW(Phdr) *phdr;
-  unsigned prot;
 
for (phdr = map->l_phdr; phdr < &map->l_phdr[map->l_phnum]; ++phdr)
 if (phdr->p_type == PT_LOAD && (phdr->p_flags & PF_X))
   {
-   void *start = (void *) (phdr->p_vaddr + map->l_addr);
-   size_t len = phdr->p_memsz;
-
-   prot = PROT_EXEC | PROT_BTI;
+   size_t vstart = ALIGN_DOWN (phdr->p_vaddr, pagesz);
+   size_t vend = ALIGN_UP (phdr->p_vaddr + phdr->p_filesz, pagesz);
+   off_t off = ALIGN_DOWN (phdr->p_offset, pagesz);
+   void *start = (void *) (vstart + map->l_addr);
+   size_t len = vend - vstart;
+
+   /* Add PROT_BTI.  */
+   unsigned prot = PROT_EXEC | PROT_BTI;
if (phdr->p_flags & PF_R)
  prot |= PROT_READ;
if (phdr->p_flags & PF_W)
  prot |= PROT_WRITE;
 
-   if (__mprotect (start, len, prot) < 0)
+   if (fd == -1)
+ {
+   /* Ignore failures: rely on the kernel adding PROT_BTI then.  */
+   __mprotect (start, len, prot);
+ }
+   else
  {
-   if (program)
- _dl_fatal_printf ("%s: mprotect failed to turn on BTI\n",
-   map->l_name);
-   else
+   void *p = __mmap (start, len, prot, MAP_FIXED|MAP_COPY|MAP_FILE,
+ fd, off);
+   if (p == MAP_FAILED)
  _dl_signal_error (errno, map->l_name, "dlopen",
-   N_("mprotect failed to turn on BTI"));
+   N_("failed to turn on BTI protection"));
  }
   }
   return 0;
 }
-
-/* Enable BTI for L if required.  */
-
-void
-_dl_bti_check (struct link_map *l, const char *program)
-{
-  if (GLRO(dl_aarch64_cpu_features).bti && l->l_mach.bti)
-enable_bti (l, program);
-}
diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index 2016d1472e..762bc93733 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -19,19 +19,16 @@
 #ifndef _DL_PROP_H
 #define _DL_PROP_H
 
-extern void _dl_bti_check (struct link_map *, const char *)
-attribute_hidden;
+extern void _dl_bti_protect (struct link_map *, int) attribute_hidden;
 
 static inline void __attribute__ ((always_inline))
 _rtld_main_check (struct link_map *m, const char *program)
 {
-  _dl_bti_check (m, program);
 }
 
 static inline void __attribute__ ((always_inline))
 _dl_open_check (struct link_map *m)
 {
-  _dl_bti_check (m, NULL);
 }
 
 static inline void __attribute__ ((always_inline))
@@ -43,6 +40,10 @@ static inline int
 _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
  uint32_t datasz, void *data)
 {
+  if (!GLRO(dl_aarch64_cpu_features).bti)
+/* Skip note processing.  */
+return 0;
+
   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND)
 {
   /* Stop if the property note is ill-formed.  */
@@ -51,7 +52,10 @@ _dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
 
   unsigned int feature_1 = *(unsigned int *) data;
   if (feature_1 & GNU_PROPERTY_AARCH64_FEATURE_1_BTI)
-   l->l_mach.bti = true;
+   {
+ l->l_mach.bti = true;  /* No longer needed.  */
+ _dl_bti_protect (l, fd);
+   }
 
   /* Stop if we processed the property note.  */
   return 0;
-- 
2.17.1



[PATCH 2/4] elf: Move note processing after l_phdr is updated [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
Program headers are processed in two passes: after the first pass
load segments are mmapped so that in the second pass target specific
note processing logic can access the notes.

The second pass is moved later so that various link_map fields that
may be useful for note processing, such as l_phdr, are set up.
---
 elf/dl-load.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index ceaab7f18e..673cf960a0 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -1259,21 +1259,6 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
  maplength, has_holes, loader);
 if (__glibc_unlikely (errstring != NULL))
   goto call_lose;
-
-/* Process program headers again after load segments are mapped in
-   case processing requires accessing those segments.  Scan program
-   headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
-   exits.  */
-for (ph = &phdr[l->l_phnum]; ph != phdr; --ph)
-  switch (ph[-1].p_type)
-   {
-   case PT_NOTE:
- _dl_process_pt_note (l, fd, &ph[-1]);
- break;
-   case PT_GNU_PROPERTY:
- _dl_process_pt_gnu_property (l, fd, &ph[-1]);
- break;
-   }
   }
 
   if (l->l_ld == 0)
@@ -1481,6 +1466,21 @@ cannot enable executable stack as shared object requires");
 /* Assign the next available module ID.  */
 l->l_tls_modid = _dl_next_tls_modid ();
 
+  /* Process program headers again after load segments are mapped in
+ case processing requires accessing those segments.  Scan program
+ headers backward so that PT_NOTE can be skipped if PT_GNU_PROPERTY
+ exits.  */
+  for (ph = &l->l_phdr[l->l_phnum]; ph != l->l_phdr; --ph)
+switch (ph[-1].p_type)
+  {
+  case PT_NOTE:
+   _dl_process_pt_note (l, fd, &ph[-1]);
+   break;
+  case PT_GNU_PROPERTY:
+   _dl_process_pt_gnu_property (l, fd, &ph[-1]);
+   break;
+  }
+
 #ifdef DL_AFTER_LOAD
   DL_AFTER_LOAD (l);
 #endif
-- 
2.17.1



[PATCH 1/4] elf: Pass the fd to note processing [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
To handle GNU property notes on aarch64 some segments need to
be mmaped again, so the fd of the loaded ELF module is needed.

When the fd is not available (kernel loaded modules), then -1
is passed.

The fd is passed to both _dl_process_pt_gnu_property and
_dl_process_pt_note for consistency. Target specific note
processing functions are updated accordingly.
---
 elf/dl-load.c  | 12 +++-
 elf/rtld.c |  4 ++--
 sysdeps/aarch64/dl-prop.h  |  6 +++---
 sysdeps/generic/dl-prop.h  |  6 +++---
 sysdeps/generic/ldsodefs.h |  5 +++--
 sysdeps/x86/dl-prop.h  |  6 +++---
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/elf/dl-load.c b/elf/dl-load.c
index f3201e7c14..ceaab7f18e 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -861,10 +861,12 @@ lose (int code, int fd, const char *name, char *realname, struct link_map *l,
 
 /* Process PT_GNU_PROPERTY program header PH in module L after
PT_LOAD segments are mapped.  Only one NT_GNU_PROPERTY_TYPE_0
-   note is handled which contains processor specific properties.  */
+   note is handled which contains processor specific properties.
+   FD is -1 for the kernel mapped main executable otherwise it is
+   the fd used for loading module L.  */
 
 void
-_dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_gnu_property (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
   const ElfW(Nhdr) *note = (const void *) (ph->p_vaddr + l->l_addr);
   const ElfW(Addr) size = ph->p_memsz;
@@ -911,7 +913,7 @@ _dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph)
  last_type = type;
 
  /* Target specific property processing.  */
- if (_dl_process_gnu_property (l, type, datasz, ptr) == 0)
+ if (_dl_process_gnu_property (l, fd, type, datasz, ptr) == 0)
return;
 
  /* Check the next property item.  */
@@ -1266,10 +1268,10 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
   switch (ph[-1].p_type)
{
case PT_NOTE:
- _dl_process_pt_note (l, &ph[-1]);
+ _dl_process_pt_note (l, fd, &ph[-1]);
  break;
case PT_GNU_PROPERTY:
- _dl_process_pt_gnu_property (l, &ph[-1]);
+ _dl_process_pt_gnu_property (l, fd, &ph[-1]);
  break;
}
   }
diff --git a/elf/rtld.c b/elf/rtld.c
index 5d117d0d2c..6ba918338b 100644
--- a/elf/rtld.c
+++ b/elf/rtld.c
@@ -1531,10 +1531,10 @@ dl_main (const ElfW(Phdr) *phdr,
 switch (ph[-1].p_type)
   {
   case PT_NOTE:
-   _dl_process_pt_note (main_map, &ph[-1]);
+   _dl_process_pt_note (main_map, -1, &ph[-1]);
break;
   case PT_GNU_PROPERTY:
-   _dl_process_pt_gnu_property (main_map, &ph[-1]);
+   _dl_process_pt_gnu_property (main_map, -1, &ph[-1]);
break;
   }
 
diff --git a/sysdeps/aarch64/dl-prop.h b/sysdeps/aarch64/dl-prop.h
index b0785bda83..2016d1472e 100644
--- a/sysdeps/aarch64/dl-prop.h
+++ b/sysdeps/aarch64/dl-prop.h
@@ -35,13 +35,13 @@ _dl_open_check (struct link_map *m)
 }
 
 static inline void __attribute__ ((always_inline))
-_dl_process_pt_note (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_note (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
 }
 
 static inline int
-_dl_process_gnu_property (struct link_map *l, uint32_t type, uint32_t datasz,
- void *data)
+_dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
+ uint32_t datasz, void *data)
 {
   if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND)
 {
diff --git a/sysdeps/generic/dl-prop.h b/sysdeps/generic/dl-prop.h
index f1cf576fe3..df27ff8e6a 100644
--- a/sysdeps/generic/dl-prop.h
+++ b/sysdeps/generic/dl-prop.h
@@ -37,15 +37,15 @@ _dl_open_check (struct link_map *m)
 }
 
 static inline void __attribute__ ((always_inline))
-_dl_process_pt_note (struct link_map *l, const ElfW(Phdr) *ph)
+_dl_process_pt_note (struct link_map *l, int fd, const ElfW(Phdr) *ph)
 {
 }
 
 /* Called for each property in the NT_GNU_PROPERTY_TYPE_0 note of L,
processing of the properties continues until this returns 0.  */
 static inline int __attribute__ ((always_inline))
-_dl_process_gnu_property (struct link_map *l, uint32_t type, uint32_t datasz,
- void *data)
+_dl_process_gnu_property (struct link_map *l, int fd, uint32_t type,
+ uint32_t datasz, void *data)
 {
   return 0;
 }
diff --git a/sysdeps/generic/ldsodefs.h b/sysdeps/generic/ldsodefs.h
index 382eeb9be0..702cb0f488 100644
--- a/sysdeps/generic/ldsodefs.h
+++ b/sysdeps/generic/ldsodefs.h
@@ -925,8 +925,9 @@ extern void _dl_rtld_di_serinfo (struct link_map *loader,
 Dl_serinfo *si, bool counting);
 
 /* Process PT_GNU_PROPERTY program header PH in module L after
-   PT_LOAD segments are mapped.  */
-void _dl_process_pt_gnu_property (struct link_map *l, const ElfW(Phdr) *ph);
+   PT_LOAD 

[PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

2020-11-03 Thread Szabolcs Nagy
Re-mmap executable segments instead of mprotecting them in
case mprotect is seccomp filtered.

For the kernel mapped main executable we don't have the fd
for re-mmap so linux needs to be updated to add BTI. (In the
presence of seccomp filters for mprotect(PROT_EXEC) the libc
cannot change BTI protection at runtime based on user space
policy so it is better if the kernel maps BTI compatible
binaries with PROT_BTI by default.)

Szabolcs Nagy (4):
  elf: Pass the fd to note processing [BZ #26831]
  elf: Move note processing after l_phdr is updated [BZ #26831]
  aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]
  aarch64: Remove the bti link_map field [BZ #26831]

 elf/dl-load.c  | 38 ---
 elf/rtld.c |  4 ++--
 sysdeps/aarch64/dl-bti.c   | 46 --
 sysdeps/aarch64/dl-prop.h  | 17 +++---
 sysdeps/aarch64/linkmap.h  |  1 -
 sysdeps/generic/dl-prop.h  |  6 ++---
 sysdeps/generic/ldsodefs.h |  5 +++--
 sysdeps/x86/dl-prop.h  |  6 ++---
 8 files changed, 64 insertions(+), 59 deletions(-)

-- 
2.17.1



Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-26 Thread Szabolcs Nagy
The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> Unrolling this discussion a bit, this problem comes from a few sources:
> 
> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> syscall filtering very well.
> 
> 2) The program is trying to do something not expressible through the
> syscall interface: really the intent is to set PROT_BTI on the page,
> with no intent to set PROT_EXEC on any page that didn't already have it
> set.
> 
> 
> This limitation of mprotect() was known when I originally added PROT_BTI,
> but at that time we weren't aware of a clear use case that would fail.
> 
> 
> Would it now help to add something like:
> 
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
>   int ret = -EINVAL;
>   mmap_write_lock(current->mm);
>   if (all vmas in [addr .. addr + len) have
>   their mprotect flags set to old_flags) {
> 
>   ret = mprotect(addr, len, new_flags);
>   }
>   
>   mmap_write_unlock(current->mm);
>   return ret;
> }

if more prot flags are introduced then the exact
match for old_flags may be restrictive and currently
there is no way to query these flags to figure out
how to toggle one prot flag in a future proof way,
so i don't think this solves the issue completely.

i think we might need a new api, given that aarch64
now has PROT_BTI and PROT_MTE while existing code
expects RWX only, but i don't know what api is best.

> libc would now be able to do
> 
>   mchangeprot(addr, len, PROT_EXEC | PROT_READ,
>   PROT_EXEC | PROT_READ | PROT_BTI);
> 
> while systemd's MDWX filter would reject the call if
> 
>   (new_flags & PROT_EXEC) &&
>   (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE)
> 
> 
> 
> This won't magically fix current code, but something along these lines
> might be better going forward.
> 
> 
> Thoughts?
> 
> ---Dave


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-23 Thread Szabolcs Nagy
The 10/22/2020 10:31, Catalin Marinas wrote:
> IIUC, the problem is with the main executable which is mapped by the
> kernel without PROT_BTI. The dynamic loader wants to set PROT_BTI but
> does not have the original file descriptor to be able to remap. Its only
> choice is mprotect() and this fails because of the MDWX policy.
> 
> Not sure whether the kernel has the right information but could it map
> the main executable with PROT_BTI if the corresponding PT_GNU_PROPERTY
> is found? The current ABI states it only sets PROT_BTI for the
> interpreter who'd be responsible for setting the PROT_BTI on the main
> executable. I can't tell whether it would break anything but it's worth
> a try:

i think it would work, but now i can't easily
tell from the libc if i have to do the mprotect
on the main exe or not.

i guess i can just always mprotect and ignore
the failure?

> 
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 4784011cecac..0a08fb9133e8 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -730,14 +730,6 @@ asmlinkage void __sched arm64_preempt_schedule_irq(void)
>  int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
>bool has_interp, bool is_interp)
>  {
> - /*
> -  * For dynamically linked executables the interpreter is
> -  * responsible for setting PROT_BTI on everything except
> -  * itself.
> -  */
> - if (is_interp != has_interp)
> - return prot;
> -
>   if (!(state->flags & ARM64_ELF_BTI))
>   return prot;
>  
> 
> -- 
> Catalin

-- 


Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Szabolcs Nagy
The 10/22/2020 11:17, Topi Miettinen via Libc-alpha wrote:
> On 22.10.2020 10.54, Florian Weimer wrote:
> > * Lennart Poettering:
> > > Did you see Topi's comments on the systemd issue?
> > > 
> > > https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> > > 
> > > I think I agree with this: it's a bit weird to alter the bits after
> > > the fact. Can't glibc set up everything right from the begining? That
> > > would keep both concepts working.
> > 
> > The dynamic loader has to process the LOAD segments to get to the ELF
> > note that says to enable BTI.  Maybe we could do a first pass and load
> > only the segments that cover notes.  But that requires lots of changes
> > to generic code in the loader.
> 
> What if the loader always enabled BTI for PROT_EXEC pages, but then when
> discovering that this was a mistake, mprotect() the pages without BTI? Then
> both BTI and MDWX would work and the penalty of not getting MDWX would fall
> to non-BTI programs. What's the expected proportion of BTI enabled code vs.
> disabled in the future, is it perhaps expected that a distro would enable
> the flag globally so eventually only a few legacy programs might be
> unprotected?

i thought mprotect(PROT_EXEC) would get filtered
with or without bti, is that not the case?

then i guess we can do the protection that way
around, but then i don't see why the filter cannot
treat PROT_EXEC|PROT_BTI the same as PROT_EXEC.



Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Szabolcs Nagy
The 10/22/2020 09:18, Lennart Poettering wrote:
> On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:
> 
> > Hi,
> >
> > There is a problem with glibc+systemd on BTI enabled systems. Systemd
> > has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> > PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> > being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> > caught by the seccomp filter, resulting in service failures.
> >
> > So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> > This is obviously not desirable.
> >
> > Various changes have been suggested, replacing the mprotect with mmap calls
> > having PROT_BTI set on the original mapping, re-mmapping the segments,
> > implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> > and various modification to seccomp to allow particular mprotect cases to
> > bypass the filters. In each case there seems to be an undesirable attribute
> > to the solution.
> >
> > So, whats the best solution?
> 
> Did you see Topi's comments on the systemd issue?
> 
> https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> 
> I think I agree with this: it's a bit weird to alter the bits after
> the fact. Can't glibc set up everything right from the begining? That
> would keep both concepts working.

that's hard to do and does not work for the main exe currently
(which is mmaped by the kernel).

(it's hard to do because to know that the elf module requires
bti the PT_GNU_PROPERTY notes have to be accessed that are
often in the executable load segment, so either you mmap that
or have to read that, but the latter has a lot more failure
modes, so if i have to get the mmap flags right i'd do a mmap
and then re-mmap if the flags were not right)


Re: [PATCH v2] module: Harden STRICT_MODULE_RWX

2020-08-12 Thread Szabolcs Nagy
The 08/12/2020 18:37, Ard Biesheuvel wrote:
> module_frob_arch_sections
> 
> On Wed, 12 Aug 2020 at 18:00, Jessica Yu  wrote:
> >
> > +++ Szabolcs Nagy [12/08/20 15:15 +0100]:
> > >The 08/12/2020 13:56, Will Deacon wrote:
> > >> On Wed, Aug 12, 2020 at 12:40:05PM +0200, pet...@infradead.org wrote:
> > >> > On Wed, Aug 12, 2020 at 10:56:56AM +0200, Ard Biesheuvel wrote:
> > >> > > The module .lds has BYTE(0) in the section contents to prevent the
> > >> > > linker from pruning them entirely. The (NOLOAD) is there to ensure
> > >> > > that this byte does not end up in the .ko, which is more a matter of
> > >> > > principle than anything else, so we can happily drop that if it 
> > >> > > helps.
> > >> > >
> > >> > > However, this should only affect the PROGBITS vs NOBITS designation,
> > >> > > and so I am not sure whether it makes a difference.
> > >> > >
> > >> > > Depending on where the w^x check occurs, we might simply override the
> > >> > > permissions of these sections, and strip the writable permission if 
> > >> > > it
> > >> > > is set in the PLT handling init code, which manipulates the metadata
> > >> > > of all these 3 sections before the module space is vmalloc'ed.
> > >> >
> > >> > What's curious is that this seems the result of some recent binutils
> > >> > change. Every build with binutils-2.34 (or older) does not seem to
> > >> > generate these as WAX, but has the much more sensible WA.
> > >> >
> > >> > I suppose we can change the kernel check and 'allow' W^X for 0 sized
> > >> > sections, but I think we should still figure out why binutils-2.35 is
> > >> > now generating WAX sections all of a sudden, it might come bite us
> > >> > elsewhere.
> > >>
> > >> Agreed, I think it's important to figure out what's going on here before 
> > >> we
> > >> try to bodge around it.
> > >>
> > >> Adding Szabolcs, in case he has any ideas.
> > >>
> > >> To save him reading the whole thread, here's a summary:
> > >>
> > >> AArch64 kernel modules built with binutils 2.35 end up with a couple of
> > >> ELF sections marked as SHF_WRITE | SHF_ALLOC | SHF_EXECINSTR:
> > >>
> > >> [ 5] .plt PROGBITS 0388 01d000 08 00 WAX  0   0  1
> > >> [ 6] .init.plt NOBITS 0390 01d008 08 00  WA  0   0  1
> > >> [ 7] .text.ftrace_trampoline PROGBITS 0398 01d008 08 00 WAX  0   0  1
> > >>
> > >> This results in the module being rejected by our loader, because we don't
> > >> permit writable, executable mappings.
> > >>
> > >> Our linker script for these entries uses NOLOAD, so it's odd to see 
> > >> PROGBITS
> > >> appearing in the readelf output above (and older binutils emits NOBITS
> > >> sections). Anyway, here's the linker script:
> > >>
> > >> SECTIONS {
> > >>  .plt (NOLOAD) : { BYTE(0) }
> > >>  .init.plt (NOLOAD) : { BYTE(0) }
> > >>  .text.ftrace_trampoline (NOLOAD) : { BYTE(0) }
> > >> }
> > >>
> > >> It appears that the name of the section influences the behaviour, as
> > >> Jessica observed [1] that sections named .text.* end up with PROGBITS,
> > >> whereas random naming such as ".test" ends up with NOBITS, as before.
> > >>
> > >> We've looked at the changelog between binutils 2.34 and 2.35, but nothing
> > >> stands out. Any clues? Is this intentional binutils behaviour?
> > >
> > >for me it bisects to
> > >
> > >https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=8c803a2dd7d3d742a3d0071914f557ef465afe71
> > >
> > >i will have to investigate further what's going on.
> >
> > Thanks for the hint. I'm almost certain it's due to this excerpt from
> > the changelog: "we now init sh_type and sh_flags for all known ABI sections
> > in _bfd_elf_new_section_hook."
> >
> > Indeed, .plt and .text.* are listed as special sections in bfd/elf.c.
> > The former requires an exact match and the latter only has to match
> > the prefix ".text." Since the code considers ".plt" and
> > ".text.ftrace_trampoline" special sections, their sh_type and sh_flags
> > are now set by default. Now I guess the question is whether this can
> > be overriden by a linker script..
> >
> 
> If this is even possible to begin with, doing this in a way that is
> portable across the various linkers that we claim to support is going
> to be tricky.
> 
> I suppose this is the downside of using partially linked objects as
> our module format - using ordinary shared libraries (along with the
> appropriate dynamic relocations which are mostly portable across
> architectures) would get rid of most of the PLT and trampoline issues,
> and of a lot of complex static relocation processing code.
> 
> I know there is little we can do at this point, apart from ignoring
> the permissions - perhaps we should just defer the w^x check until
> after calling module_frob_arch_sections()?

i opened

https://sourceware.org/bugzilla/show_bug.cgi?id=26378

and waiting for binutils maintainers if this is intentional.


Re: [PATCH v2] module: Harden STRICT_MODULE_RWX

2020-08-12 Thread Szabolcs Nagy
The 08/12/2020 13:56, Will Deacon wrote:
> On Wed, Aug 12, 2020 at 12:40:05PM +0200, pet...@infradead.org wrote:
> > On Wed, Aug 12, 2020 at 10:56:56AM +0200, Ard Biesheuvel wrote:
> > > The module .lds has BYTE(0) in the section contents to prevent the
> > > linker from pruning them entirely. The (NOLOAD) is there to ensure
> > > that this byte does not end up in the .ko, which is more a matter of
> > > principle than anything else, so we can happily drop that if it helps.
> > > 
> > > However, this should only affect the PROGBITS vs NOBITS designation,
> > > and so I am not sure whether it makes a difference.
> > > 
> > > Depending on where the w^x check occurs, we might simply override the
> > > permissions of these sections, and strip the writable permission if it
> > > is set in the PLT handling init code, which manipulates the metadata
> > > of all these 3 sections before the module space is vmalloc'ed.
> > 
> > What's curious is that this seems the result of some recent binutils
> > change. Every build with binutils-2.34 (or older) does not seem to
> > generate these as WAX, but has the much more sensible WA.
> > 
> > I suppose we can change the kernel check and 'allow' W^X for 0 sized
> > sections, but I think we should still figure out why binutils-2.35 is
> > now generating WAX sections all of a sudden, it might come bite us
> > elsewhere.
> 
> Agreed, I think it's important to figure out what's going on here before we
> try to bodge around it.
> 
> Adding Szabolcs, in case he has any ideas.
> 
> To save him reading the whole thread, here's a summary:
> 
> AArch64 kernel modules built with binutils 2.35 end up with a couple of
> ELF sections marked as SHF_WRITE | SHF_ALLOC | SHF_EXECINSTR:
> 
> [ 5] .plt PROGBITS 0388 01d000 08 00 WAX  0   0  1
> [ 6] .init.plt NOBITS 0390 01d008 08 00  WA  0   0  1
> [ 7] .text.ftrace_trampoline PROGBITS 0398 01d008 08 00 WAX  0   0  1
> 
> This results in the module being rejected by our loader, because we don't
> permit writable, executable mappings.
> 
> Our linker script for these entries uses NOLOAD, so it's odd to see PROGBITS
> appearing in the readelf output above (and older binutils emits NOBITS
> sections). Anyway, here's the linker script:
> 
> SECTIONS {
>   .plt (NOLOAD) : { BYTE(0) }
>   .init.plt (NOLOAD) : { BYTE(0) }
>   .text.ftrace_trampoline (NOLOAD) : { BYTE(0) }
> }
> 
> It appears that the name of the section influences the behaviour, as
> Jessica observed [1] that sections named .text.* end up with PROGBITS,
> whereas random naming such as ".test" ends up with NOBITS, as before.
> 
> We've looked at the changelog between binutils 2.34 and 2.35, but nothing
> stands out. Any clues? Is this intentional binutils behaviour?

for me it bisects to

https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=8c803a2dd7d3d742a3d0071914f557ef465afe71

i will have to investigate further what's going on.

> 
> Thanks,
> 
> Will
> 
> [1] https://lore.kernel.org/r/20200812114127.ga10...@linux-8ccs.fritz.box


Re: [PATCH glibc 1/3] glibc: Perform rseq registration at C startup and thread creation (v20)

2020-06-18 Thread Szabolcs Nagy
The 06/11/2020 20:26, Joseph Myers wrote:
> On Thu, 11 Jun 2020, Mathieu Desnoyers wrote:
> > I managed to get a repository up and running for librseq, and have 
> > integrated
> > the rseq.2 man page with comments from Michael Kerrisk here:
> > 
> > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2
> > 
> > Is that a suitable URL ? Can we simply point to it from glibc's manual ?
> 
> Yes, that seems something reasonable to link to.

is there work to make the usage of rseq critical
sections portable? (e.g. transactional memory
critical section has syntax in gcc, but that
doesn't require straight line code with
begin/end/abort labels in a particular layout.)

the macros and inline asm in rseq-*.h are not
too nice, but if they can completely hide the
non-portable bits then i guess that works.


Re: [PATCH v5 1/3] open: add close_range()

2020-06-06 Thread Szabolcs Nagy
* Kyle Evans  [2020-06-05 21:54:56 -0500]:
> On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy  wrote:
> > this api needs a documentation patch if there isn't yet.
> >
> > currently there is no libc interface contract in place that
> > says which calls may use libc internal fds e.g. i've seen
> >
> >   openlog(...) // opens libc internal syslog fd
> >   ...
> >   fork()
> >   closefrom(...) // close syslog fd
> >   open(...) // something that reuses the closed fd
> >   syslog(...) // unsafe: uses the wrong fd
> >   execve(...)
> >
> > syslog uses a libc internal fd that the user trampled on and
> > this can go bad in many ways depending on what libc apis are
> > used between closefrom (or equivalent) and exec.
> >
> 
> Documentation is good. :-) I think you'll find that while this example
> seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> atypical scenario and generally not advised.  You would usually not
> start closing until you're actually ready to exec/fail.

it's a recent bug https://bugs.kde.org/show_bug.cgi?id=420921

but not the first closefrom bug i saw: it is a fundamentally
unsafe operation that frees resources owned by others.

> > > The code snippet above is one way of working around the problem that file
> > > descriptors are not cloexec by default. This is aggravated by the fact 
> > > that
> > > we can't just switch them over without massively regressing userspace. For
> >
> > why is a switch_to_cloexec_range worse than close_range?
> > the former seems safer to me. (and allows libc calls
> > to be made between such switch and exec: libc internal
> > fds have to be cloexec anyway)
> >
> 
> I wouldn't say it's worse, but it only solves half the problem. While
> closefrom -> exec is the most common usage by a long shot, there are
> also times (e.g. post-fork without intent to exec for a daemon/service
> type) that you want to go ahead and close everything except maybe a
> pipe fd that you've opened for IPC. While uncommon, there's no reason
> this needs to devolve into a loop to close 'all the fds' when you can
> instead introduce close_range to solve both the exec case and other
> less common scenarios.

the syslog example shows why post-fork closefrom without
intent to exec does not work: there is no contract about
which api calls behave like syslog, so calling anything
after closefrom can be broken.

libc can introduce new api contracts e.g. that some apis
don't use fds internally or after a closefrom call some
apis behave differently, if this is the expected direction
then it would be nice to propose that on the libc-coord
at openwall.com list.

> Coordination with libc is generally not much of an issue, because this
> is really one of the last things you do before exec() or swiftly
> failing miserably. Applications that currently loop over all fd <=
> maxfd and close(fd) right now are subject to the very same
> constraints, this is just a much more efficient way and
> debugger-friendly way to accomplish it. You've absolutely not lived
> life until you've had to watch thousands of close() calls painfully
> scroll by in truss/strace.

applications do a 'close all fds' operation because there
is no alternative. (i think there are better ways to do
this than looping: you can poll on the fds and only close
the ones that didn't return POLLNVAL, this should be more
portable than /proc, but it's beside my point) optimizing
this operation may not be the only way to achieve whatever
those applications are trying to do.

if closefrom only works before exec then that should be
documented and callers that do otherwise fixed, if
important users do things between closefrom and exec then
i think a different design is needed with libc maintainers
involved.


Re: [PATCH v5 1/3] open: add close_range()

2020-06-05 Thread Szabolcs Nagy
* Christian Brauner  [2020-06-02 22:42:17 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
> 
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
> 
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
> 
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
> 
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve();

this api needs a documentation patch if there isn't yet.

currently there is no libc interface contract in place that
says which calls may use libc internal fds e.g. i've seen

  openlog(...) // opens libc internal syslog fd
  ...
  fork()
  closefrom(...) // close syslog fd
  open(...) // something that reuses the closed fd
  syslog(...) // unsafe: uses the wrong fd
  execve(...)

syslog uses a libc internal fd that the user trampled on and
this can go bad in many ways depending on what libc apis are
used between closefrom (or equivalent) and exec.

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For

why is a switch_to_cloexec_range worse than close_range?
the former seems safer to me. (and allows libc calls
to be made between such switch and exec: libc internal
fds have to be cloexec anyway)

> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
> 
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc//fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>   - Python (cf. [2])
>   - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).

close_range still seems like a bad operation to expose.

if users really want closing behaviour (instead of marking
fds cloexec) then they likely need coordination with libc
and other libraries.

e.g. this usage does not work:

  maxfd = findmaxfd();
  call_that_may_leak_fds();
  close_range(maxfd,~0U);

as far as i can tell only the close right before exec works.


Re: sys/sysinfo.h clash with linux/kernel.h

2020-06-03 Thread Szabolcs Nagy


i think the linux-api list is the right place for this
so adding it on cc.

* Rich Felker  [2020-06-02 17:37:05 -0400]:
> linux/kernel.h is a uapi header that does almost nothing but define
> some internal-use alignment macros and -- oddly -- include
> linux/sysinfo.h to provide a definition of struct sysinfo. It's
> included only from 6 places in the kernel uapi headers:
> 
> include/uapi/linux/lightnvm.h
> include/uapi/linux/ethtool.h
> include/uapi/linux/sysctl.h
> include/uapi/linux/netlink.h
> include/uapi/linux/netfilter/x_tables.h
> include/uapi/linux/mroute6.h
> 
> However, it's also included from glibc's sys/sysinfo.h to provide
> struct sysinfo (glibc depends on the kernel for the definition). On
> musl, this produces a conflicting definition if both sys/sysinfo.h and
> any of the above 6 headers are included in the same file.
> 
> I think the underlying problem here is that the same header is used
> for two very disjoint purposes: by glibc as the provider of struct
> sysinfo, and by other kernel headers as provider of the alignment
> macros.
> 
> The glibc use is effectively a permanent contract that can't be
> changed, so what I'd like to do is move the macros out to a separate
> header (maybe linux/something_macros.h?) and have linux/kernel.h and
> the above 6 uapi headers all include that. Then nothing but
> linux/kernel.h would pull in linux/sysinfo.h.

i think providing a patch would make this happen faster.

ideally uapi would be reorganized such that it's clear
what headers are supposed to be compatible with inclusion
together with libc headers and what headers may conflict.

> 
> Note that in practice this is a rather hard issue to hit since almost
> nothing needs sysinfo() at the same time as the above uapi interfaces.
> However it did come up in toybox, which is how I first (just today)
> learned about the conflict, so it seems like something that should be
> fixed.
> 
> Rich


Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4

2020-05-20 Thread Szabolcs Nagy
The 05/19/2020 22:31, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
>  wrote:
> > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso call 
> > > last
> > > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > > https://github.com/raspberrypi/linux/issues/3579
> > >
> > > As Will Deacon pointed out, this was never reported on the mailing list,
> > > so I'll try to summarize what we know, so this can hopefully be resolved 
> > > soon.
> > >
> > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> > >kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> > >clock_gettime64(CLOCK_REALTIME)
> >
> > Does it happen with other clocks as well?
> 
> Unclear.
> 
> > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> > >   see no relevant changes compared to a mainline kernel.
> >
> > Is this bug reproducible with mainline kernel or mainline kernel can't be
> > booted on bcm2711?
> 
> Mainline linux-5.6 should boot on that machine but might not have
> all the other features, so I think users tend to use the raspberry pi
> kernel sources for now.
> 
> > > - From the report, I see that the returned time value is larger than the
> > >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> > >   guess is that a random number gets added in at some point.
> >
> > What kind of code are you using to reproduce it? Is it threaded, or does
> > it issue clock_gettime from signal handlers?
> 
> The reproducer is very simple without threads or signals,
> see the start of https://github.com/richfelker/musl-cross-make/issues/96
> 
> It does rely on calling into the musl wrapper, not the direct vdso
> call.
> 
> > > - From other sources, I found that the Raspberry Pi clocksource runs
> > >   at 54 MHz, with a mask value of 0xff. From these numbers
> > >   I would expect that reading a completely random hardware register
> > >   value would result in an offset up to 1.33 billion seconds, which is
> > >   around factor 100 more than the error we see, though similar.
> > >
> > > - The test case calls the musl clock_gettime() function, which falls back to
> > >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> > >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> > >   not show the bug.
> > >
> > > - The behavior was not reproduced on the same user space in qemu,
> > >   though I cannot tell whether the exact same kernel binary was used.
> > >
> > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> > >   implement clock_gettime(), but earlier versions did not. I have not
> > >   seen any reports of this bug, which could be explained by users
> > >   generally being on older versions.
> > >
> > > - As far as I can tell, there are no reports of this bug from other users,
> > >   and so far nobody could reproduce it.

note: i could not reproduce it in qemu-system with these configs:

qemu-system-aarch64 + arm64 kernel + compat vdso
qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
qemu-system-arm + cpu max + 32bit arm kernel

so i think it's something specific to that user's setup
(maybe rpi hw bug or gcc miscompiled the vdso or something
with that particular linux, i built my own linux 5.6 because
i did not know the exact kernel version where the bug was seen)

i don't have access to rpi (or other cortex-a53 where i
can install my own kernel) so this is as far as i got.

> > >
> > > - The current musl git tree has been patched to not call clock_gettime64
> > >   on ARM because of this problem, so it cannot be used for reproducing it.
> >
> > So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> > or is this bug localized to a specific kernel version running on
> > specific hardware?
> 
> I hope we can figure out what is actually going on soon, there is probably
> no need to change glibc before we have.
> 
>   Arnd

-- 


Re: [PATCH] ARM: pass -msoft-float to gcc earlier

2020-05-20 Thread Szabolcs Nagy
The 05/19/2020 17:38, Nick Desaulniers wrote:
> sorry, hit tab/enter too soon...
> 
> On Tue, May 19, 2020 at 5:37 PM Nick Desaulniers
>  wrote:
> >
> > On Tue, May 19, 2020 at 3:09 PM Arnd Bergmann  wrote:
> > >
> > > Szabolcs Nagy ran into a kernel build failure with a custom gcc
> > > toochain that sets -mfpu=auto -mfloat-abi=hard:
> > >
> > >  /tmp/ccmNdcdf.s:1898: Error: selected processor does not support `cpsid i' in ARM mode
> > >
> > > The problem is that $(call cc-option, -march=armv7-a) fails before the
> > > kernel overrides the gcc options to also pass -msoft-float.
> >
> > The call to `$(call cc-option, -march=armv7-a) is th
> 
> The call to `$(call cc-option, -march=armv7-a) is the one that fails or...?

the flag -march=armv7-a is invalid if the float abi
is hard and no fpu is specified (since gcc-8).

either an fpu should be specified or -march=armv7-a+fp
(my toolchain was configured with the latter and it does
not work if the kernel overrides it with -march=armv7-a)

because of this cc-option failure the kernel goes on to
pass nonsense flags everywhere (-march=armv5t) and some
compilation eventually fails with an asm error.
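the ordering problem can be sketched with the usual Kbuild cc-option idiom
(a simplified, hypothetical fragment, not the literal scripts/Kbuild.include
or arch/arm/Makefile contents):

```make
# simplified sketch of cc-option: a flag is kept only if a test compile
# with the current KBUILD_CFLAGS succeeds, else the fallback is used.
cc-option = $(shell if $(CC) $(KBUILD_CFLAGS) $(1) -S -x c /dev/null \
              -o /dev/null 2>/dev/null; then echo "$(1)"; else echo "$(2)"; fi)

# the fix: -msoft-float must already be in KBUILD_CFLAGS here,
# before the first cc-option call ...
KBUILD_CFLAGS += -msoft-float

# ... so that this probe is compiled with soft-float.  without it, a
# toolchain defaulting to -mfloat-abi=hard with no fpu rejects plain
# -march=armv7-a (gcc >= 8), and the -march=armv5t fallback ends up
# being passed to every compile in the build.
arch-y := $(call cc-option,-march=armv7-a,-march=armv5t)
```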

> > >
> > > Move the option to the beginning the Makefile, before we call
> >
> > beginning of the
> >
> > > cc-option for the first time.
> > >
> > > Reported-by: Szabolcs Nagy 
> > > Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87302
> > > Signed-off-by: Arnd Bergmann 
> >
> > Moving this looks harmless enough, though it's not clear to me how the
> > failure you're describing would occur.  I don't see calls to as-instr
> > in arch/arm/Makefile.  Which object is being built before -msoft-float
> > is being set?
> 
> ... ^
> 
> >
> > > ---
> > >  arch/arm/Makefile | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm/Makefile b/arch/arm/Makefile
> > > index 7d5cd0f85461..e428ea6eb0fa 100644
> > > --- a/arch/arm/Makefile
> > > +++ b/arch/arm/Makefile
> > > @@ -16,6 +16,8 @@ LDFLAGS_vmlinux   += --be8
> > >  KBUILD_LDFLAGS_MODULE  += --be8
> > >  endif
> > >
> > > +KBUILD_CFLAGS  += -msoft-float
> > > +
> > >  ifeq ($(CONFIG_ARM_MODULE_PLTS),y)
> > >  KBUILD_LDS_MODULE  += $(srctree)/arch/arm/kernel/module.lds
> > >  endif
> > > @@ -135,7 +137,7 @@ AFLAGS_ISA  :=$(CFLAGS_ISA)
> > >  endif
> > >
> > >  # Need -Uarm for gcc < 3.x
> > > -KBUILD_CFLAGS  +=$(CFLAGS_ABI) $(CFLAGS_ISA) $(arch-y) $(tune-y) $(call cc-option,-mshort-load-bytes,$(call cc-option,-malignment-traps,)) -msoft-float -Uarm
> > > +KBUILD_CFLAGS  +=$(CFLAGS_ABI) $(CFLAGS_ISA) $(arch-y) $(tune-y) $(call cc-option,-mshort-load-bytes,$(call cc-option,-malignment-traps,)) -Uarm
> > >  KBUILD_AFLAGS  +=$(CFLAGS_ABI) $(AFLAGS_ISA) $(arch-y) $(tune-y) -include asm/unified.h -msoft-float
> > >
> > >  CHECKFLAGS += -D__arm__
> > > --
> > > 2.26.2
> > >
> 
> -- 
> Thanks,
> ~Nick Desaulniers

-- 


Re: [PATCH glibc 5/9] glibc: Perform rseq(2) registration at C startup and thread creation (v17)

2020-04-29 Thread Szabolcs Nagy
The 04/29/2020 10:18, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > The 04/28/2020 10:58, Mathieu Desnoyers wrote:
> >> - On Apr 28, 2020, at 8:54 AM, Florian Weimer f...@deneb.enyo.de wrote:
> >> > That one definitely should work.
> >> > 
> >> > I expect you might see this if libgcc_s.so.1 is installed into a
> >> > multiarch subdirectory that upstream glibc does not search.  (The
> >> > Debian patches are unfortunately not upstream.)
> >> 
> >> My test environment is a Ubuntu 18.04.1 LTS.
> >> 
> >> > 
> >> > I think on my system, the built glibc can find the system libgcc_s via
> >> > /etc/ld.so.cache, so I haven't seen this issue yet.
> >> 
> >> On my system, libgcc_s is provided here:
> >> 
> >> /lib/x86_64-linux-gnu/libgcc_s.so.1
> >> 
> >> by this package:
> >> 
> >> Package: libgcc1
> >> Architecture: amd64
> >> Version: 1:8.4.0-1ubuntu1~18.04
> >
> > before running the tests
> >
> > cp `$CC --print-file-name libgcc_s.so.1` glibc/build/dir
> > cp `$CC --print-file-name libstdc++.so.6` glibc/build/dir
> >
> > so those toolchain libs are in the search path
> > of the newly built libc when running tests.
> 
> Do you actually see the need for these steps yourself?
> 
> I guess the correct fix would be to upstream the Debian multiarch
> changes and activate them automatically with a configure check on
> systems that use multiarch paths.

cancel tests work for me on an ubuntu system because
of /etc/ld.so.cache, but that may not be present
or the system may not be glibc based at all.

i always do the cp because i build gcc myself (usually
close to current master) and don't install it to the
system path, which means at compile time and runtime
different libraries are used if i don't copy them.


Re: [PATCH glibc 5/9] glibc: Perform rseq(2) registration at C startup and thread creation (v17)

2020-04-29 Thread Szabolcs Nagy
The 04/28/2020 10:58, Mathieu Desnoyers wrote:
> - On Apr 28, 2020, at 8:54 AM, Florian Weimer f...@deneb.enyo.de wrote:
> > That one definitely should work.
> > 
> > I expect you might see this if libgcc_s.so.1 is installed into a
> > multiarch subdirectory that upstream glibc does not search.  (The
> > Debian patches are unfortunately not upstream.)
> 
> My test environment is a Ubuntu 18.04.1 LTS.
> 
> > 
> > I think on my system, the built glibc can find the system libgcc_s via
> > /etc/ld.so.cache, so I haven't seen this issue yet.
> 
> On my system, libgcc_s is provided here:
> 
> /lib/x86_64-linux-gnu/libgcc_s.so.1
> 
> by this package:
> 
> Package: libgcc1
> Architecture: amd64
> Version: 1:8.4.0-1ubuntu1~18.04

before running the tests

cp `$CC --print-file-name libgcc_s.so.1` glibc/build/dir
cp `$CC --print-file-name libstdc++.so.6` glibc/build/dir

so those toolchain libs are in the search path
of the newly built libc when running tests.


Re: [PATCH glibc] Linux: Include <linux/sockios.h> in <bits/socket.h> under __USE_MISC

2019-07-22 Thread Szabolcs Nagy
On 22/07/2019 14:44, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
>> (note: in musl these ioctl macros are in sys/ioctl.h
>> which is not a posix header so namespace rules are
>> less strict than for sys/socket.h and users tend to
>> include it for ioctl())
> 
> <sys/ioctl.h> can be confusing because some of the constants may depend
> on types that aren't declared by including the header.  This makes their
> macros unusable.  Defining ioctl constants in headers which also provide
> the matching types avoids that problem at least, but it can introduce
> namespace issues.

yeah, the way we deal with that is we hard code the numbers
since the size of struct is abi, they cannot change.


Re: [PATCH glibc] Linux: Include <linux/sockios.h> in <bits/socket.h> under __USE_MISC

2019-07-22 Thread Szabolcs Nagy
On 22/07/2019 12:34, Arnd Bergmann wrote:
> On Mon, Jul 22, 2019 at 1:31 PM Florian Weimer  wrote:
>>
>> Historically, <asm/socket.h> (which is included from <bits/socket.h>)
>> provided ioctl operations for sockets.  User code accessed them
>> through <sys/socket.h>.  The kernel UAPI headers have removed these
>> definitions in favor of <linux/sockios.h>.  This commit makes them
>> available via <sys/socket.h> again.
> 
> Looks good to me.
> 
> I wonder if we should still do these two changes in the kernel:
> 
> - include asm/socket.h from linux/socket.h for consistency
> - move the defines that got moved from asm/sockios.h to linux/sockios.h
>   back to the previous location to help anyone who is using
>   newer kernel headers with older glibc headers.

does user code actually expect these in sys/socket.h
or in asm/socket.h ?

man 7 socket
mentions SIOCGSTAMP but does not say the header.

man 2 ioctl_list
specifies include/asm-i386/socket.h as the location.

if user code tends to directly include asm/socket.h
then i think it's better to undo the kernel header
change than to put things in sys/socket.h.

(note: in musl these ioctl macros are in sys/ioctl.h
which is not a posix header so namespace rules are
less strict than for sys/socket.h and users tend to
include it for ioctl())



Re: [PATCH v4 1/2] arm64: Define Documentation/arm64/tagged-address-abi.txt

2019-06-13 Thread Szabolcs Nagy
On 13/06/2019 15:03, Vincenzo Frascino wrote:
> On 13/06/2019 13:28, Szabolcs Nagy wrote:
>> On 13/06/2019 12:16, Vincenzo Frascino wrote:
>>> On 13/06/2019 11:14, Szabolcs Nagy wrote:
>>>> On 13/06/2019 10:20, Catalin Marinas wrote:
>>>>> On Wed, Jun 12, 2019 at 05:30:34PM +0100, Szabolcs Nagy wrote:
>>>>>> On 12/06/2019 15:21, Vincenzo Frascino wrote:
>>>>>>> +  - a mapping below sbrk(0) done by the process itself
>>>>>>
>>>>>> doesn't the mmap rule cover this?
>>>>>
>>>>> IIUC it doesn't cover it as that's memory mapped by the kernel
>>>>> automatically on access vs a pointer returned by mmap(). The statement
>>>>> above talks about how the address is obtained by the user.
>>>>
>>>> ok i read 'mapping below sbrk' as an mmap (possibly MAP_FIXED)
>>>> that happens to be below the heap area.
>>>>
>>>> i think "below sbrk(0)" is not the best term to use: there
>>>> may be address range below the heap area that can be mmapped
>>>> and thus below sbrk(0) and sbrk is a posix api not a linux
>>>> syscall, the libc can implement it with mmap or whatever.
>>>>
>>>> i'm not sure what the right term for 'heap area' is
>>>> (the address range between syscall(__NR_brk,0) at
>>>> program startup and its current value?)
>>>>
>>>
>>> I used sbrk(0) with the meaning of "end of the process's data segment", not
>>> implying that this is a syscall, but just as a useful way to identify the mapping.
>>> I agree that it is a posix function implemented by libc, but when it is used with
>>> 0 it finds the current location of the program break, which can be changed by brk()
>>> and, depending on the new address passed to this syscall, can have the effect of
>>> allocating or deallocating memory.
>>>
>>> Will changing sbrk(0) to "end of the process's data segment" make it more clear?
>>
>> i don't understand what's the relevance of the *end*
>> of the data segment.
>>
>> i'd expect the text to say something about the address
>> range of the data segment.
>>
>> i can do
>>
>> mmap((void*)65536, 65536, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED|MAP_ANON, -1, 0);
>>
>> and it will be below the end of the data segment.
>>
> 
> As far as I understand the data segment "lives" below the program break, hence
> it is a way of describing the range from which the user can obtain a valid
> tagged pointer.
> 
> Said that, I am not really sure how you want me to document this (my aim
> is for this to be clear to userspace developers). Could you please propose
> something?

[...], it is in the memory ranges privately owned by a
userspace process and it is obtained in one of the
following ways:

- mmap done by the process itself, [...]

- brk syscall done by the process itself.
  (i.e. the heap area between the initial location
  of the program break at process creation and its
  current location.)

- any memory mapped by the kernel [...]

the data segment that's part of the process image is
already covered by the last point.


Re: [PATCH v1 1/2] fork: add clone3

2019-05-30 Thread Szabolcs Nagy
* Christian Brauner  [2019-05-29 17:22:36 +0200]:
> This adds the clone3 system call.
> 
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
> 
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
> 
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again.
> 
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
>   in clone() into a dedicated struct clone_args
>   - choose a struct layout that is easy to handle on 32 and on 64 bit
>   - choose a struct layout that is extensible
>   - give new flags that currently need to abuse another flag's dedicated
> return argument in clone() their own dedicated return argument
> (e.g. CLONE_PIDFD)
>   - use a separate kernel internal struct kernel_clone_args that is
> properly typed according to current kernel conventions in fork.c and is
> different from  the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
>   syscalls such as fork(), vfork(), clone(), and clone3() behave identical
>   (Arnd suggested, that we can probably also port do_fork() itself in a
>separate patchset.)
> - ease of transition for userspace from clone() to clone3()
>   This very much means that we do *not* remove functionality that userspace
>   currently relies on as the latter is a good way of creating a syscall
>   that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
> 
> In accordance with Linus suggestions, clone3() has the following signature:
> 
> /* uapi */
> struct clone_args {
> __aligned_u64 flags;
> __aligned_u64 pidfd;
> __aligned_u64 parent_tidptr;
> __aligned_u64 child_tidptr;
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> };

is this new linux syscall api style to pass pointers as u64?

i think it will look a bit ugly in userspace, where a cast
to u64 would sign-extend pointers on most 32bit targets, so
user code would have to do something like

arg.ptr = (uint64_t)(uintptr_t)ptr;

such ugliness can be hidden by the libc with a different
struct definition, but it will require bigendian and alignment
hackery (or translation in libc, but that does not really work
when the user calls the raw syscall).

> 
> /* kernel internal */
> struct kernel_clone_args {
> u64 flags;
> int __user *pidfd;
> int __user *parent_tidptr;
> int __user *child_tidptr;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> };
> 
> long sys_clone3(struct clone_args __user *uargs, size_t size)
> 
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
> 
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> 
> /* References */
> [1]: 

Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v9)

2019-04-23 Thread Szabolcs Nagy
On 22/04/2019 18:56, Mathieu Desnoyers wrote:
> diff --git a/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> new file mode 100644
> index 00..e538668612
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> @@ -0,0 +1,44 @@
> +/* Restartable Sequences Linux aarch64 architecture header.
> +
> +   Copyright (C) 2019 Free Software Foundation, Inc.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#ifndef _SYS_RSEQ_H
> +# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
> +#endif
> +
> +/* RSEQ_SIG is a signature required before each abort handler code.
> +
> +   It is a 32-bit value that maps to actual architecture code compiled
> +   into applications and libraries. It needs to be defined for each
> +   architecture. When choosing this value, it needs to be taken into
> +   account that generating invalid instructions may have ill effects on
> +   tools like objdump, and may also have impact on the CPU speculative
> +   execution efficiency in some cases.
> +
> +   aarch64 -mbig-endian generates mixed endianness code vs data:
> +   little-endian code and big-endian data. Ensure the RSEQ_SIG signature
> +   matches code endianness.  */
> +
> +#define RSEQ_SIG_CODE 0xd428bc00  /* BRK #0x45E0.  */
> +
> +#ifdef __AARCH64EB__
> +#define RSEQ_SIG_DATA 0x00bc28d4  /* BRK #0x45E0.  */
> +#else
> +#define RSEQ_SIG_DATA RSEQ_SIG_CODE
> +#endif
> +
> +#define RSEQ_SIG RSEQ_SIG_DATA
> diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> index 9c330f325e..331f39e41a 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> @@ -2141,3 +2141,5 @@ GLIBC_2.28 thrd_yield F
>  GLIBC_2.29 getcpu F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
> +GLIBC_2.30 __rseq_abi T 0x20
> +GLIBC_2.30 __rseq_handled D 0x4

the aarch64 changes are ok.


Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)

2019-04-23 Thread Szabolcs Nagy
On 18/04/2019 19:17, Mathieu Desnoyers wrote:
> - On Apr 18, 2019, at 1:37 PM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
>> you have to add a documentation comment somewhere
>> explaining if RSEQ_SIG is the value that's passed to
>> the kernel and then aarch64 asm code has to use
>>
>> .inst endianfixup(RSEQ_SIG) // or
>> .word RSEQ_SIG
> 
> Using ".word" won't allow objdump to show the instruction it
> maps to. It will consider it as data. So .inst is preferred here.

is there some specific reason you prefer .inst?

disassembling a canary value as data (that is
never executed, but loaded and compared by the
kernel as data) sounds more semantically correct
to me than showing it as an instruction.

i guess having it as an instruction can avoid
issues if some tools dislike .word in .text,
but otherwise .word seems better.


Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)

2019-04-18 Thread Szabolcs Nagy
On 18/04/2019 18:10, Mathieu Desnoyers wrote:
> 
> - On Apr 18, 2019, at 12:07 PM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
> 
>> On 18/04/2019 16:41, Mathieu Desnoyers wrote:
>>> - On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy szabolcs.n...@arm.com 
>>> wrote:
>>>
>>>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>>>> - On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>>>> mathieu.desnoy...@efficios.com wrote:
>>>>>> - On Apr 17, 2019, at 12:17 PM, Joseph Myers jos...@codesourcery.com 
>>>>>> wrote:
>>>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>>>
>>>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>>>> +
> >>>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
> >>>>>>>>> +   into applications and libraries. It needs to be defined for each
> >>>>>>>>> +   architecture. When choosing this value, it needs to be taken into
> >>>>>>>>> +   account that generating invalid instructions may have ill effects on
> >>>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
> >>>>>>>>> +   execution efficiency in some cases.  */
>>>>>>>>> +
>>>>>>>>> +#define RSEQ_SIG 0xd428bc00  /* BRK #0x45E0.  */
>>>>>>>>
>>>>>>>> After further investigation, we should probably do the following
>>>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>>>> big endian data):
>>>>>>>
>>>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>>>> interpreted in the code or the data endianness.
>>>>>>
>>>>>> Right. The signature passed as argument to the rseq registration
>>>>>> system call needs to be in data endianness (currently exposed kernel
>>>>>> ABI).
>>>>>>
>>>>>> Ideally for userspace, we want to define a signature in code endianness
>>>>>> that happens to nicely match specific code patterns.
>>>> ...
>>>>> For aarch64, I think we can simply do:
>>>>>
>>>>> /*
>>>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>>>  * matches code endianness.
>>>>>  */
>>>>> #define RSEQ_SIG_CODE   0xd428bc00  /* BRK #0x45E0.  */
>>>>>
>>>>> #ifdef __ARM_BIG_ENDIAN
>>>>> #define RSEQ_SIG_DATA   0x00bc28d4  /* BRK #0x45E0.  */
>>>>> #else
>>>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>>>> #endif
>>>>>
>>>>> #define RSEQ_SIG    RSEQ_SIG_DATA
>>>>>
>>>>> Feedback is most welcome,
>>>>
>>>> so the RSEQ_SIG value is supposed to be used with .word
>>>> in asm instead of .inst?
>>>
>>> We want a .inst so it translates into a valid trap instruction.
>>> It's better to trap in case program execution reaches this
>>> by mistake (makes debugging easier).
>>
>> that does not make sense to me.
>>
>> ".inst" is an asm directive that requires the value to
>> be the same on BE and LE (normal insn encoding).
>>
>> ".word" is an asm directive that requires the value to
>> use swapped encoding on BE (if it's used in the instruction
>> stream it will create a data mapping symbol and disasm to
>> .word value instead of the instruction mnemonics).
>>
>> so which one is it?
> 
> We declare the signature with ".inst" in assembler.
> 
> However, we also need to pass that 32-bit signature value as
> argument to the rseq system call when registering rseq.
> 
> The signature comparison is performed by the kernel before
> moving the instruction pointer to the abort handler. It compares
> the signature received as parameter by sys_rseq (data) to the
> 4-byte signature preceding the abort IP.
> 
> On aarch64 big endian, AFAIU the signature in the code is in
> little endian, and the signature value passed as argument to
> the rseq system call is in big endian. One way to handle this
> is to swap the byte order of the signature "data" representation
> passed as argument to sys_rseq.

you have to add a documentation comment somewhere
explaining if RSEQ_SIG is the value that's passed to
the kernel and then aarch64 asm code has to use

 .inst endianfixup(RSEQ_SIG) // or
 .word RSEQ_SIG

or if RSEQ_SIG is used as

 .inst RSEQ_SIG

in aarch64 asm and then endianfixup(RSEQ_SIG) should
be passed to the syscall.

either way it can be a brk 0x45e0 on both LE and BE,
but in the latter case you have to document this in
arch independent way, since the syscall api must be
portable (i assume "RSEQ_SIG" is part of the api).


Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)

2019-04-18 Thread Szabolcs Nagy
On 18/04/2019 16:41, Mathieu Desnoyers wrote:
> - On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
> 
>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>> - On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>> mathieu.desnoy...@efficios.com wrote:
>>>> - On Apr 17, 2019, at 12:17 PM, Joseph Myers jos...@codesourcery.com 
>>>> wrote:
>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>
>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>> +
>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>>> +   execution efficiency in some cases.  */
>>>>>>> +
>>>>>>> +#define RSEQ_SIG 0xd428bc00  /* BRK #0x45E0.  */
>>>>>>
>>>>>> After further investigation, we should probably do the following
>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>> big endian data):
>>>>>
>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>> interpreted in the code or the data endianness.
>>>>
>>>> Right. The signature passed as argument to the rseq registration
>>>> system call needs to be in data endianness (currently exposed kernel
>>>> ABI).
>>>>
>>>> Ideally for userspace, we want to define a signature in code endianness
>>>> that happens to nicely match specific code patterns.
>> ...
>>> For aarch64, I think we can simply do:
>>>
>>> /*
>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>  * matches code endianness.
>>>  */
>>> #define RSEQ_SIG_CODE   0xd428bc00  /* BRK #0x45E0.  */
>>>
>>> #ifdef __ARM_BIG_ENDIAN
>>> #define RSEQ_SIG_DATA   0x00bc28d4  /* BRK #0x45E0.  */
>>> #else
>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>> #endif
>>>
>>> #define RSEQ_SIG    RSEQ_SIG_DATA
>>>
>>> Feedback is most welcome,
>>
>> so the RSEQ_SIG value is supposed to be used with .word
>> in asm instead of .inst?
> 
> We want a .inst so it translates into a valid trap instruction.
> It's better to trap in case program execution reaches this
> by mistake (makes debugging easier).

that does not make sense to me.

".inst" is an asm directive that requires the value to
be the same on BE and LE (normal insn encoding).

".word" is an asm directive that requires the value to
use swapped encoding on BE (if it's used in the instruction
stream it will create a data mapping symbol and disasm to
.word value instead of the instruction mnemonics).

so which one is it?

>> i don't think we use __ARM_* in public headers currently,
>> but hopefully aarch64_be compilers implement it.
> 
> Can I use #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__  then ?

hm, i'd either use #ifdef __AARCH64EB__ (since we already use it)
or the portable #include endian.h and __BYTE_ORDER == __BIG_ENDIAN


Re: [PATCH 2/5] glibc: sched_getcpu(): use rseq cpu_id TLS on Linux (v2)

2019-04-18 Thread Szabolcs Nagy
On 16/04/2019 18:32, Mathieu Desnoyers wrote:
> --- a/sysdeps/unix/sysv/linux/sched_getcpu.c
> +++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
> @@ -37,3 +37,26 @@ sched_getcpu (void)
>return -1;
>  #endif
>  }
> +
> +#ifdef __NR_rseq
> +#include <sys/rseq.h>
> +#endif
> +
> +#if defined __NR_rseq && defined RSEQ_SIG
> +extern __attribute__ ((tls_model ("initial-exec")))
> +__thread volatile struct rseq __rseq_abi;

i'd expect sys/rseq.h to provide this declaration.

> +
> +int
> +sched_getcpu (void)
> +{
> +  int cpu_id = __rseq_abi.cpu_id;
> +
> +  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
> +}
> +#else
> +int
> +sched_getcpu (void)
> +{
> +  return vsyscall_sched_getcpu ();
> +}
> +#endif
> -- 
> 2.17.1
> 



Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)

2019-04-18 Thread Szabolcs Nagy
On 18/04/2019 14:17, Mathieu Desnoyers wrote:
> - On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers 
> mathieu.desnoy...@efficios.com wrote:
>> - On Apr 17, 2019, at 12:17 PM, Joseph Myers jos...@codesourcery.com 
>> wrote:
>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>
> +/* RSEQ_SIG is a signature required before each abort handler code.
> +
> +   It is a 32-bit value that maps to actual architecture code compiled
> +   into applications and libraries. It needs to be defined for each
> +   architecture. When choosing this value, it needs to be taken into
> +   account that generating invalid instructions may have ill effects on
> +   tools like objdump, and may also have impact on the CPU speculative
> +   execution efficiency in some cases.  */
> +
> +#define RSEQ_SIG 0xd428bc00  /* BRK #0x45E0.  */

 After further investigation, we should probably do the following
 to handle compiling with -mbig-endian on aarch64, which generates
 binaries with mixed code vs data endianness (little endian code,
 big endian data):
>>>
>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>> interpreted in the code or the data endianness.
>>
>> Right. The signature passed as argument to the rseq registration
>> system call needs to be in data endianness (currently exposed kernel
>> ABI).
>>
>> Ideally for userspace, we want to define a signature in code endianness
>> that happens to nicely match specific code patterns.
...
> For aarch64, I think we can simply do:
> 
> /*
>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>  * matches code endianness.
>  */
> #define RSEQ_SIG_CODE   0xd428bc00  /* BRK #0x45E0.  */
> 
> #ifdef __ARM_BIG_ENDIAN
> #define RSEQ_SIG_DATA   0x00bc28d4  /* BRK #0x45E0.  */
> #else
> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
> #endif
> 
> #define RSEQ_SIG    RSEQ_SIG_DATA
> 
> Feedback is most welcome,

so the RSEQ_SIG value is supposed to be used with .word
in asm instead of .inst?

i don't think we use __ARM_* in public headers currently,
but hopefully aarch64_be compilers implement it.

otherwise this looks ok to me.

(i think a rare palindrome instruction would work too, e.g.
0a5f5f0a  and   w10, w24, wzr, lsr #23  // shifted 0
2a5f5f2a  orr   w10, w25, wzr, lsr #23
eb9f9feb  negs  x11, xzr, asr #39
c83f3fc8  stxp  wzr, x8, x15, [x30]     // store to LR ignoring success
d9d9      stz2g x25, [x30, #-16]!       // v8.5 tag+zero 2 granules around LR
etc. it does not need to be a guaranteed trap)


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-25 Thread Szabolcs Nagy
On 25/02/2019 16:57, Catalin Marinas wrote:
> On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
>> i think these rules work for the cases i care about, a more
>> tricky question is when/how to check for the new syscall abi
>> and when/how the TCR_EL1.TBI0 setting may be turned off.
> 
> I don't think turning TBI0 off is critical (it's handy for PAC with
> 52-bit VA but then it's short-lived if you want more security features
> like MTE).

yes, i made a mistake assuming TBI0 off is
required for (or at least compatible with) MTE.

if TBI0 needs to be on for MTE then some of my
analysis is wrong, and i expect TBI0 to be on
in the foreseeable future.

>> consider the following cases (tb == top byte):
>>
>> binary 1: user tb = any, syscall tb = 0
>>   tbi is on, "legacy binary"
>>
>> binary 2: user tb = any, syscall tb = any
>>   tbi is on, "new binary using tb"
>>   for backward compat it needs to check for new syscall abi.
>>
>> binary 3: user tb = 0, syscall tb = 0
>>   tbi can be off, "new binary",
>>   binary is marked to indicate unused tb,
>>   kernel may turn tbi off: additional pac bits.
>>
>> binary 4: user tb = mte, syscall tb = mte
>>   like binary 3, but with mte, "new binary using mte"

so this should be "like binary 2, but with mte".

>>   does it have to check for new syscall abi?
>>   or MTE HWCAP would imply it?
>>   (is it possible to use mte without new syscall abi?)
> 
> I think MTE HWCAP should imply it.
> 
>> in userspace we want most binaries to be like binary 3 and 4
>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>> loaded that is unmarked (legacy or new tb user), then either
>> the load fails (e.g. if mte is already used? or can we turn
>> mte off at runtime?) or tbi has to be enabled (prctl? does
>> this work with pac? or multi-threads?).
> 
> We could enable it via prctl. That's the plan for MTE as well (in
> addition maybe to some ELF flag).
> 
>> as for checking the new syscall abi: i don't see much semantic
>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>> has to check a feature flag before using the feature of the
>> underlying system and it does not matter much if it's a syscall
>> abi feature or cpu feature), but i don't see anything wrong
>> with AT_FLAGS if the kernel prefers that.
> 
> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
> relaxation of the syscall ABI to accept tb = any. The MTE support will
> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
> AT_FLAGS is either redundant here if MTE implies it (and no harm in
> keeping it around) or the meaning is different: a tb != 0 may be checked
> by the kernel against the allocation tag (i.e. get_user() could fail,
> the tag is not entirely ignored).
> 
>> the discussion here was mostly about binary 2,
> 
> That's because passing tb != 0 into the syscall ABI is the main blocker
> here that needs clearing out before merging the MTE support. There is,
> of course, a variation of binary 1 for MTE:
> 
> binary 5: user tb = mte, syscall tb = 0
> 
> but this requires a lot of C lib changes to support properly.

yes, i don't think we want to do that.

but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.

>> but for
>> me the open question is if we can make binary 3/4 work.
>> (which requires some elf binary marking, that is recognised
>> by the kernel and dynamic loader, and efficient handling of
>> the TBI0 bit, ..if it's not possible, then i don't see how
>> mte will be deployed).
> 
> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
> MTE or not.
> 
>> and i guess on the kernel side the open question is if the
>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>> pointers embedded into structs are passed down in ioctl.
> 
> We've been trying to track these down since last summer and we came to
> the conclusion that it should be (mostly) fine for the non-weird memory
> described above.

i think an interesting case is when userspace passes
a pointer to the kernel and later gets it back,
which is why i proposed rule 4 (kernel has to keep
the tag then).

but i wonder what's the right thing to do for sp
(user can malloc thread/sigalt/makecontext stack
which will be mte tagged in practice with mte on)
does tagged sp work? should userspace untag the
stack memory before setting it up as a stack?
(but then user pointers to that allocation may get
broken..)


Re: [PATCH v10 00/12] arm64: untag user pointers passed to the kernel

2019-02-22 Thread Szabolcs Nagy
On 22/02/2019 15:40, Andrey Konovalov wrote:
> On Fri, Feb 22, 2019 at 4:35 PM Szabolcs Nagy  wrote:
>>
>> On 22/02/2019 12:53, Andrey Konovalov wrote:
>>> This patchset is meant to be merged together with "arm64 relaxed ABI" [1].
>>>
>>> arm64 has a feature called Top Byte Ignore, which allows to embed pointer
>>> tags into the top byte of each pointer. Userspace programs (such as
>>> HWASan, a memory debugging tool [2]) might use this feature and pass
>>> tagged user pointers to the kernel through syscalls or other interfaces.
>>>
>>> Right now the kernel is already able to handle user faults with tagged
>>> pointers, due to these patches:
>>>
>>> 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
>>>  tagged pointer")
>>> 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
>>> pointers")
>>> 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
>>> pointers")
>>>
>>> This patchset extends tagged pointer support to syscall arguments.
>>>
>>> For non-memory syscalls this is done by untagging user pointers when the
>>> kernel performs pointer checking to find out whether the pointer comes
>>> from userspace (most notably in access_ok). The untagging is done only
>>> when the pointer is being checked, the tag is preserved as the pointer
>>> makes its way through the kernel.
>>>
>>> Since memory syscalls (mmap, mprotect, etc.) don't do memory accesses but
>>> rather deal with memory ranges, untagged pointers are better suited to
>>> describe memory ranges internally. Thus for memory syscalls we untag
>>> pointers completely when they enter the kernel.
>>
>> i think the same is true when user pointers are compared.
>>
>> e.g. i suspect there may be issues with tagged robust mutex
>> list pointers because the kernel does
>>
>> futex.c:3541:   while (entry != &head->list) {
>>
>> where entry is a user pointer that may be tagged, and
>> &head->list is probably not tagged.
> 
> You're right. I'll expand the cover letter in the next version to
> describe this more accurately. The patchset however contains "mm,
> arm64: untag user pointers in mm/gup.c" that should take care of futex
> pointers.

the robust mutex list pointer is not a futex pointer,
i'm not sure how the mm/gup.c patch helps.

>>
>>> One of the alternative approaches to untagging that was considered is to
>>> completely strip the pointer tag as the pointer enters the kernel with
>>> some kind of a syscall wrapper, but that won't work with the countless
>>> number of different ioctl calls. With this approach we would need a custom
>>> wrapper for each ioctl variation, which doesn't seem practical.
>>>
>>> The following testing approaches has been taken to find potential issues
>>> with user pointer untagging:
>>>
>>> 1. Static testing (with sparse [3] and separately with a custom static
>>>analyzer based on Clang) to track casts of __user pointers to integer
>>>types to find places where untagging needs to be done.
>>>
>>> 2. Static testing with grep to find parts of the kernel that call
>>>find_vma() (and other similar functions) or directly compare against
>>>vm_start/vm_end fields of vma.
>>>
>>> 3. Static testing with grep to find parts of the kernel that compare
>>>user pointers with TASK_SIZE or other similar consts and macros.
>>>
>>> 4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
>>>a modified syzkaller version that passes tagged pointers to the kernel.
>>>
>>> Based on the results of the testing the required patches have been added
>>> to the patchset.
>>>
>>> This patchset has been merged into the Pixel 2 kernel tree and is now
>>> being used to enable testing of Pixel 2 phones with HWASan.
>>>
>>> This patchset is a prerequisite for ARM's memory tagging hardware feature
>>> support [4].
>>>
>>> Thanks!
>>>
>>> [1] https://lkml.org/lkml/2018/12/10/402
>>>
>>> [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
>>>
>>> [3] 
>>> https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292
>>>
>>> [4] 
>>> https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Re: [PATCH v10 00/12] arm64: untag user pointers passed to the kernel

2019-02-22 Thread Szabolcs Nagy
On 22/02/2019 12:53, Andrey Konovalov wrote:
> This patchset is meant to be merged together with "arm64 relaxed ABI" [1].
> 
> arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> tags into the top byte of each pointer. Userspace programs (such as
> HWASan, a memory debugging tool [2]) might use this feature and pass
> tagged user pointers to the kernel through syscalls or other interfaces.
> 
> Right now the kernel is already able to handle user faults with tagged
> pointers, due to these patches:
> 
> 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
>  tagged pointer")
> 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
> pointers")
> 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
> pointers")
> 
> This patchset extends tagged pointer support to syscall arguments.
> 
> For non-memory syscalls this is done by untagging user pointers when the
> kernel performs pointer checking to find out whether the pointer comes
> from userspace (most notably in access_ok). The untagging is done only
> when the pointer is being checked, the tag is preserved as the pointer
> makes its way through the kernel.
> 
> Since memory syscalls (mmap, mprotect, etc.) don't do memory accesses but
> rather deal with memory ranges, untagged pointers are better suited to
> describe memory ranges internally. Thus for memory syscalls we untag
> pointers completely when they enter the kernel.

i think the same is true when user pointers are compared.

e.g. i suspect there may be issues with tagged robust mutex
list pointers because the kernel does

futex.c:3541:   while (entry != &head->list) {

where entry is a user pointer that may be tagged, and
&head->list is probably not tagged.

> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.
> 
> The following testing approaches has been taken to find potential issues
> with user pointer untagging:
> 
> 1. Static testing (with sparse [3] and separately with a custom static
>analyzer based on Clang) to track casts of __user pointers to integer
>types to find places where untagging needs to be done.
> 
> 2. Static testing with grep to find parts of the kernel that call
>find_vma() (and other similar functions) or directly compare against
>vm_start/vm_end fields of vma.
> 
> 3. Static testing with grep to find parts of the kernel that compare
>user pointers with TASK_SIZE or other similar consts and macros.
> 
> 4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
>a modified syzkaller version that passes tagged pointers to the kernel.
> 
> Based on the results of the testing the required patches have been added
> to the patchset.
> 
> This patchset has been merged into the Pixel 2 kernel tree and is now
> being used to enable testing of Pixel 2 phones with HWASan.
> 
> This patchset is a prerequisite for ARM's memory tagging hardware feature
> support [4].
> 
> Thanks!
> 
> [1] https://lkml.org/lkml/2018/12/10/402
> 
> [2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
> 
> [3] 
> https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292
> 
> [4] 
> https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
> 
> Changes in v10:
> - Added "mm, arm64: untag user pointers passed to memory syscalls" back.
> - New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
> - New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
> - New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
> - New patch "tracing, arm64: untag user pointers in seq_print_user_ip".
> 
> Changes in v9:
> - Rebased onto 4.20-rc6.
> - Used u64 instead of __u64 in type casts in the untagged_addr macro for
>   arm64.
> - Added braces around (addr) in the untagged_addr macro for other arches.
> 
> Changes in v8:
> - Rebased onto 65102238 (4.20-rc1).
> - Added a note to the cover letter on why syscall wrappers/shims that untag
>   user pointers won't work.
> - Added a note to the cover letter that this patchset has been merged into
>   the Pixel 2 kernel tree.
> - Documentation fixes, in particular added a list of syscalls that don't
>   support tagged user pointers.
> 
> Changes in v7:
> - Rebased onto 17b57b18 (4.19-rc6).
> - Dropped the "arm64: untag user address in __do_user_fault" patch, since
>   the existing patches already handle user faults properly.
> - Dropped the "usb, arm64: untag user addresses in devio" patch, since the
>   passed pointer must come from a vma and therefore be untagged.
> - Dropped the "arm64: annotate user pointers 

Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-19 Thread Szabolcs Nagy
On 12/02/2019 18:02, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
>>> On 19/12/2018 12:52, Dave Martin wrote:
 Really, the kernel should do the expected thing with all "non-weird"
 memory.

 In lieu of a proper definition of "non-weird", I think we should have
 some lists of things that are explicitly included, and also excluded:

 OK:
   kernel-allocated process stack
   brk area
   MAP_ANONYMOUS | MAP_PRIVATE
   MAP_PRIVATE mappings of /dev/zero

 Not OK:
   MAP_SHARED
   mmaps of non-memory-like devices
   mmaps of anything that is not a regular file
   the VDSO
   ...

 In general, userspace can tag memory that it "owns", and we do not assume
 a transfer of ownership except in the "OK" list above.  Otherwise, it's
 the kernel's memory, or the owner is simply not well defined.
>>>
>>> Agreed on the general idea: a process should be able to pass tagged 
>>> pointers at the
>>> syscall interface, as long as they point to memory privately owned by the 
>>> process. I
>>> think it would be possible to simplify the definition of "non-weird" memory 
>>> by using
>>> only this "OK" list:
>>> - mmap() done by the process itself, where either:
>>>* flags = MAP_PRIVATE | MAP_ANONYMOUS
>>>* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined 
>>> list of
>>> device files (like /dev/zero)
>>> - brk() done by the process itself
>>> - Any memory mapped by the kernel in the new process's address space during 
>>> execve(),
>>> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
> 
> Sounds reasonable.

OK. this non-weird memory definition works for me too.

rule 1: if weird memory pointers are passed to the kernel
with top byte set then the behaviour is undefined.

   * When the kernel dereferences a pointer on userspace's behalf, it
 shall behave equivalently to userspace dereferencing the same pointer,
 including use of the same tag (where passed by userspace).

   * Where the pointer tag affects pointer dereference behaviour (i.e.,
 with hardware memory colouring) the kernel makes no guarantee to
 honour pointer tags correctly for every location a buffer based on a
 pointer passed by userspace to the kernel.

 (This means for example that for a read(fd, buf, size), we can check
 the tag for a single arbitrary location in *(char (*)[size])buf
 before passing the buffer to get_user_pages().  Hopefully this could
 be done in get_user_pages() itself rather than hunting call sites.
 For userspace, it means that you're on your own if you ask the
 kernel to operate on a buffer than spans multiple, independently-
 allocated objects, or a deliberately striped single object.)
>>>
>>> I think both points are reasonable. It is very valuable for the kernel to 
>>> access
>>> userspace memory using the user-provided tag, because it enables kernel 
>>> accesses to
>>> be checked in the same way as user accesses, allowing to detect bugs that 
>>> are
>>> potentially hard to find. For instance, if a pointer to an object is passed 
>>> to the
>>> kernel after it has been deallocated, this is invalid and should be 
>>> detected.
>>> However, you are absolutely right that the kernel cannot *guarantee* that 
>>> such a
>>> check is carried out for the entire memory range (or in fact at all); it 
>>> should be a
>>> best-effort approach.
>>
>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>> not tag-checking) syscalls as reasonably possible. We would want to
>> provide tag-checking userspace wrappers for any important calls that
>> are not checked in the kernel. Is it correct to assume that anything
>> that goes through copy_from_user  / copy_to_user is checked?
> 
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
> 
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

rule 2: kernel derefs as if user derefs when non-weird memory
pointers are passed to the kernel.

note that the important bit is what happens on valid pointer
derefs, invalid pointer deref is usually undefined for user
programs, so what happens in case of mte tag failures is
more of a QoI issue than abi i think.

(e.g. EFAULT is not guaranteed by the kernel currently, i can
successfully do write(open("/dev/null",O_WRONLY), 0, 1), or
get a crash when passing invalid pointer to a vdso function,
so userspace should not rely on some strict EFAULT behaviour).

   * The kernel shall not extend the lifetime of user pointers in ways
 

Re: [PATCH 3/3] arm64/sve: Disentangle <uapi/asm/ptrace.h> from <uapi/asm/sigcontext.h>

2018-12-19 Thread Szabolcs Nagy
On 19/12/2018 15:23, Dave Martin wrote:
> On Wed, Dec 19, 2018 at 03:11:52PM +0000, Szabolcs Nagy wrote:
>> On 18/12/2018 12:14, Dave Martin wrote:
>>> On Sat, Dec 15, 2018 at 05:20:29PM +0800, kbuild test robot wrote:
> 
> [...]
> 
>>>>>> ./usr/include/asm/sve_context.h:30: found __[us]{8,16,32,64} type 
>>>>>> without #include <linux/types.h>
>>>
>>> Since the new header is not meant to be included directly (and has a
>>> guard to that effect), we don't strictly need to do anything here.
>>>
>>> The way to include <asm/sve_context.h> in userspace is via
>>> <asm/sigcontext.h> or <asm/ptrace.h>, both of which include
>>> <linux/types.h> first.
>>>
>>
>> i think there is no need to explicitly prevent the inclusion of
>> the header.
>>
>> it is enough to have a comment that it's not supposed to be
>> included by user code (so the header can be later refactored).
>>
>> and then such automated header checks (or whatever other hacks
>> ppl do temporarily) can continue to work.
> 
> The guard is in linux-next now AFAIK.
> 
> Are you saying that it's likely to break something and needs to be
> removed, or it is unnecessary but harmless?

unnecessary but harmless.

(i assumed the header check bot would want to include every
header on its own and see if there are undefined symbols, but
if it works with the guard then fine)


Re: [PATCH 3/3] arm64/sve: Disentangle <uapi/asm/ptrace.h> from <uapi/asm/sigcontext.h>

2018-12-19 Thread Szabolcs Nagy
On 18/12/2018 12:14, Dave Martin wrote:
> On Sat, Dec 15, 2018 at 05:20:29PM +0800, kbuild test robot wrote:
>> Hi Dave,
>>
>> Thank you for the patch! Perhaps something to improve:
>>
>> [auto build test WARNING on arm64/for-next/core]
>> [also build test WARNING on v4.20-rc6 next-20181214]
>> [if your patch is applied to the wrong git tree, please drop us a note to 
>> help improve the system]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/Dave-Martin/arm64-sve-UAPI-Disentangle-ptrace-h-from-sigcontext-h/20181214-225154
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git 
>> for-next/core
>> config: arm64-allmodconfig (attached as .config)
>> compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> GCC_VERSION=7.2.0 make.cross ARCH=arm64 
>>
>> All warnings (new ones prefixed by >>):
>>
 ./usr/include/asm/sve_context.h:30: found __[us]{8,16,32,64} type without 
 #include <linux/types.h>
> 
> Since the new header is not meant to be included directly (and has a
> guard to that effect), we don't strictly need to do anything here.
> 
> The way to include <asm/sve_context.h> in userspace is via
> <asm/sigcontext.h> or <asm/ptrace.h>, both of which include
> <linux/types.h> first.
> 

i think there is no need to explicitly prevent the inclusion of
the header.

it is enough to have a comment that it's not supposed to be
included by user code (so the header can be later refactored).

and then such automated header checks (or whatever other hacks
ppl do temporarily) can continue to work.


Re: [PATCH 0/3] arm64/sve: UAPI: Disentangle ptrace.h from sigcontext.h

2018-12-14 Thread Szabolcs Nagy
On 14/12/2018 18:25, Dave Martin wrote:
> On Fri, Dec 14, 2018 at 06:13:33PM +0000, Szabolcs Nagy wrote:
>> On 11/12/2018 19:26, Dave Martin wrote:
>>> This patch refactors the UAPI header definitions for the Arm SVE
>>> extension to avoid multiple-definition problems when userspace mixes its
>>> own sigcontext.h definitions with the kernel's ptrace.h (which is
>>> apparently routine).
>>>
>>> A common backend header is created to hold common definitions, suitably
>>> namespaced, and with an appropriate header guard.
>>>
>>> See the commit message in patch 3 for further explanation of why this
>>> is needed.
>>>
>>> Because of the non-trivial header guard in the new sve_context.h, patch
>>> 1 adds support to headers_install.sh to munge #if defined(_UAPI_FOO) in
>>> a similar way to the current handling of #ifndef _UAPI_FOO.
>>>
>>
>> thanks for doing this.
>>
>> the patches fix the gdb build issue on musl libc with an
>> additional gdb patch:
>> https://sourceware.org/ml/gdb-patches/2018-12/msg00152.html
>> (in userspace i'd expect users relying on signal.h providing
>> whatever is in asm/sigcontext.h.)
>>
>> i think sve_context.h could be made to work with direct include,
>> even if that's not useful because there is no public api there.
>> (and then you dont need the first patch)
> 
> My general view is that if you want the sigframe types userspace should
> usually include <ucontext.h> and refer to mcontext_t.
> 

ucontext.h does not expose the asm/sigcontext.h types in glibc,
but it is compatible with the inclusion of asm/sigcontext.h
(or signal.h).

in musl ucontext.h includes signal.h and signal.h provides
the asm/sigcontext.h api with abi compatible definitions.

> Because the prototype for sa_sigaction() specifies a void * for the
> ucontext argument, I've generally assumed that  is not
> sufficient to get ucontext_t (or mcontext_t) (but maybe I'm too paranoid
> there).

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html

"The  header shall define the ucontext_t type as a structure
 that shall include at least the following members:
...
 mcontext_t  uc_mcontext A machine-specific representation of the saved
 context."

so signal.h must define ucontext_t but mcontext_t can be opaque.
(it is opaque with posix conform feature tests to avoid
namespace pollution, but with _GNU_SOURCE defined all
asm/sigcontext.h apis are there and mcontext_t matches
struct sigcontext)

> 
> Non-POSIX-flavoured software might include <asm/sigcontext.h> directly.
> In glibc/musl libc will that conflict with <signal.h>, or can the two
> coexist?

if you compile e.g in standard conform mode then
i think signal.h and asm/sigcontext.h are compatible.

> 
> Cheers
> ---Dave
> 



Re: [PATCH 0/3] arm64/sve: UAPI: Disentangle ptrace.h from sigcontext.h

2018-12-14 Thread Szabolcs Nagy
On 11/12/2018 19:26, Dave Martin wrote:
> This patch refactors the UAPI header definitions for the Arm SVE
> extension to avoid multiple-definition problems when userspace mixes its
> own sigcontext.h definitions with the kernel's ptrace.h (which is
> apparently routine).
> 
> A common backend header is created to hold common definitions, suitably
> namespaced, and with an appropriate header guard.
> 
> See the commit message in patch 3 for further explanation of why this
> is needed.
> 
> Because of the non-trivial header guard in the new sve_context.h, patch
> 1 adds support to headers_install.sh to munge #if defined(_UAPI_FOO) in
> a similar way to the current handling of #ifndef _UAPI_FOO.
> 

thanks for doing this.

the patches fix the gdb build issue on musl libc with an
additional gdb patch:
https://sourceware.org/ml/gdb-patches/2018-12/msg00152.html
(in userspace i'd expect users relying on signal.h providing
whatever is in asm/sigcontext.h.)

i think sve_context.h could be made to work with direct include,
even if that's not useful because there is no public api there.
(and then you dont need the first patch)

> Dave Martin (3):
>   kbuild: install_headers.sh: Strip _UAPI from #if-defined() guards
>   arm64/sve: ptrace: Fix SVE_PT_REGS_OFFSET definition
>   arm64/sve: Disentangle <uapi/asm/ptrace.h> from <uapi/asm/sigcontext.h>
> 
> 
>  arch/arm64/include/uapi/asm/ptrace.h  | 39 ++---
>  arch/arm64/include/uapi/asm/sigcontext.h  | 56 +++
>  arch/arm64/include/uapi/asm/sve_context.h | 50 +++
>  scripts/headers_install.sh|  1 +
>  4 files changed, 97 insertions(+), 49 deletions(-)
>  create mode 100644 arch/arm64/include/uapi/asm/sve_context.h
> 



Re: [RFC PATCH v4 1/5] glibc: Perform rseq(2) registration at nptl init and thread creation

2018-11-26 Thread Szabolcs Nagy
On 23/11/18 21:09, Mathieu Desnoyers wrote:
> - On Nov 23, 2018, at 1:35 PM, Rich Felker dal...@libc.org wrote:
> 
>> On Fri, Nov 23, 2018 at 12:52:21PM -0500, Mathieu Desnoyers wrote:
>>> - On Nov 23, 2018, at 12:30 PM, Rich Felker dal...@libc.org wrote:
>>>
 On Fri, Nov 23, 2018 at 12:05:20PM -0500, Mathieu Desnoyers wrote:
> - On Nov 23, 2018, at 9:28 AM, Rich Felker dal...@libc.org wrote:
> [...]
>>
>> Absolutely. As long as it's in libc, implicit destruction will happen.
>> Actually I think the glibc code shound unconditionally unregister the
>> rseq address at exit (after blocking signals, so no application code
>> can run) in case a third-party rseq library was linked and failed to
>> do so before thread exit (e.g. due to mismatched ref counts) rather
>> than respecting the reference count, since it knows it's the last
>> user. This would make potentially-buggy code safer.
>
> OK, let me go ahead with a few ideas/questions along that path.
 ^^^
>
> Let's say our stated goal is to let the "exit" system call from the
> glibc thread exit path perform rseq unregistration (without explicit
> unregistration beforehand). Let's look at what we need.

 This is not "along that path". The above-quoted text is not about
 assuming it's safe to make SYS_exit without unregistering the rseq
 object, but rather about glibc being able to perform the
 rseq-unregister syscall without caring about reference counts, since
 it knows no other code that might depend on rseq can run after it.
>>>
>>> When saying "along that path", what I mean is: if we go in that direction,
>>> then we should look into going all the way there, and rely on thread
>>> exit to implicitly unregister the TLS area.
>>>
>>> Do you see any reason for doing an explicit unregistration at thread
>>> exit rather than simply rely on the exit system call ?
>>
>> Whether this is needed is an implementation detail of glibc that
>> should be permitted to vary between versions. Unless glibc wants to
>> promise that it would become a public guarantee, it's not part of the
>> discussion around the API/ABI. Only part of the discussion around
>> implementation internals of the glibc rseq stuff.
>>
>> Of course I may be biased thinking application code should not assume
Re: [RFC PATCH v4 1/5] glibc: Perform rseq(2) registration at nptl init and thread creation

2018-11-26 Thread Szabolcs Nagy
On 23/11/18 21:09, Mathieu Desnoyers wrote:
> - On Nov 23, 2018, at 1:35 PM, Rich Felker dal...@libc.org wrote:
> 
>> On Fri, Nov 23, 2018 at 12:52:21PM -0500, Mathieu Desnoyers wrote:
>>> - On Nov 23, 2018, at 12:30 PM, Rich Felker dal...@libc.org wrote:
>>>
 On Fri, Nov 23, 2018 at 12:05:20PM -0500, Mathieu Desnoyers wrote:
> - On Nov 23, 2018, at 9:28 AM, Rich Felker dal...@libc.org wrote:
> [...]
>>
>> Absolutely. As long as it's in libc, implicit destruction will happen.
>> Actually I think the glibc code should unconditionally unregister the
>> rseq address at exit (after blocking signals, so no application code
>> can run) in case a third-party rseq library was linked and failed to
>> do so before thread exit (e.g. due to mismatched ref counts) rather
>> than respecting the reference count, since it knows it's the last
>> user. This would make potentially-buggy code safer.
>
> OK, let me go ahead with a few ideas/questions along that path.
 ^^^
>
> Let's say our stated goal is to let the "exit" system call from the
> glibc thread exit path perform rseq unregistration (without explicit
> unregistration beforehand). Let's look at what we need.

 This is not "along that path". The above-quoted text is not about
 assuming it's safe to make SYS_exit without unregistering the rseq
 object, but rather about glibc being able to perform the
 rseq-unregister syscall without caring about reference counts, since
 it knows no other code that might depend on rseq can run after it.
>>>
>>> When saying "along that path", what I mean is: if we go in that direction,
>>> then we should look into going all the way there, and rely on thread
>>> exit to implicitly unregister the TLS area.
>>>
>>> Do you see any reason for doing an explicit unregistration at thread
>>> exit rather than simply rely on the exit system call ?
>>
>> Whether this is needed is an implementation detail of glibc that
>> should be permitted to vary between versions. Unless glibc wants to
>> promise that it would become a public guarantee, it's not part of the
>> discussion around the API/ABI. Only part of the discussion around
>> implementation internals of the glibc rseq stuff.
>>
>> Of course I may be biased thinking application code should not assume
>> this since it's not true on musl -- for detached threads, the thread
>> frees its own stack before exiting (and thus has to unregister
>> set_tid_address and set_robust_list before exiting).
> 
> OK, so on glibc, the implementation could rely on exit side-effect to
> implicitly unregister rseq. On musl, based on the scenario you describe,
> the library should unregister rseq explicitly before stack reclaim.
> 
> Am I understanding the situation correctly ?

i think the point is that you don't need to know these
details in order to come up with a design that allows
both implementations. (then the libc can change later)

so
- is there a need for public unregister api (does the
user do it or the rseq library implicitly unregisters)?
- is there a need for ref counting (or the rseq lib
unconditionally unregisters at the end of a thread,
the libc can certainly do this)?

>>> OK, AFAIU so you argue for leaving the __rseq_abi symbol "weak". Just making
>>> sure I correctly understand your position.
>>
>> I don't think it matters, and I don't think making it weak is
>> meaningful or useful (weak in a shared library is largely meaningless)
>> but maybe I'm missing something here.
> 
> Using a "weak" symbol in early adopter libraries is important, so they
> can be loaded together into the same process without causing loader
> errors due to many definitions of the same strong symbol.
> 
> Using "weak" in a C library is something I'm not sure is a characteristic
> we want or need, because I doubt we would ever want to load two libc at the
> same time in a given process.
> 
> The only reason I see for using "weak" for the __rseq_abi symbol in the
> libc is if we want to allow early adopter applications to define
> __rseq_abi as a strong symbol, which would make some sense.

weak really does not matter in dynamic linking
(unless you set the LD_DYNAMIC_WEAK env var for
backward compat with very old glibc, or if it's
an undefined weak reference)

>> Just blocking at start/exit won't solve the problem because
>> global-dynamic TLS in glibc involves dynamic allocation, which is hard
>> to make AS-safe and of course can fail, leaving no way to make forward
>> progress.
> 
> How hard would it be to create an async-signal-safe memory pool, which would
> be always accessed with signals blocked, so we could fix those corner-cases
> for good ?

that is hard.

in musl tls access is as-safe, but it uses a different
approach: it does all allocations at thread creation or
dlopen time.

glibc has further issues because it supports dlclose
with module unloading.

Re: Official Linux system wrapper library?

2018-11-23 Thread Szabolcs Nagy
On 23/11/18 14:11, David Newall wrote:
> On 24/11/18 12:04 am, Florian Weimer wrote:
>> But socketcall does not exist on all architectures.  Neither does
>> getpid, it's called getxpid on some architectures.
>> ...
>> I think it would be a poor approach to expose application developers to
>> these portability issues.  We need to abstract over these differences at
>> a certain layer, and applications are too late.
> 
> Interesting.  I think the opposite.  I think exposing the OS's interfaces is 
> exactly what a c-library should do.  It might also provide
> alternative interfaces that work consistently across different platforms, but 
> in addition to, not instead of the OS interface.

you don't understand the point of the c language if you think so.


Re: [RFC PATCH v4 1/5] glibc: Perform rseq(2) registration at nptl init and thread creation

2018-11-22 Thread Szabolcs Nagy
On 22/11/18 15:33, Mathieu Desnoyers wrote:
> - On Nov 22, 2018, at 10:21 AM, Florian Weimer fwei...@redhat.com wrote:
>> Right, but in case of user-supplied stacks, we actually free TLS memory
>> at this point, so signals need to be blocked because the TCB is
>> (partially) gone after that.
> 
> Unfortunately, disabling signals is not enough.
> 
> With rseq registered, the kernel accesses the rseq TLS area when returning to
> user-space after _preemption_ of user-space, which can be triggered at any
> point by an interrupt or a fault, even if signals are blocked.
> 
> So if there are cases where the TLS memory is freed while the thread is still
> running, we _need_ to explicitly unregister rseq beforehand.

i think the man page should point this out.

the memory of a registered rseq object must not be freed
before thread exit. (either unregister it or free later)

and ideally also point out that c language thread storage
duration does not provide this guarantee: it may be freed
by the implementation before thread exit (which is currently
not observable, but with the rseq syscall it is).


Re: Official Linux system wrapper library?

2018-11-14 Thread Szabolcs Nagy
On 13/11/18 19:39, Dave Martin wrote:
> On Mon, Nov 12, 2018 at 05:19:14AM -0800, Daniel Colascione wrote:
>> We should adopt a similar approach. Shipping a lower-level
>> "liblinux.so" tightly bound to the kernel would not only let the
>> kernel bypass glibc's "editorial discretion" in exposing new
>> facilities to userspace, but would also allow for tighter user-kernel
>> integration than one can achieve with a simplistic syscall(2)-style
>> escape hatch. (For example, for a long time now, I've wanted to go
>> beyond POSIX and improve the system's signal handling API, and this
>> improvement requires userspace cooperation.) The vdso is probably too
>> small and simplistic to serve in this role; I'd want a real library.
> 
> Can you expand on your reasoning here?

such lib creates a useless abi+api layer that
somebody has to maintain and document (with or
without vdso).

it obviously cannot work together with a posix
conforming libc implementation for which it would
require knowledge about

thread cancellation internals, potentially TLS
for errno, know libc types even ones that are
based on compile time feature macros (and expose
them in headers in a way that does not collide
with libc headers), abi variants the libc supports
(e.g. softfp, security hardened abi), libc
internal signals (for anything that's changing
signal masks), thread internals for syscalls that
require coordination between all user created
threads (setxid), libc internal state for syscalls
that create/destroy threads.

and thus such lib does not solve the problems
of users who actually requested wrappers for
new syscalls (since they want to call into libc
and create threads).

there is a lot of bikesheding here by people who
don't understand the constraints nor the use-cases.

an actual proposal in the thread that i think is
worth considering is to make the linux syscall
design process involve libc devs so the c api is
designed together with the syscall abi.

unfortunately i still haven't seen a solution that
makes using linux uapi headers together with libc
headers reliable, continuously testing them in
isolation is useful, but that does not solve the
potential conflicts with libc definitions.


Re: Official Linux system wrapper library?

2018-11-12 Thread Szabolcs Nagy
On 11/11/18 14:22, Daniel Colascione wrote:
> On Sun, Nov 11, 2018 at 3:09 AM, Florian Weimer  wrote:
>> We had a patch for the membarrier system call, but the kernel developers
>> could not tell us what the system call does in terms of the C/C++
>> memory model
> [snip]
>> A lot of the new system calls lack clear specifications or are just
>> somewhat misdesigned.  For example, pkey_alloc
> [snip]
>> getrandom still causes boot delays
> [snip]
>> For copy_file_range, we still have debates whether the system call (and
>> the glibc emulation) should preserve holes or not,
> [snip]
> 
> These objections illustrate my point. glibc development is not the
> proper forum for raising post-hoc objections to system call design.
> Withholding wrappers will not un-ship these system calls. Applications
> are already using them, via syscall(2). Developers and users would be
> better served by providing access to the system as it is, with
> appropriate documentation caveats, than by holding out for some
> alternate and more ideal set of system calls that may or may not
> appear in the future. This resistance to exposing the capabilities of
> the system as they are, even in flawed and warty form, is what I meant
> by "misplaced idealism" in my previous message. If the kernel provides
> a system call, libc should provide a C wrapper for it, even if in the
> opinion of the libc maintainers, that system call is flawed.

flaws can be worked around.

it's just more work to do that, hence wrappers are delayed.

(while new flawed syscalls get added, there are missing
syscalls for implementing posix semantics or for better libc
quality, so are the priorities of linux right?)

> I agree with the proposals mentioned above to split system interface
> responsibility, having glibc handle higher-level concerns like stdio
> while punting system call wrappers and other low-level facilities to a
> kernel-provided userspace library that can move faster and more
> explicitly conform to the Linux kernel's userspace ABI.

consuming linux uapi headers is a huge problem (not just for
glibc): the libc has to repeat uapi definitions under appropriate
feature macros using proper libc types etc, this usually creates
conflict between linux and libc headers and a lot of duplicated
work at every linux release. the situation would be worse if all
new types were exposed for new syscalls when they appeared.

the proposal mentioned above does not solve this in any way.


Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc

2018-10-12 Thread Szabolcs Nagy
On 11/10/18 20:42, Mathieu Desnoyers wrote:
> - On Oct 11, 2018, at 1:04 PM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
> 
>> On 11/10/18 17:37, Mathieu Desnoyers wrote:
>>> - On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy szabolcs.n...@arm.com 
>>> wrote:
>>>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>>>> - On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy szabolcs.n...@arm.com 
>>>>> wrote:
>>>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>>>> ...
>>>> but it's in a magic struct that's called "abi" which is confusing,
>>>> the counter is not abi, it's in a hidden object.
>>>
>>> No, it is really an ABI between user-space apps/libs. It's not meant to be
>>> hidden. glibc implements its own register/unregister functions (it does not
>>> link against librseq). librseq exposes register/unregister functions as 
>>> public
>>> APIs. Those also use the refcount. I also plan to have existing libraries, 
>>> e.g.
>>> liblttng-ust and possibly liburcu flavors, implement the
>>> registration/unregistration and refcount handling on their own, so we don't
>>> have to add a requirement on additional linking on librseq for pre-existing
>>> libraries.
>>>
>>> So that refcount is not an ABI between kernel and user-space, but it's a
>>> user-space ABI nevertheless (between program and shared objects).
>>>
>>
>> if that's what you want, then your declaration is wrong.
>> the object should not have hidden visibility.
> 
> Actually, if we look closer into my patch, it defines two symbols,
> one of which is an alias:
> 
> __attribute__((visibility("hidden"))) __thread
> volatile struct libc_rseq __lib_rseq_abi = {
> .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> };
> 
> extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
> volatile struct rseq __rseq_abi;
> 
> Note that the public __rseq_abi symbol is weak but does not have
> hidden visibility. I do this to ensure I don't get prototype
> mismatch for __rseq_abi between rseq.c and rseq.h (it is required
> to be a struct rseq by rseq.h), but I want the space to hold the
> extra refcount field present in struct libc_rseq.
> 

but that's wrong: the weak symbol might get resolved to
a different object in another module, while you increment
a local refcounter, so there is no coordination between
userspace components.

this was the reason for my first question in my original mail,
as soon as i saw the local counter i suspected this is broken.

and "assume there is an extra counter field" is not
acceptable as user space abi; if the counter is relevant
across modules then expose the entire struct.

>> either the struct should be public abi (extern tls
>> symbol) or the register/unregister functions should
>> be public abi (so when multiple implementations are
>> present in the same process only one of them will
>> provide definition for the public abi symbol and
>> thus there will be one refcounter).
> 
> Those are two possible solutions, indeed. Considering that
> we already need to expose the __rseq_abi symbol as a public
> ABI in a way that ensures that multiple implementations
> in a same process end up only using one of them, it seems
> straightforward to simply extend that structure and hold the
> refcount there, rather than having two extra ABI symbols
> (register/unregister functions).
> 
> One very appropriate question here is whether we want to
> expose the layout of struct libc_rseq (which includes the
> refcount) in a public header file, and if so, which project
> should hold it ? Or do we just want to document the layout
> of this ABI so projects can define the structure layout
> internally ? As my implementation currently stands, I have
> the following structure duplicated into rseq selftests,
> librseq, and glibc:
> 

"not exposed" and "the counter is abi" together is not
useful: either you want coordination in user-space or
not, and that decision should imply the userspace abi/api
(e.g. adding a counter to the user-space struct).

it is true that only modules that implement registration
need to know about the counter and normal users don't,
but if you want any coordination then the layout must
be fixed and that should be exposed somewhere to avoid
breakage.

(i think ideally the api would be controlled by functions
and not object symbols with magic layout, but the rseq
design is already full of such m

Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc

2018-10-11 Thread Szabolcs Nagy
On 11/10/18 17:37, Mathieu Desnoyers wrote:
> - On Oct 11, 2018, at 12:20 PM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
>> On 11/10/18 16:13, Mathieu Desnoyers wrote:
>>> - On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy szabolcs.n...@arm.com 
>>> wrote:
>>>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>>>> +__attribute__((visibility("hidden"))) __thread
>>>>> +volatile struct libc_rseq __lib_rseq_abi = {
>>>> ...
>> but it's in a magic struct that's called "abi" which is confusing,
>> the counter is not abi, it's in a hidden object.
> 
> No, it is really an ABI between user-space apps/libs. It's not meant to be
> hidden. glibc implements its own register/unregister functions (it does not
> link against librseq). librseq exposes register/unregister functions as public
> APIs. Those also use the refcount. I also plan to have existing libraries, 
> e.g.
> liblttng-ust and possibly liburcu flavors, implement the
> registration/unregistration and refcount handling on their own, so we don't
> have to add a requirement on additional linking on librseq for pre-existing
> libraries.
> 
> So that refcount is not an ABI between kernel and user-space, but it's a
> user-space ABI nevertheless (between program and shared objects).
> 

if that's what you want, then your declaration is wrong.
the object should not have hidden visibility.

then each library (glibc etc) will have its own separate
tls object with their own separate refcounter (and they
will unregister when their own refcounter hits 0)

either the struct should be public abi (extern tls
symbol) or the register/unregister functions should
be public abi (so when multiple implementations are
present in the same process only one of them will
provide definition for the public abi symbol and
thus there will be one refcounter).


Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc

2018-10-11 Thread Szabolcs Nagy
On 11/10/18 16:13, Mathieu Desnoyers wrote:
> - On Oct 11, 2018, at 6:37 AM, Szabolcs Nagy szabolcs.n...@arm.com wrote:
> 
>> On 10/10/18 20:19, Mathieu Desnoyers wrote:
>>> In order to integrate rseq into user-space applications, add a reference
>>> counter field after the struct rseq TLS ABI so many rseq users can be
>>> linked into the same application (e.g. librseq and glibc). The
>>> reference count ensures that rseq syscall registration/unregistration
>>> happens only for the most early/late user for each thread, thus ensuring
>>> that rseq is registered across the lifetime of all rseq users for a
>>> given thread.
>> ...
>>> +__attribute__((visibility("hidden"))) __thread
>>> +volatile struct libc_rseq __lib_rseq_abi = {
>> ...
>>> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
>>> +volatile struct rseq __rseq_abi;
>> ...
>>> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>>> sigset_t oldset;
>>>  
>>> signal_off_save();
>>> -   if (refcount++)
>>> +   if (__lib_rseq_abi.refcount++)
>>> goto end;
>>> rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
>>
>> why do you use a local refcounter instead of the __rseq_abi one?
> 
> There is no refcount in struct rseq (the ABI between kernel and user-space).
> The registration refcount was part of an earlier version of the rseq system 
> call,
> but we decided against keeping it in the kernel.
> 
> So I'm adding one _after_ struct rseq, purely to allow interaction between
> various user-space components (program/libraries).

then all those components must use the same

  rseq_register_current_thread
  rseq_unregister_current_thread

functions and not call the syscall on their own.

in which case the refcount could be a static __thread variable.

but it's in a magic struct that's called "abi" which is confusing,
the counter is not abi, it's in a hidden object.

>> what prevents calling rseq_register_current_thread more than 4G times?
> 
> Nothing. It would indeed be cleaner to error out if we detect that refcount 
> is at
> INT_MAX. Is that what you have in mind ?

yes

>> why cant the kernel see that the same address is registered again and 
>> succeed?
> 
> It can, and it does. However, refcounting at user-level is needed to ensure
> the registration "lifetime" for rseq covers its entire use. If we have two 
> libraries
> using rseq, we end up with the following scenario:
> 
> Thread 1
> 
>   libA registers rseq
>   libB registers rseq
>   libB unregisters rseq
>   libA uses rseq -> bug! it's been unregistered by libB.
>   libA unregisters rseq -> unexpected, it's already been unregistered.
>  
> same applies if libA unregisters rseq before libB (and libB try to use rseq
> after libA has unregistered).
> 
> The refcount in user-space fixes this.

i see.

> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 



Re: [RFC PATCH for 4.21 01/16] rseq/selftests: Add reference counter to coexist with glibc

2018-10-11 Thread Szabolcs Nagy
On 10/10/18 20:19, Mathieu Desnoyers wrote:
> In order to integrate rseq into user-space applications, add a reference
> counter field after the struct rseq TLS ABI so many rseq users can be
> linked into the same application (e.g. librseq and glibc). The
> reference count ensures that rseq syscall registration/unregistration
> happens only for the most early/late user for each thread, thus ensuring
> that rseq is registered across the lifetime of all rseq users for a
> given thread.
...
> +__attribute__((visibility("hidden"))) __thread
> +volatile struct libc_rseq __lib_rseq_abi = {
...
> +extern __attribute__((weak, alias("__lib_rseq_abi"))) __thread
> +volatile struct rseq __rseq_abi;
...
> @@ -70,7 +86,7 @@ int rseq_register_current_thread(void)
>   sigset_t oldset;
>  
>   signal_off_save();
> - if (refcount++)
> + if (__lib_rseq_abi.refcount++)
>   goto end;
>   rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);

why do you use a local refcounter instead of the __rseq_abi one?

what prevents calling rseq_register_current_thread more than 4G times?

why can't the kernel see that the same address is registered again and succeed?


Re: [RFC PATCH] glibc: Perform rseq(2) registration at nptl init and thread creation

2018-09-20 Thread Szabolcs Nagy

On 19/09/18 22:01, Mathieu Desnoyers wrote:

- On Sep 19, 2018, at 1:38 PM, Szabolcs Nagy szabolcs.n...@arm.com wrote:

note that libpthread.so is built with -ftls-model=initial-exec


Which would indeed make these annotations redundant. I'll remove
them.


(and if it wasn't then you'd want to put the attribute on the
declaration in the internal header file, not on the definition,
so the actual tls accesses generate the right code)


This area is one where I'm still uneasy on my comprehension of
the details, especially that it goes in a different direction than
what you are recommending.

I've read through https://www.akkadia.org/drepper/tls.pdf Section 5
"Linker Optimizations" to try to figure it out, and I end up being
under the impression that applying the tls_model("initial-exec")
attribute to a symbol declaration in a header file does not have
much impact on the accesses that use that variable. Reading through
that section, it seems that the variable definition is the one that
matters, and then the compiler/linker/loader are tweaking the sites
that reference the TLS variable through code rewrite based on the
most efficient mechanism that each phase knows can be used at each
stage.

What am I missing ?


in general if you rely on linker relaxations you may not
get optimal code because the linker cannot remove
instructions, just nop them out.

(e.g. on aarch64 an initial-exec access is 4 instructions
a general dynamic (tlsdesc) access is 6 instructions +
it involves a call, so the return address has to be saved
and restored (+ 3 instructions for stack operations if
there were none otherwise, which the linker cannot change))



Re: [RFC PATCH] glibc: Perform rseq(2) registration at nptl init and thread creation

2018-09-19 Thread Szabolcs Nagy

On 19/09/18 15:44, Mathieu Desnoyers wrote:

Things to consider:

- Move __rseq_refcount to an extra field at the end of __rseq_abi to
   eliminate one symbol. This would require to wrap struct rseq into
   e.g. struct rseq_lib or such, e.g.:

struct rseq_lib {
   struct rseq kabi;
   int refcount;
};

All libraries/programs which try to register rseq (glibc, early-adopter
applications, early-adopter libraries) should use the rseq refcount.
It becomes part of the ABI within a user-space process, but it's not
part of the ABI shared with the kernel per se.

- Restructure how this code is organized so glibc keeps building on
   non-Linux targets.

- We do not need an atomic increment/decrement for the refcount per
   se. Just being atomic with respect to the current thread (and nested
   signals) would be enough. What is the proper API to use there ?
   Should we expose struct rseq_lib in a public glibc header ? Should
   we create a rseq(3) man page ?

- Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we
   want a non-weak symbol there ? (and let all other early user
   libraries use weak)



i don't think there is precedent for exposing a tls symbol in glibc
(e.g. errno is exposed via the __errno_location function) so there
might be issues with this (but i don't have immediate concerns).


diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index fe75d04113..20ee197d94 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__;
  /* Number of threads running.  */
  unsigned int __nptl_nthreads = 1;
  
+__attribute__((weak, tls_model("initial-exec"))) __thread
+volatile struct rseq __rseq_abi = {
+   .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+__attribute__((weak, tls_model("initial-exec"))) __thread
+volatile int __rseq_refcount;
 


note that libpthread.so is built with -ftls-model=initial-exec

(and if it wasn't then you'd want to put the attribute on the
declaration in the internal header file, not on the definition,
so the actual tls accesses generate the right code)



Re: framebuffer corruption due to overlapping stp instructions on arm64

2018-08-03 Thread Szabolcs Nagy

On 03/08/18 08:53, Florian Weimer wrote:

On 08/03/2018 09:11 AM, Andrew Pinski wrote:

Yes fix Links not to use memcpy on the framebuffer.
It is undefined behavior to use device memory with memcpy.


Some (de facto) ABIs require that it is supported, though.  For example, the POWER string functions avoid unaligned loads and stores for this 
reason because the platform has the same issue with device memory.  And yes, GCC will expand memcpy on POWER to something that is incompatible 
with device memory. 8-(




i think it's not reasonable to require libc memcpy to work on device memory.

i think if device memory is exposed to regular userspace applications that 
should be fixed.


If we don't want people to use memcpy, we probably need to provide a credible 
alternative.

Thanks,
Florian




Re: [(resend)] seq_file: reset iterator to first record for zero offset

2017-11-08 Thread Szabolcs Nagy
* Miklos Szeredi  [2016-12-19 12:38:00 +0100]:
> Al,
> 
> Can you please take (or NACK) this patch please?
> 
> Thanks,
> Miklos
> ---
> From: Tomasz Majchrzak 
> Date: Tue, 29 Nov 2016 15:18:20 +0100
> 
> If kernfs file is empty on a first read, successive read operations
> using the same file descriptor will return no data, even when data is
> available. Default kernfs 'seq_next' implementation advances iterator
> position even when next object is not there. Kernfs 'seq_start' for
> following requests will not return iterator as position is already on
> the second object.
> 
> This defect doesn't allow to monitor badblocks sysfs files from MD raid.
> They are initially empty but if data appears at some stage, userspace is
> not able to read it.
> 
> Signed-off-by: Tomasz Majchrzak 
> Signed-off-by: Miklos Szeredi 
> ---

this patch broke userspace abi:

commit e522751d605d99a81508e58390a8f51ee96fb662
Author: Tomasz Majchrzak 
AuthorDate: 2016-11-29 15:18:20 +0100
Commit: Al Viro 
CommitDate: 2016-12-22 23:03:06 -0500

seq_file: reset iterator to first record for zero offset

reported in may at:
https://bugzilla.kernel.org/show_bug.cgi?id=195697

read(fd,buf,0) on sysfs/procfs changes the behaviour of the next read:
the next read reads the first line twice.

same issue with readv() with a 0 length buffer.

test code:

#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char* argv[])
{
char buf1[512] = {0};
char buf2[512] = {0};
int fd;

fd = open("/proc/mounts", O_RDONLY);
if (read(fd, buf1, 0) < 0) return 1;
if (read(fd, buf1, 512) < 0) return 1;
lseek(fd, 0, SEEK_SET);
if (read(fd, buf2, 512) < 0) return 1;

// buf1 should be the same as buf2,
// the first 512 bytes of /proc/mounts

buf1[511]=buf2[511]='\n';
write(1, "# buf1:\n", 8);
write(1, buf1, 512);  // prints the first line twice
write(1, "# buf2:\n", 8);
write(1, buf2, 512);

return memcmp(buf1, buf2, 512) != 0;
}

stdio in musl libc can use readv with 0 length buffer in some cases,
and various tools use stdio to read these synthetic filesystems
so this is observable regression between linux v4.9 and v4.10

(i think musl can avoid the 0 length buffer in stdio, but the
linux behaviour is still incorrect. in general readv/writev could
have more posix-conforming behaviour on sysfs/procfs, currently they
don't behave as atomic fs operations which is surprising:
writev with several buffers behaves as if several independent write
syscalls were made instead of one, which can cause issues when users
do 'echo 12 >/proc/foo' and writev is used in the stdio implementation)


>  fs/seq_file.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -190,6 +190,13 @@ ssize_t seq_read(struct file *file, char
>*/
>   m->version = file->f_version;
>  
> + /*
> +  * if request is to read from zero offset, reset iterator to first
> +  * record as it might have been already advanced by previous requests
> +  */
> + if (*ppos == 0)
> + m->index = 0;
> +
>   /* Don't assume *ppos is where we left it */
>   if (unlikely(*ppos != m->read_pos)) {
>   while ((err = traverse(m, *ppos)) == -EAGAIN)


Re: [PATCH resent] uapi libc compat: allow non-glibc to opt out of uapi definitions

2017-03-08 Thread Szabolcs Nagy
* Carlos O'Donell  [2017-03-08 10:53:00 -0500]:
> On 11/11/2016 07:08 AM, Felix Janda wrote:
> > fixes the following compiler errors when  is included
> > after musl :
> > 
> > ./linux/in6.h:32:8: error: redefinition of 'struct in6_addr'
> > ./linux/in6.h:49:8: error: redefinition of 'struct sockaddr_in6'
> > ./linux/in6.h:59:8: error: redefinition of 'struct ipv6_mreq'
> 
> Do you have plans for fixing the error when the inclusion order is the other 
> way?

the other way (linux header included first) is
problematic because linux headers don't follow
all the standards the libc follows, they violate
namespace rules in their struct definitions, so
the libc definitions are necessarily incompatible
with them and thus different translation units can
end up referring to the same object through
incompatible types which is undefined.
(even if the abi matches and thus works across
the syscall interface, a sufficiently smart
toolchain can break such code at link time,
and since the libc itself uses its own definitions
that's what user code should use too).

there should be a way to include standard-conforming
libc headers and linux headers into the same tu,
at least the case when all conflicting definitions
come from the libc should work and i think that
should be the scope of these libc-compat.h changes.
(of course if glibc tries to support arbitrary
interleavings then the changes should not break that)


Re: Formal description of system call interface

2016-11-22 Thread Szabolcs Nagy
* Dmitry Vyukov <dvyu...@google.com> [2016-11-21 16:03:04 +0100]:
> On Mon, Nov 7, 2016 at 1:28 AM, Szabolcs Nagy <n...@port70.net> wrote:
> > * Dmitry Vyukov <dvyu...@google.com> [2016-11-06 14:39:28 -0800]:
> >> For the reference, current syzkaller descriptions are in txt files here:
> >> https://github.com/google/syzkaller/tree/master/sys
> > ...
> >> Taking the opportunity, if you see that something is missing/wrong
> >> in the descriptions of the subsystem you care about, or if it is not
> >> described at all, fixes are welcome.
> >
> > abi variants are missing (abi variation makes a lot of
> > syscall interface related work painful).
> 
> 
> What exactly do you mean by "abi variants"? Is it architecture?
> What exactly needs to be added to the descriptions support "abi variants?
> 

abi variant is a supported syscall abi
(linux supports several abis on the same arch:
e.g. x86 has i386,x86_64,x32 one might also
call different endianness a different abi, but
i'd expect le/be to be handled together.)

available syscalls, argument types and ordering
can be different across abis

i may be wrong, but i did not see those handled
in sys.txt, looking at it now i see missing
paddings in ipc structs (*_ds) so it may not
work the way i assumed (these are types which
have some abi variation).


Re: Formal description of system call interface

2016-11-06 Thread Szabolcs Nagy
* Dmitry Vyukov  [2016-11-06 14:39:28 -0800]:
> This is notes from the discussion we had at Linux Plumbers this week
> regarding providing a formal description of system calls (user API).

yes a database of the syscall abis would be useful
depending on the level of detail it has.

(right now there is not even spec about what registers
the syscall entry point may clobber on the various abis
which would be useful to know when making syscalls)

> Action points:
>  - polish DSL for description (must be extensible)
>  - write a parser for DSL
>  - provide definition for mm syscalls (mm is reasonably simple
>and self-contained)
>  - see if we can do validation of mm arguments

for all abi variants? e.g. mmap offset range is abi dependent.

> For the reference, current syzkaller descriptions are in txt files here:
> https://github.com/google/syzkaller/tree/master/sys
...
> Taking the opportunity, if you see that something is missing/wrong
> in the descriptions of the subsystem you care about, or if it is not
> described at all, fixes are welcome.

abi variants are missing (abi variation makes a lot of
syscall interface related work painful).


Re: [RFC PATCH 00/27] ARM64: support ILP32

2016-06-21 Thread Szabolcs Nagy
On 21/06/16 06:06, Yury Norov wrote:
> This series enables aarch64 port with ilp32 mode.
> 
> After long discussions in kernel list, we finally got
> consensus on how ABI should look. This patchset adds
> support for the ABI in GLIBC. It is tested with LTP 
> with no big regressions comparing to LP64 and AARCH32.
> 
> Though it's very raw. Please be patient reviewing it.
> 

note that aarch64 hwcap is 64 bits, so there will be
an issue on ilp32 once more than 32 bits are used.

hwcap can be queried through getauxval and it is
passed to ifunc resolvers.

limiting hwcaps to 32bits may become a problem.



Re: [PATCH 02/27] [AARCH64] Add header guards to sysdep.h headers.

2016-06-21 Thread Szabolcs Nagy
On 21/06/16 06:06, Yury Norov wrote:
> From: Andrew Pinski 
> 
> * sysdeps/aarch64/sysdep.h: Add header guards.
> 

the things listed below are not part of the patch
(upstream glibc already has these fixes)

> [AARCH64] Remove 64 from some relocation names as they have been renamed in 
> later versions of the spec.
> 
> The AARCH64 elf ABI spec renamed some relocations removing 64 from the TLS
> relocation names to make them constaint with the ILP32 named ones.
> 
> * elf/elf.h (R_AARCH64_TLS_DTPMOD64): Rename to ..
> (R_AARCH64_TLS_DTPMOD): This.
> (R_AARCH64_TLS_DTPREL64): Rename to ...
> (R_AARCH64_TLS_DTPREL): This.
> (R_AARCH64_TLS_TPREL64): Rename to ...
> (R_AARCH64_TLS_TPREL): This.
> * sysdeps/aarch64/dl-machine.h (elf_machine_type_class): Update
> R_AARCH64_TLS_DTPMOD64, R_AARCH64_TLS_DTPREL64, and R_AARCH64_TLS_TPREL64.
> (elf_machine_rela): Likewise.
...



Re: [PATCH 01/27] [AARCH64] Fix utmp struct for compatibility reasons.

2016-06-21 Thread Szabolcs Nagy
On 21/06/16 06:06, Yury Norov wrote:
> From: Andrew Pinski 
> 
> NOTE This is an ABI change for AARCH64.
> If you have some AARCH32 and AARCH64 applications and they both use
> utmp, one of them will fail due to the use of time_t inside the
> utmp binary format.
> 
> This fixes the problem by setting __WORDSIZE_TIME64_COMPAT32.

i think changing the abi now is not ok.

this is BZ 17470 and i think it should be discussed separately
from ilp32.

if it's possible to detect the utmp file format, that would
allow a reasonable fix, the way glibc changes the struct def
based on lp64 targets is non-conforming.



Re: [PATCH v6 00/21] ILP32 for ARM64

2016-06-03 Thread Szabolcs Nagy
On 02/06/16 20:03, Yury Norov wrote:
> On Tue, May 24, 2016 at 03:04:29AM +0300, Yury Norov wrote:
>> ILP32 glibc branch is available here:
>> https://github.com/norov/glibc/tree/ilp32-2.23
>>
> 
> So for AARCH64/ILP32 we turn next types to 64-bit in glibc:
> #define __INO_T_TYPE__UQUAD_TYPE
> #define __OFF_T_TYPE__SQUAD_TYPE
> #define __BLKCNT_T_TYPE __SQUAD_TYPE
> #define __FSBLKCNT_T_TYPE   __UQUAD_TYPE
> #define __FSFILCNT_T_TYPE   __UQUAD_TYPE
> 
> And define:
> # define __INO_T_MATCHES_INO64_T1
> # define __OFF_T_MATCHES_OFF64_T1
> # define __BLKCNT_T_MATCHES_BLKCNT64_T  1
> # define __FSBLKCNT_T_MATCHES_FSBLKCNT64_T  1
> # define __FSFILCNT_T_MATCHES_FSFILCNT_T1
> 
> If so, stat and statfs  structures for ilp32 are turning the same as
> for lp64. And so we'd handle related syscalls with native lp64
> handlers (wrapped, to zero top halves) in kernel. 
> 
> And we don't need stat64 for ilp32.
> 
> Did I miss something? Is everything correct?
> 
> OFF_T is turned to 64-bit quite smoothly, others make applications
> crash with segfault. Now I'm in deep debugging.
> 

based on previous discussions, non-trivial glibc changes may
be needed to make 64bit fs apis the default on a 32bit abi.
http://sourceware.org/ml/libc-alpha/2016-05/msg00337.html
http://sourceware.org/ml/libc-alpha/2016-05/msg00356.html

but i guess if you consistently fix the 32bit assumptions
then it should work.

> https://github.com/norov/glibc/commits/ilp32-dev
> https://github.com/norov/linux/commits/ilp32
> 
> Yury.
> 



Re: [PATCH 01/23] all: syscall wrappers: add documentation

2016-05-26 Thread Szabolcs Nagy
On 26/05/16 15:20, Catalin Marinas wrote:
> While writing the above, I realised the current ILP32 patches still miss
> on converting pointers passed from user space (unless I got myself
> confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx()
> macros take care of zero or sign extension via __SC_COMPAT_CAST().
> However, we have two more existing cases which I don't see covered:
> 
> a) Native syscalls taking a pointer argument and invoked directly from
>ILP32. For example, sys_read() takes a pointer but I don't see any
>__SC_WRAP added by patch 5
> 
> b) Current compat syscalls taking a pointer argument. For example,
>compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes
>it is a 64-bit variable. I don't see where the upper half is zeroed
> 

on x32, sign/zero extension is currently left to userspace,
which is difficult to deal with: (long long)arg does the
wrong thing for pointer args.

> We can solve (a) by adding more __SC_WRAP annotations in the generic
> unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty
> on AArch32/compat support where it isn't needed. So maybe davem has a
> point on the overall impact of always zeroing the upper half of the
> arguments ;) (both from a performance and maintainability perspective).
> I guess this part of the ABI is still up for discussion.
> 



Re: [PATCH v6 00/21] ILP32 for ARM64

2016-05-25 Thread Szabolcs Nagy
On 24/05/16 01:04, Yury Norov wrote:
> This version is based on kernel v4.6.
> It works with glibc-2.23, and tested with LTP.
> 
...
> ILP32 glibc branch is available here:
> https://github.com/norov/glibc/tree/ilp32-2.23
> 
> It is tested with this series with no major downsides. I will send it to 
> glibc-alpha soon, after final revise. Please review and comment it as well.

i spotted one __ilp32__ vs __ILP32__ typo in the glibc code;
i can review it in detail when there is a cleaned up patch set.

in general the approach seems ok. the ugly part is where lp64
and ilp32 share code but ilp32 needs some tweaks compared to
the current code (e.g. x vs w regs in asm, long changed to
long long in syscalls, different relocations etc); those will
be hard to review. the naming is sometimes _be_ilp32, sometimes
ilp32_be, but let's hope there will be no new abi variant to
confuse this further.



Re: [RFC6 PATCH v6 00/21] ILP32 for ARM64

2016-05-17 Thread Szabolcs Nagy
On 05/04/16 23:08, Yury Norov wrote:
> This version is rebased on kernel v4.6-rc2, and has fixes in signal subsystem.
> It works with updated glibc [1] (though very draft), and tested with LTP.
> 
> It was tested on QEMU and ThunderX machines. No major difference found.
> This is RFC because ILP32 is not tested in big-endian mode.
> 
>  v3: https://lkml.org/lkml/2014/9/3/704
>  v4: https://lkml.org/lkml/2015/4/13/691
>  v5: https://lkml.org/lkml/2015/9/29/911
> 
>  v6:
>  - time_t, __kenel_off_t and other types turned to be 32-bit
>for compatibility reasons (after v5 discussion);

i added libc-alpha to cc, the thread is at
https://lkml.org/lkml/2016/4/5/872
or http://thread.gmane.org/gmane.linux.kernel.cross-arch/31449

the kernel position seems to be that the aarch64 ilp32 syscall abi
will follow regular 32bit abis (apart from signal context and ptrace
related things and syscalls will take 64bit arguments).

i think the reasoning is that this allows legacy (non-64bit-safe)
software to use this abi on aarch64 without trouble.

i think this design should be ok for glibc.

there are some interop issues between processes following different abis
(e.g. shared mutexes, abi specific file formats, lib paths,..) which apply,
but i assume it is no different from existing 32bit vs 64bit issues.

the 32bit time_t and off_t are ugly and i wonder if 32bit __kernel_off_t
is really necessary. (i don't see where that is changed: it is supposed
to be __kernel_long_t which is 64bit).

i think even legacy software should be able to deal with 64bit off_t,
so we could avoid having two sets of filesystem apis; or is 64bit-only
off_t more work to do in linux/glibc?

>  - related changes applied to ILP32 syscall table and handlers;
>  - ILP32 VDSO code excluded. It's not mandatory, and caused questions
>during review process. We definitely make sure we will follow up
>with a VDSO later on because it is needed for performance reasons;
>  - fixed build issues with different combinations of AARCH32 / ILP32
>enabling in config;
>  - ILP32 TLS bug fixed;
>  - entry32-common.S introduced to hold wrappers needed for both ILP32
>and AARCH32_EL0;
>  - documentation updated according to latest changes;
>  - rebased to the current head;
>  - coding style re-checked;
>  - ILP32 syscall table turned around.
> 
>rfc3:
>  - all structures and system calls are just like AARCH32 ones now. with 2
>exceptions: syscalls that take 64-bit parameter in 2 32-bit regosters
>are replaced with LP64 version; struct rt_sigframe is constructed both
>from LP64 and AARCH32 fields to be consistent with AARCH64 register set;
>  - documentation rewritten accordingly;
>  - common code for all 3 ABIs is moved to separated files for easy use,
>new headers and objects are introduced, incl: is_compat.h, thread_bits.h,
>signal_common.h, signal32_common.h.
>  - ILP32 VDSO code restored, Nathans comments are addressed;
>  - patch "arm64: ilp32: force IPC_64 in msgctl, shmctl, semctl" removed, as
>Arnd suggested general solution for IPC_64 problem.
> 
>rfc4:
>  - sys_ilp32.c syscall list is fixed according to comments;
>  - binfmt_elf32.c and binfmt_ilp32.c are introduced to host the code handling
>corresponding formats;
>  - statfs64, fstsatfs64 and mmap wrappers are removed;
>  - rebased on v4.4-rc8 + http://www.spinics.net/lists/kernel/msg2151759.html
> 
>  rfc5:
>  - addressed rfc4 comments;
>  - turned s390 compat wrappers to be generic and applied it to arm64/ilp32.
>Heiko Carsten and Martin Schwidefsky added to CC as s390 maintainers.
> 
>  rfc6:
>  - glibc follows new ABI, [1];
>  - significant rework for signal subsystem (patches 21, 23) - struct ucontext
>is now corresponds user representation;
>  - compat wrappers and 32-bit off_t patchsets are joined with this patchset,
>as for now ilp32 is the only user for them;
>  - moved to kernel v4.6-rc2;
>  - few minor bugfixes.
> 
> [1] https://github.com/norov/glibc/tree/new-api
> 
> Andrew Pinski (7):
>   arm64: ensure the kernel is compiled for LP64
>   arm64: rename COMPAT to AARCH32_EL0 in Kconfig
>   arm64: change some CONFIG_COMPAT over to use CONFIG_AARCH32_EL0
> instead
>   arm64:uapi: set __BITS_PER_LONG correctly for ILP32 and LP64
>   arm64:ilp32: add sys_ilp32.c and a separate table (in entry.S) to use
> it
>   arm64: ilp32: introduce ilp32-specific handlers for sigframe and
> ucontext
>   arm64:ilp32: add ARM64_ILP32 to Kconfig
> 
> Bamvor Jian Zhang (1):
>   arm64: compat: change config dependences to aarch32
> 
> Philipp Tomsich (1):
>   arm64:ilp32: add vdso-ilp32 and use for signal return
> 
> Yury Norov (16):
>   all: syscall wrappers: add documentation
>   all: introduce COMPAT_WRAPPER option and enable it for s390
>   all: s390: move wrapper infrastructure to generic headers
>   all: s390: move compat_wrappers.c from arch/s390/kernel to kernel/
>   all: wrap needed syscalls in generic unistd
>   compat ABI: use 
