On Mon, Apr 28, 2025 at 11:14:06AM +0100, Ryan Roberts wrote:
> On 25/04/2025 23:45, Kees Cook wrote:
> > In commit bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing
> > direct loader exec"), the brk was moved out of the mmap region when
> > loading static PIE binaries (ET_DYN without INTERP). The common case
> > for these binaries was testing new ELF loaders, so the brk needed to
> > be away from mmap to avoid colliding with stack, future mmaps (of the
> > loader-loaded binary), etc. But this was only done when ASLR was enabled,
> > in an attempt to minimize changes to memory layouts.
> 
> If it's ok to move the brk to low memory for the !INTERP case, why is it not 
> ok
> to just load the whole program in low memory? Perhaps if the thing that is 
> being
> loaded does turn out to be the interpretter then it will move the brk to just
> after to the program it loads so there is no conflict (I'm just guessing).

The bulk of the rationale is in commit eab09532d400 ("binfmt_elf: use
ELF_ET_DYN_BASE only for PIE"). But it mostly boils down to "try to keep
things as far apart as possible to avoid having things collide, which
is especially problematic on 32-bit systems". Also, since memory layouts
also end up getting limited by userspace assumptions, as seen with commit
c715b72c1ba4 ("mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes"),
it's been shown we want to change as little as possible at a time. :)
The intent was to lower ELF_ET_DYN_BASE further, but it ended up not
being possible on x86 nor arm64. :( So, yes, it would work for 64-bit
archs, but not 32-bit. I've been trying to avoid region selection being
arch-width-specific.

So, since brk is small and isolated, this has proven a viable thing to
move (rather than the whole program), and with the default being ASLR
enabled it's been in this position for a while now. Doing it also for
non-ASLR should be okay too.

> > After adding support to respect alignment requirements for static PIE
> > binaries in commit 3545deff0ec7 ("binfmt_elf: Honor PT_LOAD alignment
> > for static PIE"), it became possible to have a large gap after the
> > final PT_LOAD segment and the top of the mmap region. This means that
> > future mmap allocations might go after the last PT_LOAD segment (where
> > brk might be if ASLR was disabled) instead of before them (where they
> > traditionally ended up).
> > 
> > On arm64, running with ASLR disabled, Ubuntu 22.04's "ldconfig" binary,
> > a static PIE, has alignment requirements that leaves a gap large enough
> > after the last PT_LOAD segment to fit the vdso and vvar, but still leave
> > enough space for the brk (which immediately follows the last PT_LOAD
> > segment) to be allocated by the binary.
> > 
> > fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 
> > /home/ubuntu/glibc-2.35/build/elf/ldconfig
> > fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 
> > /home/ubuntu/glibc-2.35/build/elf/ldconfig
> 
> nit: I captured this with a locally built version that has debug symbols, 
> hence
> the weird "/home/ubuntu/glibc-2.35/build/elf/ldconfig" path. Perhaps it is
> clearer to change this to "/sbin/ldconfig.real", which is the system installed
> location?

Sure; I can update the example.

> 
> > fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0
> > ***[brk will go here at fffff7ffa000]***
> > fffff7ffc000-fffff7ffe000 r--p 00000000 00:00 0       [vvar]
> > fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0       [vdso]
> > fffffffdf000-1000000000000 rw-p 00000000 00:00 0      [stack]
> > 
> > After commit 0b3bc3354eb9 ("arm64: vdso: Switch to generic storage
> > implementation"), the arm64 vvar grew slightly, and suddenly the brk
> > collided with the allocation.
> > 
> > fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 
> > /home/ubuntu/glibc-2.35/build/elf/ldconfig
> > fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 
> > /home/ubuntu/glibc-2.35/build/elf/ldconfig
> > fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0
> > ***[oops, no room any more, vvar is at fffff7ffa000!]***
> > fffff7ffa000-fffff7ffe000 r--p 00000000 00:00 0       [vvar]
> > fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0       [vdso]
> > fffffffdf000-1000000000000 rw-p 00000000 00:00 0      [stack]
> > 
> > The solution is to unconditionally move the brk out of the mmap region
> > for static PIE binaries. Whether ASLR is enabled or not does not change if
> > there may be future mmap allocation collisions with a growing brk region.
> > 
> > Update memory layout comments (with kernel-doc headings), consolidate
> > the setting of mm->brk to later (it isn't needed early), move static PIE
> > brk out of mmap unconditionally, and make sure brk(2) knows to base brk
> > position off of mm->start_brk not mm->end_data no matter what the cause of
> > moving it is (via current->brk_randomized). (Though why isn't this always
> > just start_brk? More research is needed, but leave that alone for now.)
> > 
> > Reported-by: Ryan Roberts <[email protected]>
> > Closes: 
> > https://lore.kernel.org/lkml/[email protected]/
> > Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct 
> > loader exec")
> > Signed-off-by: Kees Cook <[email protected]>
> > ---
> > Cc: Catalin Marinas <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: <[email protected]>
> > Cc: <[email protected]>
> > ---
> >  fs/binfmt_elf.c | 67 +++++++++++++++++++++++++++++++------------------
> >  1 file changed, 43 insertions(+), 24 deletions(-)
> > 
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 584fa89bc877..26c87d076adb 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -830,6 +830,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> >     struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> >     struct elf_phdr *elf_property_phdata = NULL;
> >     unsigned long elf_brk;
> > +   bool brk_moved = false;
> >     int retval, i;
> >     unsigned long elf_entry;
> >     unsigned long e_entry;
> > @@ -1097,15 +1098,19 @@ static int load_elf_binary(struct linux_binprm 
> > *bprm)
> >                     /* Calculate any requested alignment. */
> >                     alignment = maximum_alignment(elf_phdata, 
> > elf_ex->e_phnum);
> >  
> > -                   /*
> > -                    * There are effectively two types of ET_DYN
> > -                    * binaries: programs (i.e. PIE: ET_DYN with PT_INTERP)
> > -                    * and loaders (ET_DYN without PT_INTERP, since they
> > -                    * _are_ the ELF interpreter). The loaders must
> > -                    * be loaded away from programs since the program
> > -                    * may otherwise collide with the loader (especially
> > -                    * for ET_EXEC which does not have a randomized
> > -                    * position). For example to handle invocations of
> > +                   /**
> > +                    * DOC: PIE handling
> > +                    *
> > +                    * There are effectively two types of ET_DYN ELF
> > +                    * binaries: programs (i.e. PIE: ET_DYN with
> > +                    * PT_INTERP) and loaders (i.e. static PIE: ET_DYN
> > +                    * without PT_INTERP, usually the ELF interpreter
> > +                    * itself). Loaders must be loaded away from programs
> > +                    * since the program may otherwise collide with the
> > +                    * loader (especially for ET_EXEC which does not have
> > +                    * a randomized position).
> > +                    *
> > +                    * For example, to handle invocations of
> >                      * "./ld.so someprog" to test out a new version of
> >                      * the loader, the subsequent program that the
> >                      * loader loads must avoid the loader itself, so
> > @@ -1118,6 +1123,9 @@ static int load_elf_binary(struct linux_binprm *bprm)
> >                      * ELF_ET_DYN_BASE and loaders are loaded into the
> >                      * independently randomized mmap region (0 load_bias
> >                      * without MAP_FIXED nor MAP_FIXED_NOREPLACE).
> > +                    *
> > +                    * See below for "brk" handling details, which is
> > +                    * also affected by program vs loader and ASLR.
> >                      */
> >                     if (interpreter) {
> >                             /* On ET_DYN with PT_INTERP, we do the ASLR. */
> > @@ -1234,8 +1242,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
> >     start_data += load_bias;
> >     end_data += load_bias;
> >  
> > -   current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
> > -
> >     if (interpreter) {
> >             elf_entry = load_elf_interp(interp_elf_ex,
> >                                         interpreter,
> > @@ -1291,27 +1297,40 @@ static int load_elf_binary(struct linux_binprm 
> > *bprm)
> >     mm->end_data = end_data;
> >     mm->start_stack = bprm->p;
> >  
> > -   if ((current->flags & PF_RANDOMIZE) && (snapshot_randomize_va_space > 
> > 1)) {
> > +   /**
> > +    * DOC: "brk" handling
> > +    *
> > +    * For architectures with ELF randomization, when executing a
> > +    * loader directly (i.e. static PIE: ET_DYN without PT_INTERP),
> > +    * move the brk area out of the mmap region and into the unused
> > +    * ELF_ET_DYN_BASE region. Since "brk" grows up it may collide
> > +    * early with the stack growing down or other regions being put
> > +    * into the mmap region by the kernel (e.g. vdso).
> > +    */
> > +   if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
> 
> Does this imply that this issue will persist for 
> !CONFIG_ARCH_HAS_ELF_RANDOMIZE
> arches?

Ah, hm, interesting point. I think those architectures are unlikely to
have static PIE binaries, though? ARCH_HAS_ELF_RANDOMIZE is currently
selected (some through ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT) for these
architectures:

arm
arm64
csky
loongarch
mips
parisc
powerpc
riscv
s390
x86

In the interest of changing as little as possible, I think I'd like to
stick with this being gated by CONFIG_ARCH_HAS_ELF_RANDOMIZE, since
those architectures, in theory, would be expecting brk to be moved, and
the others may not.

> 
> > +       elf_ex->e_type == ET_DYN && !interpreter) {
> > +           elf_brk = ELF_ET_DYN_BASE;
> > +           /* This counts as moving the brk, so let brk(2) know. */
> > +           brk_moved = true;
> 
> So you are now randomizing the brk regardless of the value of
> snapshot_randomize_va_space. I suggested this as a potential solution but was
> concerned about back-compat issues. See this code snippet from memory.c:

Well, the "randomize" is only happening if snapshot_randomize_va_space
is >1, but we are _moving_ the brk in this case, which is what the
brk(2) syscall wants to know about, and is what CONFIG_COMPAT_BRK tries
to deal with. So yes, there is a bit of a conflict. More below...

> 
> ----8<----
> /*
>  * Randomize the address space (stacks, mmaps, brk, etc.).
>  *
>  * ( When CONFIG_COMPAT_BRK=y we exclude brk from randomization,
>  *   as ancient (libc5 based) binaries can segfault. )
>  */
> int randomize_va_space __read_mostly =
> #ifdef CONFIG_COMPAT_BRK
>                                       1;
> #else
>                                       2;
> #endif
> ----8<----
> 
> This implies to me that this change is in danger of breaking libc5-based 
> binaries?

It's possible it could break running the loader directly against some
libc5-based binaries. If this turns out to be a real-world issue, we can
find a better solution (perhaps pre-allocating a large brk).

-Kees

> 
> Thanks,
> Ryan
> 
> > +   }
> > +   mm->start_brk = mm->brk = ELF_PAGEALIGN(elf_brk);
> > +
> > +   if ((current->flags & PF_RANDOMIZE) && snapshot_randomize_va_space > 1) 
> > {
> >             /*
> > -            * For architectures with ELF randomization, when executing
> > -            * a loader directly (i.e. no interpreter listed in ELF
> > -            * headers), move the brk area out of the mmap region
> > -            * (since it grows up, and may collide early with the stack
> > -            * growing down), and into the unused ELF_ET_DYN_BASE region.
> > +            * If we didn't move the brk to ELF_ET_DYN_BASE (above),
> > +            * leave a gap between .bss and brk.
> >              */
> > -           if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
> > -               elf_ex->e_type == ET_DYN && !interpreter) {
> > -                   mm->brk = mm->start_brk = ELF_ET_DYN_BASE;
> > -           } else {
> > -                   /* Otherwise leave a gap between .bss and brk. */
> > +           if (!brk_moved)
> >                     mm->brk = mm->start_brk = mm->brk + PAGE_SIZE;
> > -           }
> >  
> >             mm->brk = mm->start_brk = arch_randomize_brk(mm);
> > +           brk_moved = true;
> > +   }
> > +
> >  #ifdef compat_brk_randomized
> > +   if (brk_moved)
> >             current->brk_randomized = 1;
> >  #endif
> > -   }
> >  
> >     if (current->personality & MMAP_PAGE_ZERO) {
> >             /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
> 

-- 
Kees Cook

Reply via email to