Re: Using stock PIE executables from standard distributions?

2018-01-29 Thread Nadav Har'El
On Mon, Jan 29, 2018 at 9:02 AM, Nadav Har'El  wrote:

>
> On Mon, Jan 29, 2018 at 12:37 AM, Rick Payne  wrote:
>
>>
>>
>> > On 28 Jan 2018, at 22:23, Nadav Har'El  wrote:
>> >
>> > However, sadly, we still do have bugs with PIE support that need to be
>> fixed before running PIEs on OSv becomes a hassle-free experience:
>> >
>> > A bug which relatively-recently became relevant (as gcc changed) is
>> https://github.com/cloudius-systems/osv/issues/689, which prevents PIEs
>> which use getopt() with "optarg" from working.
>> > A harder bug is https://github.com/cloudius-systems/osv/issues/352
>> which I think is still partially relevant - I think we still have problems
>> with thread-local variables in PIEs (but not in shared libraries).
>> >
>> > Please check the PIE which interests you, and see if one of these bugs
>> affects you, or if there are any other bugs.
>> > Both the aforementioned bug reports contain also ideas on how to fix
>> them, if you're looking for
>>
>> Aha, and maybe my ERTS issue was another bug?
>>
>
> Did you compile it as -pie, -fpie? Or -shared -fPIC?
>
> Is your Erlang problem something reproducible using the apps/erlang
> somehow?
>
> I tried with gcc 7.2.1,
>  scripts/build -j5 image=erlang; scripts/run.py -V
> but get a different error:
>
> /usr/lib64/erlang/erts-7.0/bin/beam.smp: failed looking up symbol tgetent
>

Oh, this is yet another case of compiling with one version of the
library, then putting a different version in the image. I opened
https://github.com/cloudius-systems/osv/issues/943


>
> [backtrace]
> 0x0033a7cf 
> 0x0033d384 
> 0x0033d555 
>
> Something happened to the termcap library? Weird.
>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Page fault outside of application

2018-01-29 Thread Nadav Har'El
On Wed, Jan 24, 2018 at 11:07 AM, Rick Payne  wrote:

> Hi,
>
> On 23/01/18 20:16, Nadav Har'El wrote:
>
>> I don't have any bright ideas, but just a few small comments below,
>> hopefully (?) they will help something...
>>
>
> Appreciated...
>
>> This writes in "addr", which seems a reasonable address (doesn't seem like
>> junk).
>> In object::resolve_pltgot() you can see the addr is _base + slot.r_offset,
>> maybe you can print them and see with "nm"/"readelf" of the object being
>> loaded if this offset address makes sense (in the PLT section)?
>>
>
> So that made sense as far as I can see:
>
> (gdb)
> #9  0x00492c7b in elf::object::arch_relocate_jump_slot (
> this=0xa0010327b400, sym=1, addr=0x1aa0fe28, addend=0)
> at arch/x64/arch-elf.cc:109
> 109 *static_cast(addr) = symsym.relocated_addr();
> (gdb) p symsym.obj._base
> $1 = (void *) 0x0
> (gdb) up
> #10 0x003fdfd7 in elf::object::resolve_pltgot (
> this=0xa0010327b400, index=0) at core/elf.cc:692
> 692 if (!arch_relocate_jump_slot(sym, addr, slot.r_addend)) {
> (gdb) p slot.r_offset
> $2 = 2162216
> (gdb) p/x slot.r_offset
> $3 = 0x20fe28
> (gdb)
>
> $ readelf -a _build/default/rel/dbgp_webapi/erts-9.0.5/bin/erlexec | grep 20fe28
> 0020fe28  00010007 R_X86_64_JUMP_SLO getenv@GLIBC_2.2.5 + 0
>

This all seems reasonable.
Maybe we somehow got the PLT becoming read-only, so we are getting a
pagefault trying to write to it?
Can you please try in gdb "osv mmap" and look at the mapping which includes
the faulting address (0x1aa0fe28), is it read-write or read-only?



Re: Page fault outside of application

2018-01-29 Thread Rick Payne
On Mon, 2018-01-29 at 10:54 +0200, Nadav Har'El wrote:
> This all seems reasonable.
> Maybe we somehow got the PLT becoming read-only, so we are getting a
> pagefault trying to write to it?
> Can you please try in gdb "osv mmap" and look at the mapping which
> includes the faulting address (0x1aa0fe28), is it read-write or
> read-only?

New build, so a slightly different address, but in the same range (and
it's the same crash). I think you've nailed it though:

(gdb) up
#6  0x003c451c in mmu::vm_fault (addr=17592355974704,
ef=0x83d82068) at core/mmu.cc:1330
1330 vm_sigsegv(addr, ef);
(gdb) p/x addr
$1 = 0x1a20ee30

0x1a00 0x1a00f000 [60.0 kB] flags=fmF  perm=rx  offset=0x  path=/otp/erts-9.0.5/bin/erlexec
0x1a20e000 0x1a20f000 [4.0 kB] flags=fmF  perm=r  offset=0xe000  path=/otp/erts-9.0.5/bin/erlexec
0x1a20f000 0x1a21 [4.0 kB] flags=fmF  perm=rw  offset=0xf000  path=/otp/erts-9.0.5/bin/erlexec

That address is in the second segment, and thus marked 'r'. Is gcc7
doing something different that's incompatible with the ELF loader in
OSv? Related to the Intel fiasco?

Cheers,
Rick
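
The by-eye lookup above can be sketched as a small script: parse mapping lines of the shape shown in the paste and report the mapping that contains the faulting address. Two things here are assumptions rather than facts from the thread. The regular expression is only a guess at the line layout, inferred from this single paste. The sample addresses are written out in full on the assumption that the archive dropped runs of zeros from the hex values; the decimal addr=17592355974704 from the gdb frame does equal 0x10000a20ee30, which is what suggests this. Treat the sample as illustrative.

```python
import re

# Assumed line layout, inferred from the paste above:
#   <start> <end> [<size>] flags=<flags> perm=<perm> offset=<off> path=<path>
LINE_RE = re.compile(
    r"^(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+\[[^\]]*\]\s*"
    r"flags=(\S+)\s+perm=(\S+)"
)


def find_mapping(mmap_text, addr):
    """Return (start, end, perm) of the mapping containing addr, or None."""
    for line in mmap_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a mapping line
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        if start <= addr < end:
            return start, end, m.group(4)
    return None


# Illustrative sample: addresses are reconstructed, not copied verbatim.
SAMPLE = """\
0x10000a000000 0x10000a00f000 [60.0 kB] flags=fmF perm=rx offset=0x0 path=/otp/erts-9.0.5/bin/erlexec
0x10000a20e000 0x10000a20f000 [4.0 kB] flags=fmF perm=r offset=0xe000 path=/otp/erts-9.0.5/bin/erlexec
0x10000a20f000 0x10000a210000 [4.0 kB] flags=fmF perm=rw offset=0xf000 path=/otp/erts-9.0.5/bin/erlexec
"""

fault_addr = 17592355974704  # decimal value from the gdb frame above
hit = find_mapping(SAMPLE, fault_addr)
if hit:
    start, end, perm = hit
    print(f"{fault_addr:#x} is in [{start:#x}, {end:#x}) perm={perm}")
```

With the reconstructed sample, the address lands in the 4 kB mapping with perm=r, matching the reading given above.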



Re: Page fault outside of application

2018-01-29 Thread Nadav Har'El
On Mon, Jan 29, 2018 at 11:20 AM, Rick Payne  wrote:

> On Mon, 2018-01-29 at 10:54 +0200, Nadav Har'El wrote:
>
> This all seems reasonable.
> Maybe we somehow got the PLT becoming read-only, so we are getting a
> pagefault trying to write to it?
> Can you please try in gdb "osv mmap" and look at the mapping which
> includes the faulting address (0x1aa0fe28), is it read-write or
> read-only?
>
>
> New build, so a slightly different address, but in the same range (and it's
> the same crash). I think you've nailed it though:
>
> (gdb) up
> #6  0x003c451c in mmu::vm_fault (addr=17592355974704,
> ef=0x83d82068) at core/mmu.cc:1330
> 1330 vm_sigsegv(addr, ef);
> (gdb) p/x addr
> $1 = 0x1a20ee30
>
> 0x1a00 0x1a00f000 [60.0 kB] flags=fmF  perm=rx  offset=0x  path=/otp/erts-9.0.5/bin/erlexec
> 0x1a20e000 0x1a20f000 [4.0 kB] flags=fmF  perm=r  offset=0xe000  path=/otp/erts-9.0.5/bin/erlexec
> 0x1a20f000 0x1a21 [4.0 kB] flags=fmF  perm=rw  offset=0xf000  path=/otp/erts-9.0.5/bin/erlexec
>
> That address is in the second segment, and thus marked 'r'. Is gcc7 doing
> something different that's incompatible with the ELF loader in OSv? Related
> to the Intel fiasco?
>

Hmm, I don't know, I wasn't aware anything like that changed.
We usually change parts of the object marked by PT_GNU_RELRO to read-only
in object::fix_permissions(). I'm guessing (but didn't check) this is what
caused the read-only page you're seeing.
The compiler usually does NOT mark the .GOT.PLT section - for function
lookup - as RELRO, because this needs to be modified after startup, every
time a function is used for the first time.
Only when "-z now" is used during linking (DT_BIND_NOW object flag) do we
do all the function lookups on startup (see object::relocate_pltgot()), and
then it's ok that the .GOT.PLT is also marked RELRO and made read-only.

I'm *guessing* (with no evidence) that one of the following happened:
1. Your compiler defaults to "full relro" (-Wl,-z,now -Wl,-z,relro) but for
some reason object::relocate_pltgot() doesn't recognize the bind_now.
2. Somehow the loop in object::relocate_pltgot() missed some of the
functions - like getenv()
3. Something in the new compiler changed the meaning of PT_GNU_RELRO or
added other flags which confused object::fix_permissions() and caused it to
make a page read-only when it shouldn't have.
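
On guess 1, it may help to remember that an ELF object can ask for eager binding in more than one way. The sketch below lists the three encodings defined by the ELF gABI and glibc's elf.h: a DT_BIND_NOW entry, the DF_BIND_NOW bit in DT_FLAGS, and the DF_1_NOW bit in DT_FLAGS_1. Which of these OSv's loader actually checks has not been verified here. The function name and the (tag, value) list input are hypothetical, chosen only for illustration.

```python
# Dynamic-section tags and flag bits, as defined in the ELF gABI / <elf.h>.
DT_BIND_NOW = 24
DT_FLAGS = 30
DF_BIND_NOW = 0x8
DT_FLAGS_1 = 0x6FFFFFFB
DF_1_NOW = 0x1
DF_1_PIE = 0x08000000


def requests_bind_now(dynamic):
    """dynamic: iterable of (d_tag, d_val) pairs from the .dynamic section.

    Hypothetical helper: returns True if any of the three known
    encodings of "bind now" is present.
    """
    for tag, val in dynamic:
        if tag == DT_BIND_NOW:
            return True
        if tag == DT_FLAGS and val & DF_BIND_NOW:
            return True
        if tag == DT_FLAGS_1 and val & DF_1_NOW:
            return True
    return False


# Made-up entries for illustration: no DT_BIND_NOW tag at all,
# only the flag-bit forms.
example = [(DT_FLAGS, DF_BIND_NOW), (DT_FLAGS_1, DF_1_NOW | DF_1_PIE)]
print(requests_bind_now(example))
```

A loader that looked only for the DT_BIND_NOW tag would treat the example object as lazily bound, which is one way guess 1 could come about.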

Good luck (and thanks) on figuring this out,
Nadav.



Re: Page fault outside of application

2018-01-29 Thread Rick Payne
On Mon, 2018-01-29 at 11:43 +0200, Nadav Har'El wrote:
> 
> Hmm, I don't know, I wasn't aware anything like that changed.
> We usually change parts of the object marked by PT_GNU_RELRO to read-
> only in object::fix_permissions(), I'm guessing (but didn't check)
> this is what caused the read-only page you're seeing.

I'll take a look.

> The compiler usually does NOT mark the .GOT.PLT section - for
> function lookup - as RELRO, because this needs to be modified after
> startup, every time a function is used for the first time;

Maybe I'm not following. The GNU_RELRO sections look the same between
the 2 versions of erlexec. First one (-ubuntu17.10) fails, second one
is fine:

rickp@mo:~$ readelf --headers /usr/local/packages/OTP-20.0.5-OSv-ubuntu17.10/erts-9.0.5/bin/erlexec | grep -2 RELRO
  GNU_STACK  0x 0x 0x
             0x 0x  RW 0x10
  GNU_RELRO  0xebe8 0x0020ebe8 0x0020ebe8
             0x0418 0x0418  R  0x1

rickp@mo:~$ readelf --headers /usr/local/packages/OTP-20.0.5-OSv/erts-9.0.5/bin/erlexec | grep -2 RELRO
  GNU_STACK  0x 0x 0x
             0x 0x  RW 0x10
  GNU_RELRO  0xec08 0x0020ec08 0x0020ec08
             0x03f8 0x03f8  R  0x1

> Only when "-z now" is used during linking (DT_BIND_NOW object flag)
> do we do all the function lookups on startup (see
> object::relocate_pltgot()) and then, it's ok that the .GOT.PLT is
> also marked RELRO and made read-only.
> 
> I'm *guessing* (with no evidence) that one of the following happened:
> 1. Your compiler defaults to "full relro" (-Wl,-z,now -Wl,-z,relro)
> but for some reason object::relocate_pltgot() doesn't recognize the
> bind_now.

So there is definitely a difference in the binaries. In the one that
fails, getenv is defined like this, in the .rela.plt section:

0020ee30  00010007 R_X86_64_JUMP_SLO  getenv@GLIBC_2.2.5 + 0

But in the one that works, it's like this, in the .rela.dyn section:

0020ee28  00010006 R_X86_64_GLOB_DAT  getenv@GLIBC_2.2.5 + 0

I see LDFLAGS being set to '-pie' so I don't really understand why the
first one is a jump slot, vs what I'd expect (GLOB_DAT).
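
For reference, the Info column in those two lines packs the symbol index and the relocation type into one 64-bit value. The sketch below decodes it the way the ELF64_R_SYM and ELF64_R_TYPE macros do. The pasted Info values look shortened by the archive, so the inputs are an assumption: symbol index 1 combined with types 7 (R_X86_64_JUMP_SLOT) and 6 (R_X86_64_GLOB_DAT), which is what the type names in the paste imply. The describe() helper is hypothetical.

```python
# x86-64 relocation type numbers from the System V AMD64 psABI.
R_X86_64_GLOB_DAT = 6
R_X86_64_JUMP_SLOT = 7

TYPE_NAMES = {
    R_X86_64_GLOB_DAT: "R_X86_64_GLOB_DAT",
    R_X86_64_JUMP_SLOT: "R_X86_64_JUMP_SLOT",
}


def elf64_r_sym(r_info):
    """Symbol table index: the upper 32 bits of r_info."""
    return r_info >> 32


def elf64_r_type(r_info):
    """Relocation type: the lower 32 bits of r_info."""
    return r_info & 0xFFFFFFFF


def describe(r_info):
    """Hypothetical helper returning (symbol index, type name)."""
    r_type = elf64_r_type(r_info)
    return elf64_r_sym(r_info), TYPE_NAMES.get(r_type, f"type {r_type}")


# Assumed full Info values: symbol index 1, types 7 and 6.
print(describe((1 << 32) | R_X86_64_JUMP_SLOT))
print(describe((1 << 32) | R_X86_64_GLOB_DAT))
```

The practical difference is where the slot lives and when it is filled. A JUMP_SLOT entry normally sits in .got.plt and may be patched lazily on first call, while a GLOB_DAT entry sits in .got and is resolved at load time, before any RELRO protection is applied.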

> 2. Somehow the loop in object::relocate_pltgot() missed some of the
> functions - like getenv() 

I think it's suspicious that getenv() is the first thing to be fixed up,
so I suspect it's more fundamental.

> 3. Something in the new compiler changed the meaning of PT_GNU_RELRO
> or added other flags which confused object::fix_permissions() and
> caused it to make a page read-only when it shouldn't have.

Ok. I think I need to do some more reading on elf...

Rick



Re: Page fault outside of application

2018-01-29 Thread Nadav Har'El
On Mon, Jan 29, 2018 at 12:16 PM, Rick Payne  wrote:

>
> > Only when "-z now" is used during linking (DT_BIND_NOW object flag)
> > do we do all the function lookups on startup (see
> > object::relocate_pltgot()) and then, it's ok that the .GOT.PLT is
> > also marked RELRO and made read-only.
> >
> > I'm *guessing* (with no evidence) that one of the following happened:
> > 1. Your compiler defaults to "full relro" (-Wl,-z,now -Wl,-z,relro)
> > but for some reason object::relocate_pltgot() doesn't recognize the
> > bind_now.
>
> So there is definitely a difference in the binaries. In the one that
> fails, getenv is defined like this, in the .rela.plt section:
>
> 0020ee30  00010007 R_X86_64_JUMP_SLO  getenv@GLIBC_2.2.5 + 0
>
> But in the one that works, it's like this, in the .rela.dyn section:
>
> 0020ee28  00010006 R_X86_64_GLOB_DAT  getenv@GLIBC_2.2.5 + 0
>
> I see LDFLAGS being set to '-pie' so I don't really understand why the
> first one is a jump slot, vs what I'd expect (GLOB_DAT).
>

Both versions used "-pie", not "-shared"?

Nadav.



Re: Page fault outside of application

2018-01-29 Thread Rick Payne
On Mon, 2018-01-29 at 12:27 +0200, Nadav Har'El wrote:
> Both versions used "-pie", not "-shared"?

Should be, yes. It's exactly the same build setup and the Makefile shows
'-pie' for LDFLAGS.

I don't think gcc7.2 contains any of the -mindirect-branch changes, so
that's a red herring. I'll continue poking at this tomorrow (it's getting
late here).

Rick



Re: Page fault outside of application

2018-01-29 Thread Nadav Har'El
On Mon, Jan 29, 2018 at 12:16 PM, Rick Payne  wrote:

>
> Maybe I'm not following. The GNU_RELRO sections look the same between
> the 2 versions of erlexec. First one (-ubuntu17.10) fails, second one
> is fine:
>
> rickp@mo:~$ readelf --headers /usr/local/packages/OTP-20.0.5-OSv-ubuntu17.10/erts-9.0.5/bin/erlexec | grep -2 RELRO
>   GNU_STACK  0x 0x 0x
>              0x 0x  RW 0x10
>   GNU_RELRO  0xebe8 0x0020ebe8 0x0020ebe8
>              0x0418 0x0418  R  0x1
>
> rickp@mo:~$ readelf --headers /usr/local/packages/OTP-20.0.5-OSv/erts-9.0.5/bin/erlexec | grep -2 RELRO
>   GNU_STACK  0x 0x 0x
>              0x 0x  RW 0x10
>   GNU_RELRO  0xec08 0x0020ec08 0x0020ec08
>              0x03f8 0x03f8  R  0x1
>
> > Only when "-z now" is used during linking (DT_BIND_NOW object flag)
> > do we do all the function lookups on startup (see
> > object::relocate_pltgot()) and then, it's ok that the .GOT.PLT is
> > also marked RELRO and made read-only.
> >
> > I'm *guessing* (with no evidence) that one of the following happened:
> > 1. Your compiler defaults to "full relro" (-Wl,-z,now -Wl,-z,relro)
> > but for some reason object::relocate_pltgot() doesn't recognize the
> > bind_now.
>
> So there is definitely a difference in the binaries. In the one that
> fails, getenv is defined like this, in the .rela.plt section:
>
> 0020ee30  00010007 R_X86_64_JUMP_SLO  getenv@GLIBC_2.2.5 + 0
>

So this address, 0020ee30, is beyond the end of the GNU_RELRO section,
0x0020ebe8, so it should NOT have been made read-only.

Maybe we're making a mistake with page alignments... When these addresses
are translated to memory addresses,
they need to be on different pages. Maybe we're doing something wrong?

I don't remember now how these offsets are translated to memory addresses,
we should review that code.
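
A quick arithmetic check on the readelf numbers quoted above may be useful here. It is a sketch that takes the pasted columns at face value, reading them as readelf normally prints them: VirtAddr 0x20ebe8 and MemSiz 0x418 for the failing binary. Under that reading, 0x20ebe8 is the start of the RELRO segment rather than its end, the segment runs up to 0x20f000, and the getenv slot at 0x20ee30 falls inside it. The page-rounding policy in relro_page_range() (start rounded down, end rounded up) is an assumption for illustration and is not taken from OSv's code.

```python
PAGE_SIZE = 4096


def relro_page_range(p_vaddr, p_memsz, page_size=PAGE_SIZE):
    """Page-aligned [start, end) covering a RELRO segment.

    Rounds the start down and the end up. A loader could round
    differently; this is one possible policy, not OSv's code.
    """
    start = p_vaddr & ~(page_size - 1)
    end = (p_vaddr + p_memsz + page_size - 1) & ~(page_size - 1)
    return start, end


# Values as pasted from readelf for the failing erlexec.
relro_vaddr, relro_memsz = 0x20EBE8, 0x418
getenv_slot = 0x20EE30  # r_offset of the getenv JUMP_SLOT relocation

seg_end = relro_vaddr + relro_memsz
start, end = relro_page_range(relro_vaddr, relro_memsz)
print(f"segment [{relro_vaddr:#x}, {seg_end:#x})")
print(f"pages   [{start:#x}, {end:#x})")
print("slot inside segment:", relro_vaddr <= getenv_slot < seg_end)
```

If that reading is right, the page range comes out as 0x20e000 to 0x20f000 relative to the load base, which lines up with the 4 kB perm=r mapping in the earlier "osv mmap" paste. The same arithmetic on the working binary (0x20ec08 plus 0x3f8) also ends at 0x20f000, so the two binaries would differ in the relocation type used for getenv rather than in the extent of RELRO.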



Re: Page fault outside of application

2018-01-29 Thread Rick Payne
On Mon, 2018-01-29 at 11:43 +0200, Nadav Har'El wrote:
> 1. Your compiler defaults to "full relro" (-Wl,-z,now -Wl,-z,relro)
> but for some reason object::relocate_pltgot() doesn't recognize the
> bind_now.

FWIW, on both working and non-working builds, I see '-pie -z now -z
relro' being passed to the linker stage for erlexec. I see very little
difference between the two :(

Rick
