Re: L4Re: Identifying the source location of a program exception

Frank Mehnert Wed, 29 Jul 2020 00:50:23 -0700

Hi Paul,

On Mittwoch, 29. Juli 2020 01:33:32 CEST Paul Boddie wrote:
> Hello,
> 
> Returning to L4Re investigations with newer hardware (hosting qemu on the
> amd64 architecture), I have been trying to debug some of my hastily written
> code that undoubtedly has a race condition lurking in it. Unfortunately, I
> cannot remember the way of discovering how to identify the source
> instruction responsible (if I ever really knew it to begin with).
> 
> I get an error like this:
> 
> L4Re[rm]: unhandled read page fault at 0x38 pc=0x3a8bd9
> 
> Previous experiences suggested that I might not get much sense out of the
> program counter value, and it was also suggested to me that I might try and
> use the kernel debugger to show the thread execution state.


I want to encourage you to take the program counter value serious. The
message says that there was an access to the memory at address 0x38
(sounds like an access to offset 38 of an object where the object pointer
was not initialized) and the corresponding program counter in userland
s 0x3a8bd9. From that value I guess that your host is AMD64.

Now the question is of course: Which application triggered this exception?
If you know the answer then you should disassemble the corresponding binary
with

  objdump -ldC <filename> | less

and search for the program counter. If your binary was compiled with
debugging information, you will even see the source code around the
faulting instruction.

If your binary was not compiled with debugging information:

 1. If the application is compiled within the L4Re tree then use the
    binary from the package build directory because that one is not
    stripped, for example

      build-x86-64/pkg/hello/server/src/OBJ-amd64_gen-l4f/hello

    rather than

      build-x86-64/bin/amd64_gen/l4f/hello

    because the latter binary is stripped (i.e. contains no debugging
    information) if CONFIG_BID_STRIP_PROGS is set to 'y'.

 2. If you compiled the binary yourself, make sure to the the '-g' flag
    to the compiler options. For L4Re applications using the L4Re build
    infrastructure this is done automatically, see 1.

Next question: Is your binary linked statically or does it use dynamic
libraries? You can find this out by doing

  objdump -p <filename>

If the output contains at least one line with 'NEEDED' then your binary
uses dynamic libraries and looking for the program counter can be more
difficult if the fault happens in a dynamic library because the library
code is relocated to an unknown address when the library is loaded at
program start.

Therefore for debugging it's always advisable to use static linked
binaries. If your application uses the L4Re build infrastructure, set

  MODE = static

in the Makefile. If you use your own Makefile, make sure to add

  -static

to the linker flags.

Exploring your application binary is always the first advisable strategy
to such an exception.

> I dug up instructions to achieve this...
> 
> 1. Modify pkg/l4re-core/l4re/util/include/region_mapping_svr_2 and
>    pkg/l4re-core/l4re_kernel/server/src/region.cc to employ enter_kdebug
>    invocations.
> 
> 2. Add jdb = L4.Env.jdb in the capabilities section of the start or startv
>    invocation that invokes the task in the appropriate .cfg file.
> 
> 3. Run the system and wait for an exception to trigger the debugger.
> 
> 4. Use the t<enter> command sequence to see the thread state.
> 
> 5. Use <space> to enter the disassembly and see the supposedly problematic
>    instruction.
> 
> Unfortunately, this seems to add confusion. Firstly, I end up with a
> different location: ffffffff10455e98. Even if I assume that the actual
> location is 455e98, being somewhat near to the usual payload base address
> of 400000, it doesn't help me identify what the code is that resides there.

ffffffff10455e98 is a kernel address. NO: You cannot do any deduction from
that kernel address to the userland address.

The reason is the following: As your userland application generated an
exception or a page fault which the L4Re region mapper (L4Re[rm]) of that
application cannot serve, the region mapper does not reply anything to the
faulting thread. As a result, the resulting thread is waiting in the
exception or pagefault IPC and is no longer ready. Same for the region mapper
which is also not ready.

The 't<enter' command in the kernel debugger shows the TCB of the current
running thread. If no other thread in your setup is ready, that's most likely
the idle thread (if your setup has multiple CPUs, there are several idle
threads).

BUT: As you entered the kernel debugging through 'enter_kdebug()' which
you placed into the region mapper thread, the current active thread is
the region mapper.

As you now entered the TCB view, the cursor (the yellow-highlighted word)
shows at some program counter inside the kernel. Move the cursor down (use
the cursor keys) and go to the word which is related to the userland
instruction pointer. Look at the bottom left side when the cursor is placed
at the bottom line of the TCB stack: It should display 'tcb: Return frame:
IP'. THAT is the userland program counter of the thread you are looking for.

Remember: You are inspecting the region mapper thread which is != the
thread which triggered the exception! Therefore, if you press <space>
at the word marked as 'Return frame: IP', you will see the code for
'enter_kdebug()'. That doesn't help you.

Now use the 'lp' view to see the list of present threads in the system. The
cursor is placed at the current thread (the region mapper of your
application). Look around at threads with the same 'sp' value (sp = space,
the address space of the application). See this example:

   id  cpu    name             pr     sp  wait    to state
   20   0     hello             2     1c    1d       ready,rcv_wait
   1d   0     #hello           ff     1c             ready
    d   0     moe              ff      c     -       ready,rcv_wait
    b   0     sigma0            1      a     -       ready,rcv_wait
    9   1     -----             0      1             ready
    8   3     -----             0      1             ready
    7   2     -----             0      1             ready
    6   0     -----             0      1             ready

(this setup emulates 4 CPUs, thus there are 4 idle threads)

Thread '1d' is the region mapper thread of the hello application. 'hello'
has 2 threads, thread 1d and thread 20. Thread 20 is currently waiting
for an IPC from thread 1d. Therefore thread 20 is the one you want to
inspect. Go there and press enter. Then move the TCB stack cursor down
to 'Return frame: IP' as I told you before, see there:

thread  :  20 <0xffffffff104ec000>      CPU: 0:0        prio: 02
state   : 009 ready,rcv_wait
wait for:   1d  polling:        rcv descr: 00000000                     timeout 
 :
cpu time:   0                   timeslice: 10000/18446744073709551615 us
pager   : [C:   3] D:  1d       task     : D:  1c
exc-hndl: [C:   3] D:  1d       UTCB     : ffffffff10518400/b3000400
vCPU    : ---
vCPU    : ---


RAX=0000000000003003  RSI=----------------  movq    (%rdx), %rax
RBX=----------------  RDI=----------------  testq   %rax, %rax
RCX=0000000000002808  RBP=0000000000000004
RDX=0000000000002808  RSP=00007fff00201dc8  SS=0023
 R8=----------------   R9=----------------
R10=----------------  R11=----------------
R12=----------------  R13=----------------
R14=----------------  R15=----------------
in page fault, error 00002820 (user level registers)

    c80  ffffffff10518048 ffffffff104e0000 fffffffff004a3d1 0000000000000046 
fffffffff0041360 ffffffff104ec000
    cb0  ffffffff10591da0 fffffffffffe0002 ffffffffffffffff ffffffff104ec000 
ffffffff104e0000 0000000000000001
    ce0  ffffffff104e0200 0000000000000000 0000000000000001 fffffffff004c950 
ffffffff105183d8 000000000001d00e
    d10  000000000001d030 ffffffff10518210 ffffffff105183d8 000000000001d00e 
000000000001d030 ffffffff10518210
    d40  ffffffff105183d8 0000000000000000 000000000001d030 0000000000000000 
ffffffff104ede90 0000000000000001
    d70  0000000000000004 0000000000000001 0000000000000000 fffffffff004e6b4 
0000000000000000 ffffffff104ede90
    da0  0000000000000003 0000000000000000 0000000000000000 0000000000000000 
fffffffff0041377 0000000000000000
    dd0  0000000040050047 fffffffffffe0002 ffffffff104ec000 ffffffff104e0000 
0000000000000001 00000000b3000400
    e00  ffffffff104ec000 0000000070005753 0000000000000004 fffffffff0094348 
0000000000000000 fffffffff004977e
    e30  0000000000000000 0000000000000200 0000000000000000 0000000000000000 
0000000000002000 0000000000000000
    e60  0000000000000000 0000000000000000 0000000000000000 ffffffff104edf88 
0000000000000001 00007fff00201db0
    e90  0000000000000001 0000000000000000 fffffffff004e6b4 0000000000000000 
ffffffff104edf88 0000000000003003
    ec0  0000000000000000 fffffffffffe0002 0000000000000000 fffffffff0098320 
ffffffff104ec000 0000000000000004
    ef0  0000000000002808 0000000070005753 0000000000000000 fffffffff0048757 
ffffffff104ec000 000000000001e000
    f20  0000000000001000 00007fff00201db0 000000000000000c 000000000001e000 
fffffffff0002337 0000000000001120
    f50  00007fff00201e98 000000007001e3c0 0000000080000000 0000000000010000 
0000000000000000 fffffffff00029b6
    f80  0000000000002808 0000000000000246 00000000b3000400 000000000001e000 
0000000000000018 0000000000002808
104edfd8 0000000000002820 0000000000002808 0000000000002808 0000000000003003 
0000000000000004 [0000000070005753]
    fe0  000000000000002b 0000000000000287 00007fff00201dc8 0000000000000023 
---------------- ----------------

 ...

tcb: Return frame: IP                                                           
                 <CR>=dump <Space>=Disas

I marked the highlighted word as [0000000070005753], in your terminal it
would be yellow. 

NOW press <space> and you will see the disassembled code of the faulting
instruction.

> Normally, I would attempt to use addr2line, nm or objdump to provide some
> kind of map, but I cannot see any correspondence between their output and
> these values. I also considered, since my code is dynamically linked, that
> I might need to get the linker to tell me a bit more about where it is
> positioning library code in memory. This was suggested in another context a
> couple of years ago:
> 
> http://os.inf.tu-dresden.de/pipermail/l4-hackers/2018/008274.html
> 
> However, the only vaguely pertinent output I can find is this:
> 
> _dl_protect_relro:124: RELRO protecting rom/libuc_c.so:
>   start:0x3a6000, end:0x3a7000
> 
> _dl_protect_relro:124: RELRO protecting rom/libdl.so:
>   start:0x3bf000, end:0x3c0000
> 
> If the 3a8bd9 value is meaningful, it evidently refers to something between
> these two regions, if these regions are valid. I thought that there was a
> more coherent summary of loaded objects, but I cannot find any details of
> that now.
> 
> I imagine that there must be a simpler way of getting back to the source
> from exception addresses and would greatly appreciate being told (or
> reminded) what that is!

The above addresses are most likely NOT related to your faulting instruction.
The .so libraries are dynamic libraries and that code is loaded to an offset
into the userland address space.

In any case you should use a static linked application for debugging.

Kind regards,

Frank
-- 
Dr.-Ing. Frank Mehnert, frank.mehn...@kernkonzept.com, +49-351-41 883 224

Kernkonzept GmbH.  Sitz: Dresden.  Amtsgericht Dresden, HRB 31129.
Geschäftsführer: Dr.-Ing. Michael Hohmuth



_______________________________________________
l4-hackers mailing list
l4-hackers@os.inf.tu-dresden.de
http://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers

Re: L4Re: Identifying the source location of a program exception

Reply via email to