Hi Paul,

On Wednesday, 29 July 2020 01:33:32 CEST Paul Boddie wrote:
> Hello,
>
> Returning to L4Re investigations with newer hardware (hosting qemu on the
> amd64 architecture), I have been trying to debug some of my hastily written
> code that undoubtedly has a race condition lurking in it. Unfortunately, I
> cannot remember the way of discovering how to identify the source
> instruction responsible (if I ever really knew it to begin with).
>
> I get an error like this:
>
> L4Re[rm]: unhandled read page fault at 0x38 pc=0x3a8bd9
>
> Previous experiences suggested that I might not get much sense out of the
> program counter value, and it was also suggested to me that I might try and
> use the kernel debugger to show the thread execution state.

I want to encourage you to take the program counter value seriously. The
message says that there was an access to the memory at address 0x38 (which
sounds like an access to offset 0x38 of an object whose pointer was not
initialized) and that the corresponding program counter in userland is
0x3a8bd9. From that value I guess that your host is AMD64.

Now the question is of course: Which application triggered this exception?
If you know the answer, then you should disassemble the corresponding binary
with

  objdump -ldC <filename> | less

and search for the program counter. If your binary was compiled with
debugging information, you will even see the source file names and line
numbers around the faulting instruction (add -S to interleave the source
code itself).

If your binary was not compiled with debugging information:

1. If the application is compiled within the L4Re tree, then use the binary
   from the package build directory because that one is not stripped, for
   example

     build-x86-64/pkg/hello/server/src/OBJ-amd64_gen-l4f/hello

   rather than

     build-x86-64/bin/amd64_gen/l4f/hello

   because the latter binary is stripped (i.e. contains no debugging
   information) if CONFIG_BID_STRIP_PROGS is set to 'y'.

2. If you compiled the binary yourself, make sure to add the '-g' flag to
   the compiler options. For L4Re applications using the L4Re build
   infrastructure this is done automatically, see 1.

Next question: Is your binary linked statically or does it use dynamic
libraries? You can find this out by doing

  objdump -p <filename>

If the output contains at least one line with 'NEEDED', then your binary uses
dynamic libraries, and looking for the program counter can be more difficult
if the fault happens in a dynamic library, because the library code is
relocated to an address that is not known in advance when the library is
loaded at program start. Therefore, for debugging it is always advisable to
use statically linked binaries. If your application uses the L4Re build
infrastructure, set

  MODE = static

in the Makefile. If you use your own Makefile, make sure to add -static to
the linker flags.

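The lookup described above can be sketched as a small host-side helper. This is only an illustration under stated assumptions, not part of L4Re: the function names (parse_fault, is_dynamic, lookup_commands) are made up for this example, the regular expression is modelled on the single message quoted in this thread, and the grep context sizes are arbitrary. The helper only builds command lines for objdump and addr2line, it does not run them.

```python
import re
import shlex

# Modelled on the one message quoted in this thread:
#   L4Re[rm]: unhandled read page fault at 0x38 pc=0x3a8bd9
# Other message variants may exist and are not covered (assumption).
FAULT_RE = re.compile(
    r"unhandled (?P<kind>\w+) page fault "
    r"at 0x(?P<addr>[0-9a-fA-F]+) pc=0x(?P<pc>[0-9a-fA-F]+)"
)


def parse_fault(line):
    """Return (kind, fault_address, pc) or None if the line does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return m.group("kind"), int(m.group("addr"), 16), int(m.group("pc"), 16)


def is_dynamic(objdump_p_output):
    """True if the text printed by `objdump -p` has at least one NEEDED line."""
    return any(
        line.split()[:1] == ["NEEDED"]
        for line in objdump_p_output.splitlines()
    )


def lookup_commands(binary, pc):
    """Build (but do not run) commands that locate pc in an unstripped binary."""
    q = shlex.quote(binary)
    return [
        # Disassembly lines start with the address followed by a colon.
        f"objdump -ldC {q} | grep -n -B8 -A2 '^ *{pc:x}:'",
        # addr2line maps the address to function, file and line.
        f"addr2line -e {q} -f -C -i 0x{pc:x}",
    ]


if __name__ == "__main__":
    fault = parse_fault("L4Re[rm]: unhandled read page fault at 0x38 pc=0x3a8bd9")
    if fault is not None:
        kind, addr, pc = fault
        for cmd in lookup_commands("hello", pc):
            print(cmd)
```

Both commands only give useful results for the unstripped binary, and for a statically linked one the program counter can be used as it is.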
Exploring your application binary is always the first advisable strategy
when you hit such an exception.

> I dug up instructions to achieve this...
>
> 1. Modify pkg/l4re-core/l4re/util/include/region_mapping_svr_2 and
> pkg/l4re-core/l4re_kernel/server/src/region.cc to employ enter_kdebug
> invocations.
>
> 2. Add jdb = L4.Env.jdb in the capabilities section of the start or startv
> invocation that invokes the task in the appropriate .cfg file.
>
> 3. Run the system and wait for an exception to trigger the debugger.
>
> 4. Use the t<enter> command sequence to see the thread state.
>
> 5. Use <space> to enter the disassembly and see the supposedly problematic
> instruction.
>
> Unfortunately, this seems to add confusion. Firstly, I end up with a
> different location: ffffffff10455e98. Even if I assume that the actual
> location is 455e98, being somewhat near to the usual payload base address
> of 400000, it doesn't help me identify what the code is that resides there.

ffffffff10455e98 is a kernel address. NO: You cannot deduce the userland
address from that kernel address. The reason is the following:

As your userland application generated an exception or a page fault which
the L4Re region mapper (L4Re[rm]) of that application cannot serve, the
region mapper does not reply anything to the faulting thread. As a result,
the faulting thread is waiting in the exception or page fault IPC and is no
longer ready. The same holds for the region mapper, which is also not ready.

The 't<enter>' command in the kernel debugger shows the TCB of the currently
running thread. If no other thread in your setup is ready, that's most likely
the idle thread (if your setup has multiple CPUs, there are several idle
threads). BUT: As you entered the kernel debugger through 'enter_kdebug()',
which you placed into the region mapper thread, the currently active thread
is the region mapper.

Now that you have entered the TCB view, the cursor (the yellow-highlighted
word) points at some program counter inside the kernel. Move the cursor down
(use the cursor keys) and go to the word which is related to the userland
instruction pointer. Look at the bottom left side when the cursor is placed
at the bottom line of the TCB stack: It should display
'tcb: Return frame: IP'. THAT is the userland program counter of the thread
you are inspecting.

Remember: You are inspecting the region mapper thread, which is NOT the
thread which triggered the exception! Therefore, if you press <space> at the
word marked as 'Return frame: IP', you will see the code for
'enter_kdebug()'. That doesn't help you.

Now use the 'lp' view to see the list of present threads in the system. The
cursor is placed at the current thread (the region mapper of your
application). Look around for threads with the same 'sp' value (sp = space,
the address space of the application). See this example:

  id  cpu  name    pr  sp  wait  to  state
  20  0    hello   2   1c  1d        ready,rcv_wait
  1d  0    #hello  ff  1c            ready
  d   0    moe     ff  c   -         ready,rcv_wait
  b   0    sigma0  1   a   -         ready,rcv_wait
  9   1    -----   0   1             ready
  8   3    -----   0   1             ready
  7   2    -----   0   1             ready
  6   0    -----   0   1             ready

(this setup emulates 4 CPUs, thus there are 4 idle threads)

Thread '1d' is the region mapper thread of the hello application. 'hello'
has 2 threads, thread 1d and thread 20. Thread 20 is currently waiting for an
IPC from thread 1d. Therefore thread 20 is the one you want to inspect. Go
there and press enter.

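The selection rule just described (same 'sp' as the region mapper, and waiting for the region mapper) can be written down as a short sketch that works on a copied 'lp' listing. Everything here is an assumption made for illustration: the names Thread, parse_lp and fault_candidates are hypothetical, the parser only handles the whitespace-separated layout of the example above, and it assumes the 'to' column is empty as it is there.

```python
import re
from typing import List, NamedTuple, Optional

# The example listing from this mail, copied as plain text.
LP_EXAMPLE = """\
id cpu name pr sp wait to state
20 0 hello 2 1c 1d ready,rcv_wait
1d 0 #hello ff 1c ready
d 0 moe ff c - ready,rcv_wait
b 0 sigma0 1 a - ready,rcv_wait
9 1 ----- 0 1 ready
8 3 ----- 0 1 ready
7 2 ----- 0 1 ready
6 0 ----- 0 1 ready
"""


class Thread(NamedTuple):
    tid: str
    cpu: str
    name: str
    prio: str
    space: str
    wait: Optional[str]  # id of the thread this one waits for, if any
    state: str


def parse_lp(text: str) -> List[Thread]:
    """Parse an 'lp' listing laid out like LP_EXAMPLE (assumed layout)."""
    threads = []
    for line in text.splitlines():
        f = line.split()
        # Skip the header and anything that does not look like a thread row.
        if len(f) < 6 or not re.fullmatch(r"[0-9a-f]+", f[0]) or not f[1].isdigit():
            continue
        # A seventh field is taken to be the 'wait' column; '-' means none.
        wait = f[5] if len(f) >= 7 and f[5] != "-" else None
        threads.append(Thread(f[0], f[1], f[2], f[3], f[4], wait, f[-1]))
    return threads


def fault_candidates(threads: List[Thread], rm_id: str) -> List[Thread]:
    """Threads in the region mapper's address space that wait for it."""
    rm = next(t for t in threads if t.tid == rm_id)
    return [t for t in threads if t.space == rm.space and t.wait == rm_id]


if __name__ == "__main__":
    for t in fault_candidates(parse_lp(LP_EXAMPLE), "1d"):
        print("inspect thread", t.tid, "named", t.name)
```

For the listing above and region mapper '1d', the only candidate is thread 20, which matches the manual reasoning.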
Then move the TCB stack cursor down to 'Return frame: IP' as described
before; see here:

  thread  : 20 <0xffffffff104ec000>  CPU: 0:0  prio: 02
  state   : 009 ready,rcv_wait
  wait for: 1d  polling:  rcv descr: 00000000
  timeout :  cpu time: 0 timeslice: 10000/18446744073709551615 us
  pager   : [C: 3] D: 1d  task    : D: 1c
  exc-hndl: [C: 3] D: 1d  UTCB    : ffffffff10518400/b3000400
  vCPU    : ---
  vCPU    : ---
  RAX=0000000000003003 RSI=----------------  movq (%rdx), %rax
  RBX=---------------- RDI=----------------  testq %rax, %rax
  RCX=0000000000002808 RBP=0000000000000004
  RDX=0000000000002808 RSP=00007fff00201dc8 SS=0023
   R8=----------------  R9=----------------
  R10=---------------- R11=----------------
  R12=---------------- R13=----------------
  R14=---------------- R15=----------------
  in page fault, error 00002820 (user level registers)

  c80 ffffffff10518048 ffffffff104e0000 fffffffff004a3d1 0000000000000046 fffffffff0041360 ffffffff104ec000
  cb0 ffffffff10591da0 fffffffffffe0002 ffffffffffffffff ffffffff104ec000 ffffffff104e0000 0000000000000001
  ce0 ffffffff104e0200 0000000000000000 0000000000000001 fffffffff004c950 ffffffff105183d8 000000000001d00e
  d10 000000000001d030 ffffffff10518210 ffffffff105183d8 000000000001d00e 000000000001d030 ffffffff10518210
  d40 ffffffff105183d8 0000000000000000 000000000001d030 0000000000000000 ffffffff104ede90 0000000000000001
  d70 0000000000000004 0000000000000001 0000000000000000 fffffffff004e6b4 0000000000000000 ffffffff104ede90
  da0 0000000000000003 0000000000000000 0000000000000000 0000000000000000 fffffffff0041377 0000000000000000
  dd0 0000000040050047 fffffffffffe0002 ffffffff104ec000 ffffffff104e0000 0000000000000001 00000000b3000400
  e00 ffffffff104ec000 0000000070005753 0000000000000004 fffffffff0094348 0000000000000000 fffffffff004977e
  e30 0000000000000000 0000000000000200 0000000000000000 0000000000000000 0000000000002000 0000000000000000
  e60 0000000000000000 0000000000000000 0000000000000000 ffffffff104edf88 0000000000000001 00007fff00201db0
  e90 0000000000000001 0000000000000000 fffffffff004e6b4 0000000000000000 ffffffff104edf88 0000000000003003
  ec0 0000000000000000 fffffffffffe0002 0000000000000000 fffffffff0098320 ffffffff104ec000 0000000000000004
  ef0 0000000000002808 0000000070005753 0000000000000000 fffffffff0048757 ffffffff104ec000 000000000001e000
  f20 0000000000001000 00007fff00201db0 000000000000000c 000000000001e000 fffffffff0002337 0000000000001120
  f50 00007fff00201e98 000000007001e3c0 0000000080000000 0000000000010000 0000000000000000 fffffffff00029b6
  f80 0000000000002808 0000000000000246 00000000b3000400 000000000001e000 0000000000000018 0000000000002808
  104edfd8 0000000000002820 0000000000002808 0000000000002808 0000000000003003 0000000000000004 [0000000070005753]
  fe0 000000000000002b 0000000000000287 00007fff00201dc8 0000000000000023 ---------------- ----------------
  ...
  tcb: Return frame: IP  <CR>=dump <Space>=Disas

I marked the highlighted word as [0000000070005753]; in your terminal it
would be yellow. NOW press <space> and you will see the disassembled code of
the faulting instruction.

> Normally, I would attempt to use addr2line, nm or objdump to provide some
> kind of map, but I cannot see any correspondence between their output and
> these values. I also considered, since my code is dynamically linked, that
> I might need to get the linker to tell me a bit more about where it is
> positioning library code in memory. This was suggested in another context a
> couple of years ago:
>
> http://os.inf.tu-dresden.de/pipermail/l4-hackers/2018/008274.html
>
> However, the only vaguely pertinent output I can find is this:
>
> _dl_protect_relro:124: RELRO protecting rom/libuc_c.so:
> start:0x3a6000, end:0x3a7000
>
> _dl_protect_relro:124: RELRO protecting rom/libdl.so:
> start:0x3bf000, end:0x3c0000
>
> If the 3a8bd9 value is meaningful, it evidently refers to something between
> these two regions, if these regions are valid.
> I thought that there was a more coherent summary of loaded objects, but I
> cannot find any details of that now.
>
> I imagine that there must be a simpler way of getting back to the source
> from exception addresses and would greatly appreciate being told (or
> reminded) what that is!

The above addresses are most likely NOT related to your faulting instruction.
The .so libraries are dynamic libraries, and their code is loaded at an
offset in the userland address space. In any case, you should use a
statically linked application for debugging.

Kind regards,

Frank
--
Dr.-Ing. Frank Mehnert, frank.mehn...@kernkonzept.com, +49-351-41 883 224

Kernkonzept GmbH. Registered office: Dresden. Local court (Amtsgericht)
Dresden, HRB 31129. Managing director: Dr.-Ing. Michael Hohmuth

_______________________________________________
l4-hackers mailing list
l4-hackers@os.inf.tu-dresden.de
http://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers