Hi Tao, Thanks for digging through the code with me. I'm not very confident or familiar with it, so I don't have 100% confident answers, but I'll do my best to answer:
Tao Liu <l...@redhat.com> writes: > Hi Stephen, > > On Fri, Apr 4, 2025 at 11:43 AM Stephen Brennan > <stephen.s.bren...@oracle.com> wrote: >> >> Stephen Brennan <stephen.s.bren...@oracle.com> writes: >> > Tao Liu <l...@redhat.com> writes: >> >> Hi Stephen, >> >> >> >> Thanks for reporting the issue and patch. Please check if I understood >> >> you correctly. The correct output you are expecting in your case is: >> >> >> >> crash> sym ata_dummy_port_ops >> >> ffffffffc0a71580 (?) ata_dummy_port_ops [libata] >> >> crash> mod -s libata >> >> MODULE NAME TEXT_BASE SIZE >> >> OBJECT FILE >> >> ffffffffc0a7b640 libata ffffffffc0a47000 520192 >> >> >> >> /usr/lib/debug/lib/modules/6.12.0-0.11.8.el9uek.x86_64/kernel/drivers/ata/libata.ko.debug >> >> crash> sym ata_dummy_port_ops >> >> ffffffffc0a71580 (B) ata_dummy_port_ops [libata] >> >> crash> p/x &ata_dummy_port_ops >> >> $1 = 0xffffffffc0a6fe80 <<-------- should be 0xffffffffc0a71580, >> >> same as sym's output, right? >> > >> > Yes, that's correct. >> > >> >> If that is the case, then after applied your patch, the issue still >> >> exists on my machine: >> >> >> >> crash> sym fuse_miscdevice >> >> ffffffffc05f7fe0 (?) fuse_miscdevice [fuse] >> >> crash> mod -s fuse >> >> MODULE NAME TEXT_BASE >> >> SIZE OBJECT FILE >> >> ffffffffc05f8dc0 fuse ffffffffc05da000 >> >> 233472 >> >> /usr/lib/debug/lib/modules/6.11.5-300.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug >> >> crash> sym fuse_miscdevice >> >> ffffffffc05f7fe0 (b) fuse_miscdevice [fuse] >> >> crash> p/x &fuse_miscdevice >> >> $1 = 0xffffffffc05daf20 << ------ unmatch with the previous value. >> > >> > Indeed, I checked with fuse on Fedora (6.11.4-301.fc41.x86_64, so almost >> > the same version). And I see the same behavior with fuse_miscdevice. >> > >> > In your case and mine, the offset is 0x1d0c0, which could be a clue to >> > how this mismatch happens. >> > >> >> The .data section of fuse.ko.debug have non-zero address: >> >> >> >> $ readelf -S >> >> /usr/lib/debug/lib/modules/6.11.5-300.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug >> >> Section Headers: >> >> [Nr] Name Type Address Offset >> >> Size EntSize Flags Link Info Align >> >> ... >> >> [50] .data NOBITS 0000000000000e20 00000100 >> >> 000000000000080f 0000000000000000 WA 0 0 32 >> >> >> >> Could you please re-check your patch for this, or is there something >> >> I'm missing? >> > >> > From what I can tell, the important difference here may be that the >> > symbol you selected is a static variable (which becomes a local ELF >> > symbol). Apparently, my patch only resolves the issue for *global >> > variables*, but not for local variables. >> > >> > $ eu-readelf -s /usr/lib/debug/lib/modules/$(uname >> > -r)/kernel/fs/fuse/fuse.ko.debug | grep -P "fuse_(miscdevice|mutex)" >> > 85: 0000000000000100 80 OBJECT LOCAL DEFAULT 50 >> > fuse_miscdevice >> > 1041: 00000000000005c0 32 OBJECT GLOBAL DEFAULT 50 fuse_mutex >> > >> > Here is the behavior of crash's master branch on these two symbols: >> > >> > crash> sym fuse_miscdevice >> > ffffffffc02fafe0 (?) fuse_miscdevice [fuse] >> > crash> sym fuse_mutex >> > ffffffffc02fb4a0 (?) fuse_mutex [fuse] >> > crash> mod -s fuse >> > MODULE NAME TEXT_BASE >> > SIZE OBJECT FILE >> > ffffffffc02fbdc0 fuse ffffffffc02dd000 >> > 233472 >> > /usr/lib/debug/lib/modules/6.11.4-301.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug >> > crash> sym fuse_miscdevice >> > ffffffffc02fafe0 (b) fuse_miscdevice [fuse] >> > crash> sym fuse_mutex >> > ffffffffc02fb4a0 (B) fuse_mutex [fuse] >> > crash> p/x &fuse_miscdevice >> > $1 = 0xffffffffc02ddf20 >> > crash> p/x &fuse_mutex >> > $2 = 0xffffffffc02fa680 >> > >> > Both fuse_miscdevice, and fuse_mutex are mismatched (with different >> > offsets). And now, here's the behavior of crash with my patch: >> > >> > crash> sym fuse_miscdevice >> > ffffffffc02fafe0 (?) fuse_miscdevice [fuse] >> > crash> sym fuse_mutex >> > ffffffffc02fb4a0 (?) fuse_mutex [fuse] >> > crash> mod -s fuse >> > MODULE NAME TEXT_BASE >> > SIZE OBJECT FILE >> > ffffffffc02fbdc0 fuse ffffffffc02dd000 >> > 233472 >> > /usr/lib/debug/lib/modules/6.11.4-301.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug >> > crash> sym fuse_miscdevice >> > ffffffffc02fafe0 (b) fuse_miscdevice [fuse] >> > crash> sym fuse_mutex >> > ffffffffc02fb4a0 (B) fuse_mutex [fuse] >> > crash> p/x &fuse_miscdevice >> > $1 = 0xffffffffc02ddf20 >> > crash> p/x &fuse_mutex >> > $2 = 0xffffffffc02fb4a0 >> > >> > Here the global fuse_mutex is correct, but the static/local >> > fuse_miscdevice is still incorrect. This indicates that there's more yet >> > to fix. (But at least my patch does part of the work!) >> > >> > I'm wondering whether this has anything to do with ELF relocations. >> >> Hello Tao, > > Thanks for your detailed debug info, which is quite helpful! For your > No.1 patch which fixes the global variable relocation, it looks good. > However, for your No.2 patch in the attached file, I have some > questions. >> >> I've continued debugging and verified that the root cause for this was >> GDB performing relocations to the DWARF info incorrectly. At the bottom > > 1) Is this a gdb bug? If it does, then the proper way is to post No.2 > patch to gdb mailing list and get it fixed there, then we can do gdb > patch backporting to crash. Frankly I haven't figured out whether it > is gdb or crash bug so far by myself... I'm also not sure whether it's a GDB or Crash bug. Honestly, I think the "bug" is in Linux for producing non-zero section addresses in a relocatable file. It seems to be universal that section addresses are zero for relocatble ELF files. > 2) I'm quoting part of the commit log in your No.2 patch has follows > (Please don't send patch as attachment next time, because I cannot > comment directly on it) Understood, sorry. > ... > 1. In default_symfile_offsets(), GDB detects the nonzero sh_addr field > and fails to apply the user-supplied section offsets. The result is > that later, in symfile_relocate_debug_section(), relocations are > applied to the DWARF info using the wrong section addresses, which > > Didn't we have corrected the .data section address in No.1 patch of > the .ko? Then why the "wrong section addresses"? The flow is like this: 1. crash executes "add-symbol-file [...] -s .data ADDRESS" 2. add_symbol_file_command() parses these and stores them in a list 3. We eventually make our way down to syms_from_objfile_1(), which seems to be the main logic. a. addr_info_make_relative() takes this list of section offsets, along with the BFD file, and matches each one against the BFD section. The function description indicates that the purpose is to make the addresses provided be relative to the addresses in the BFD section, which I suppose is why this number is subtracted. b. Then, default_symfile_offsets() is called from syms_from_objfile_1(). This function only cares about ET_REL relocatable files. It verifies that all sections have a zero sh_addr, and assuming that is so, it updates the load address and sets various fields on the BFD file to allow relocations to work correctly. 4. Relocations get applied using the values set in 3.b. Since the values in 3.b were tampered with in 3.a, relocations that are based on the address of the affected address are done incorrectly. I'm still not certain who is at fault, though! > results in invalid addresses for variables. Clearly, this has > happened before, because crash has special-cased the "__ksymtab*" > > Which line of code do you mean the special-cased __ksymtab? I didn't > find one in gdb/symfile.c. It is this part of the diff. I've removed it in my commit and also removed the check for nen-zero section addresses. --- gdb-16.2/gdb/symfile.c.orig +++ gdb-16.2/gdb/symfile.c @@ -633,7 +633,26 @@ default_symfile_offsets (struct objfile *objfile, for (cur_sec = abfd->sections; cur_sec != NULL; cur_sec = cur_sec->next) /* We do not expect this to happen; just skip this step if the relocatable file has a section with an assigned VMA. */ - if (bfd_section_vma (cur_sec) != 0) + if (bfd_section_vma (cur_sec) != 0 + /* + * Kernel modules may have some non-zero VMAs, i.e., like the + * __ksymtab and __ksymtab_gpl sections in this example: + * + * Section Headers: + * [Nr] Name Type Address Offset + * Size EntSize Flags Link Info Align + * ... + * [ 8] __ksymtab PROGBITS 0000000000000060 0000ad90 + * 0000000000000010 0000000000000000 A 0 0 16 + * [ 9] .rela__ksymtab RELA 0000000000000000 0000ada0 + * 0000000000000030 0000000000000018 43 8 8 + * [10] __ksymtab_gpl PROGBITS 0000000000000070 0000add0 + * 00000000000001a0 0000000000000000 A 0 0 16 + * ... + * + * but they should be treated as if they are NULL. + */ + && strncmp (bfd_section_name (cur_sec), "__k", 3) != 0) break; if (cur_sec == NULL) > section names to avoid this condition. To resolve this, we simply > drop the check for nonzero sh_addr altogether: the only ET_REL files > we encounter should be kernel modules, so there's no real reason to > be picky. > > 2. Even with that fixed, the user-supplied section addresses are > clobbered by the addr_info_make_relative(), which subtracts out > section offsets during its operation. To resolve this, undo the > operation for ET_REL files where a section address was provided by > the user (i.e. crash). > ... > > According to your description, it seems to me the gdb cannot handle > the nonzero sh_addr in the .ko case, but .ko isn't the only one whose > .data VMA is non-zero, see this case: > > $ readelf -S /lib64/libc.so.6 > ... > [30] .data PROGBITS 00000000001e8000 001e8000 > 00000000000016c8 0000000000000000 WA 0 0 32 > ... > > I don't know if gdb also fails in the userspace .so case? I mean, if > gdb can handle the user case with no problem, then probably we don't > need to hack gdb like this. One important difference is that libc.so.6 is a ET_DYN dynamic object file. A key difference between dynamic executables and relocatable files is the symbol table: symbols in dynamic executables have an absolute virtual address, and they only change based on the load address. Symbols in relocatable files have an address which is an offset from the beginning of their section. So if a relocation is done based on the section address, and the section address is incorrect, then the relocation will be wrong. Stephen -- Crash-utility mailing list -- devel@lists.crash-utility.osci.io To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki