On Tue, Nov 12, 2019 at 7:11 AM Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

>
> The second issue was related to some old version symbols from the standard
> C++ library (yes like Java, .NET Core is implemented in C++). More
> specifically OSv would crash due to failing to find a symbol:
>
> /libhostfxr.so: failed looking up symbol _ZSt15system_categoryv
> (std::system_category())
>
> [backtrace]
> 0x00000000403562b5 <elf::object::symbol(unsigned int, bool)+1013>
> 0x000000004035637f <elf::object::resolve_pltgot(unsigned int)+127>
> 0x0000000040356559 <elf_resolve_pltgot+57>
> 0x000000004039ce2f <???+1077530159>
> 0x0000000000000001 <???+1>
>
> The symbol actually exists in OSv but apparently there are many versions
> of this symbol in the shared library version of *libstdc++.so.6* which
> apparently are not present in the statically linked version of it in OSv
> kernel.
>

On my host I see:

$ nm -CD /usr/lib64/libstdc++.so.6.0.27
00000000000d7660 T std::_V2::system_category()
00000000000a8820 T std::system_category()

$ nm -C /usr/lib/gcc/x86_64-redhat-linux/9/libstdc++.a
0000000000000000 T std::_V2::system_category()

This "_V2" thing is not a symbol version, it's a namespace in the C++ code.
It seems to me like a bug in the static library which misses the one in the
std namespace. Maybe it should be reported to gcc or Fedora, or you want to
investigate it yourself, but I don't think this is an OSv bug. Apparently
other people have noticed this too:
https://lists.llvm.org/pipermail/llvm-dev/2018-May/123745.html

My guess is that what is happening is that new code compiles to use this
_V2 ABI, so it doesn't have problems. But old code which uses the older ABI
(without _V2) doesn't work with the static library.
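
To illustrate the namespace point, here is a minimal sketch. The names "lib" and "where" are made up for the example, and __PRETTY_FUNCTION__ is a GCC/Clang extension, so treat this as a demonstration rather than anything libstdc++ actually does internally. It shows that a function declared in an inline namespace is called as if it lived in the outer namespace, while its real (and therefore mangled) name includes _V2:

```cpp
#include <string>

// Hypothetical library namespace, standing in for std.
namespace lib {
// An inline namespace: callers write lib::where(), but the entity is
// really lib::_V2::where(), and that is the name that gets mangled.
inline namespace _V2 {
inline const char* where() {
    // GCC/Clang extension: the fully qualified signature of this function.
    return __PRETTY_FUNCTION__;
}
} // inline namespace _V2
} // namespace lib

// Returns true if the function reached through lib::where() actually
// lives in the _V2 namespace.
inline bool resolves_to_v2() {
    return std::string(lib::where()).find("_V2") != std::string::npos;
}
```

Code compiled against the older headers would instead reference the name without _V2 (here _ZSt15system_categoryv), which is why it fails against a library that only defines the _V2 variant.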


>
> Is it because during static linking the linker only uses the latest version of
> the symbol? In any case, the solution was to hide libstdc++.so.6 from the
> OSv dynamic linker and add libstdc++.so.6 from the host to the image. But I
> wonder if it is NOT as simple as that, because I may be missing
> something, and that something leads to my biggest problem, which I
> describe at the end.
>

> @@ -1193,7 +1217,7 @@ program::program(void* addr)
>            "libpthread.so.0",
>            "libdl.so.2",
>            "librt.so.1",
> -          "libstdc++.so.6",
> +          //"libstdc++.so.6",
>            "libaio.so.1",
>            "libxenstore.so.3.0",
>            "libcrypt.so.1",
>
> So after I fixed that I came across another weird problem, most likely
> caused by the linker, which Linux somehow deals with but OSv does not. In
> essence, one of the symbols from the .NET Core library libcoreclr.so -
> *gCurrentThreadInfo* - is not found by OSv even though it is there. But
> readelf, for example, complains about it:
>
> readelf -s libcoreclr.so | grep gCurrentThreadInfo
> readelf: Warning: local symbol 31 found at index >= .dynsym's sh_info
> value of 1
>     31: 0000000000000000    24 TLS     LOCAL  HIDDEN    19
> gCurrentThreadInfo
>   9799: 0000000000000000    24 TLS     LOCAL  HIDDEN    19
> gCurrentThreadInfo
>
> So the first occurrence is in the '.dynsym' table which is weird, the
> second one is in the .symtab. Somehow I am able to run the same app just
> fine on Linux.
>

Maybe there's a problem that this is a STB_LOCAL and not STB_GLOBAL?

STB_LOCAL symbols should be visible inside the same object, but not
outside, maybe we didn't implement this correctly. I guess you'll need to
add printouts and see what goes on when the code tries to look for this
symbol.
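
As a starting point for such printouts, here is a rough sketch of how the binding can be decoded from a symbol's st_info byte and how a lookup could decide whether a local symbol is acceptable. The constants mirror the ELF gABI values (STB_LOCAL = 0, STB_GLOBAL = 1, STB_WEAK = 2); the helper names are hypothetical and this is not OSv's actual lookup code. It also ignores visibility and undefined symbols:

```cpp
#include <cstdint>
#include <string>

// Binding values from the ELF gABI, defined locally so that the sketch
// does not depend on <elf.h>.
constexpr unsigned kBindLocal = 0;
constexpr unsigned kBindGlobal = 1;
constexpr unsigned kBindWeak = 2;

// st_info packs the binding in the high 4 bits and the type in the low 4.
constexpr unsigned binding(std::uint8_t st_info) { return st_info >> 4; }
constexpr unsigned type(std::uint8_t st_info) { return st_info & 0xf; }

// Human-readable binding, suitable for a debug printout.
inline std::string binding_name(std::uint8_t st_info) {
    switch (binding(st_info)) {
    case kBindLocal:  return "LOCAL";
    case kBindGlobal: return "GLOBAL";
    case kBindWeak:   return "WEAK";
    default:          return "OTHER";
    }
}

// A local symbol may satisfy a reference only when the reference comes
// from the same object that defines it; global and weak symbols may
// satisfy references from anywhere.
constexpr bool usable_for(std::uint8_t st_info, bool same_object) {
    return binding(st_info) != kBindLocal || same_object;
}
```

For the gCurrentThreadInfo entry above (LOCAL, type TLS, so st_info is 6), usable_for(6, true) is true and usable_for(6, false) is false.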


>
> Here is what I found about it in this example -
> https://github.com/dynup/kpatch/issues/854#issuecomment-390330525:
> "the local symbols after the globals in this section"
>
> versus ELF spec:
>
> "The global symbols immediately follow the local symbols in the symbol
> table. The first global symbol is identified by the symbol table sh_info
> value. Local and global symbols are always kept separate in this manner,
> and cannot be mixed together."
>

I wonder why we even need to care about this, though... Since each symbol
is marked global or local, who cares about their order?
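
For reference, the ordering rule that readelf warns about can be checked with a few lines. This is only a sketch of the check as I understand it from the spec text quoted above, with a made-up function name; it takes the st_info bytes of a symbol table and the section's sh_info value:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of symbols that break the "locals first" layout:
// entries with STB_LOCAL binding (high 4 bits of st_info equal to 0)
// found at an index >= sh_info, where only global/weak symbols belong.
inline std::vector<std::size_t>
misplaced_locals(const std::vector<std::uint8_t>& st_info,
                 std::size_t sh_info) {
    std::vector<std::size_t> out;
    for (std::size_t i = sh_info; i < st_info.size(); ++i) {
        if ((st_info[i] >> 4) == 0) {
            out.push_back(i);
        }
    }
    return out;
}
```

A lookup that checks each symbol's own binding, instead of relying on sh_info, would not be affected by such a layout.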

>
> I have also found this issue in coreclr (one of the .NET Core components)
> - https://github.com/dotnet/coreclr/issues/23621 - where they report and
> fix an almost identical 'symbol not found' for ARM musl by tweaking their
> build chain (switching away from the gold linker).
>

I don't understand the details there, but we already avoid the gold linker by
setting "LD=ld.bfd" in the Makefile (
https://github.com/cloudius-systems/osv/commit/d21e39fee2fa0b9a90873340509b8f4031e44bf4
)


>
> In that example, they deal with it by sorting the table. Not sure I really
> understand this problem. In either case, I came up with a terrible hack to
> deal with it myself that maybe also leads to my next and final issue which
> is the real blocker.
>
> @@ -688,6 +691,10 @@ void object::relocate_rela()
>          void *addr = _base + p->r_offset;
>          auto addend = p->r_addend;
>
> +        if (sym == 31) {
> +            continue;
> +        }
> +
>
>
But won't this cause this symbol not to work correctly? I assume the code
needs it to work correctly? :-)


> So with all that the app boots but crashes like so:
>
> trying to execute null pointer
> [backtrace]
> 0x000000004039e2de <page_fault+302>
> 0x000000004039d0a6 <???+1077530790>
> 0x0000100000dcf492 <???+14480530>
> 0x0000100000dcf67a <???+14481018>
> 0x0000100000dcf024 <???+14479396>
> 0x0000100000dcee8d <???+14478989>
> 0x0000100000d1a991 <???+13740433>
> 0x0000100000cf375c <???+13580124>
> 0x0000100000a0ad8e <???+10530190>
> 0xffffa000009035df <???+9450975>
> 0xffff006f732e7468 <???+1932424296>
>
> When I connect with dbg I get this stack trace (I believe for thread 36)
> which seems to indicate the stack is corrupt:
> 36 (0xffff8000015e1040) /HelloApp       cpu0 status::running
> sched::thread::switch_to() at arch/x64/arch-switch.hh:108 vruntime
>  1.4495e-20
> 37 (0xffff800001c93040) >/HelloApp      cpu0 status::waiting
> do_poll(std::vector<poll_file, std::allocator<poll_file> >&,
> boost::optional<std::chrono::time_point<osv::clock::uptime,
> std::chrono::duration<long, std::ratio<1l, 1000000000l> > > >) at
> core/poll.cc:274 vruntime  1.68081e-21
>
> (gdb) bt
> #0  0x00000000403a4522 in processor::cli_hlt () at
> arch/x64/processor.hh:247
> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
> #2  osv::halt () at arch/x64/power.cc:26
> #3  0x00000000402381a4 in abort (fmt=fmt@entry=0x4061c1a8 "trying to
> execute null pointer") at runtime.cc:132
> #4  0x000000004039e2df in page_fault (ef=0xffff8000015e6068) at
> arch/x64/mmu.cc:30
> #5  <signal handler called>
> #6  0x0000000000000000 in ?? ()
> #7  0x0000100000cf49da in ?? ()
> #8  0x0000200000200780 in ?? ()
> #9  0x0000000000000000 in ?? ()
>
> When I switch to the other (and only) child thread, the stack looks much better:
> (gdb) osv thread 37
> (gdb) bt
> #0  sched::thread::switch_to (this=0x230, this@entry=0xffff80000005b040)
> at arch/x64/arch-switch.hh:108
> #1  0x00000000403f7184 in sched::cpu::reschedule_from_interrupt
> (this=0xffff80000001e040, called_from_yield=called_from_yield@entry=false,
>     preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
> #2  0x00000000403f767c in sched::cpu::schedule () at
> include/osv/sched.hh:1310
> #3  0x00000000403f7d62 in sched::thread::wait
> (this=this@entry=0xffff800001c93040) at core/sched.cc:1214
> #4  0x0000000040415a08 in
> sched::thread::do_wait_until<sched::noninterruptible,
> sched::thread::dummy_lock, do_poll(std::vector<poll_file>&,
> file::timeout_t)::<lambda()> > (mtx=<synthetic pointer>..., pred=...) at
> /usr/include/c++/8/bits/atomic_base.h:390
> #5  sched::thread::wait_until<do_poll(std::vector<poll_file>&,
> file::timeout_t)::<lambda()> > (pred=...) at include/osv/sched.hh:1077
> #6  do_poll (pfd=std::vector of length 0, capacity 0, _timeout=...) at
> core/poll.cc:274
> #7  0x0000000040415da2 in file::poll_many (_pfd=0x200000300e68, _nfds=1,
> timeout=...) at /usr/include/c++/8/new:169
> #8  0x0000000040416041 in file::poll_sync (timeout=..., pfd=...,
> this=<optimized out>) at /usr/include/c++/8/new:169
> #9  poll_one (timeout=..., pfd=...) at core/poll.cc:334
> #10 poll (_pfd=0x200000300e68, _nfds=<optimized out>, _timeout=<optimized
> out>) at core/poll.cc:351
> #11 0x00001000010c970e in StgIO::ReadFromDisk(void*, unsigned int,
> unsigned int*) ()
> #12 0x00001000010c92e8 in StgIO::GetPtrForMem(unsigned int, unsigned int,
> void*&) ()
> #13 0x00001000010c8f64 in StgIO::FreePageMap() ()
> #14 0x00001000010d186d in MDInternalRW::FindTypeDef(char const*, char
> const*, unsigned int, unsigned int*) ()
> #15 0x000000004045b7e6 in pthread_private::pthread::<lambda()>::operator()
> (__closure=0xffff800000021798) at libc/pthread.cc:114
> #16 std::_Function_handler<void(), pthread_private::pthread::pthread(void*
> (*)(void*), void*, sigset_t, const
> pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const
> std::_Any_data &) (__functor=...) at
> /usr/include/c++/8/bits/std_function.h:297
> #17 0x00000000403f8b07 in sched::thread_main_c (t=0xffff800001c93040) at
> arch/x64/arch-switch.hh:321
> #18 0x000000004039e023 in thread_main () at arch/x64/entry.S:113
>
> I have a feeling this somehow has to do with TLS. I think .NET Core uses
> dynamic TLS, which OSv supports well, minus any bugs we are not aware of.
>
> Any suggestions on how to debug/fix it?
>
> Thanks in advance,
> Waldek
>
> --
> You received this message because you are subscribed to the Google Groups
> "OSv Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to osv-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/osv-dev/e80111ab-b213-4571-87c0-898ab90636ef%40googlegroups.com
> <https://groups.google.com/d/msgid/osv-dev/e80111ab-b213-4571-87c0-898ab90636ef%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
