http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47960

           Summary: dlopen call during DSO initialization breaks C++ RTTI
           Product: gcc
           Version: 4.3.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: a_salni...@yahoo.com


Hi, 

I am debugging a complex problem with our Linux-based applications sometimes
crashing in mysterious ways. This is the usual kind of exception RTTI problem,
where an exception thrown in one DSO is not correctly recognized in another
DSO. We know well that DSOs and C++ RTTI do not always mix, but we follow all
the standard advice about how to build the apps to make RTTI work correctly and
it still breaks.

Our apps are a mixture of the Python interpreter and many C++ shared libraries
loaded from Python (using dlopen). Some C++ libs in turn use dlopen to load
other shared libraries. Everything is linked with the correct flags (no symbol
hiding) and all dlopen calls use the RTLD_GLOBAL flag, so we do expect things
to work correctly. Things do work correctly, but only when we link the DSOs
together with the C++ main(), thus eliminating the top-level dlopen call (the
other dlopen calls still remain). With LD_DEBUG I was able to confirm that in
that case all typeinfo instances are resolved correctly and bound to one
instance in the library linked to the main app. When Python calls dlopen on
the same library, LD_DEBUG shows that typeinfo resolution fails and there are
two instances of the typeinfo object for the Exception type in question.

I tried to reproduce the problem with a simple example involving just a couple
of DSOs, and after some hair pulling I managed to do it. The peculiarity of the
case (which I did not recognize initially) is that one of the dlopen() calls
happens from the constructor of a global object (that is, during the
initialization of the corresponding DSO). If all dlopen calls happen in the
regular way (after main() starts) then there is no problem at all. But if
dlopen() happens during DSO initialization, then that DSO is somehow not used
in the symbol lookup for the dlopen'ed library, even though the DSO has
RTLD_GLOBAL set.
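
For readers without the attachment, the init-time pattern described above can
be sketched roughly as follows. This is not the attached code: the helper
open_global, the struct InitTimeLoader and the path ./libb.so are names assumed
for illustration; only the RTLD_GLOBAL flag and the TEST_GLOBAL_INIT variable
used later in this report come from the report itself.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>

// Hypothetical helper: open a library so that its symbols become
// globally visible. Returns nullptr (and logs dlerror()) on failure.
void* open_global(const char* path) {
    assert(path != nullptr);  // this sketch expects an explicit path
    void* handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
    }
    return handle;
}

// A global object. When this code lives in a DSO, the constructor runs
// while that DSO is still being initialized, i.e. inside the outer
// dlopen() call that is loading it.
struct InitTimeLoader {
    void* handle = nullptr;

    InitTimeLoader() {
        // Assumed switch: only load during initialization on request.
        if (std::getenv("TEST_GLOBAL_INIT")) {
            handle = open_global("./libb.so");  // assumed path
        }
    }
};

static InitTimeLoader init_time_loader;
```

Built into liba.so, the constructor of init_time_loader would issue the nested
dlopen() before the outer dlopen() of liba.so has returned, which is the
situation that triggers the problem.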

The example code that I attach here demonstrates exactly this. To build the
example app just do (should work on Linux without patching):

% tar zxf example.tgz
% make

This will build the main app called 'main' and two DSOs: liba.so and libb.so.
The main app calls dlopen for liba.so and calls a run() function from it.
liba.so calls dlopen on libb.so either from its run() function or from its DSO
init code, depending on a particular envvar, and then calls the run() function
from libb. libb's run() throws an exception that liba's run() tries to catch
and analyze.
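
The load-and-call step that the main app performs might look something like
the sketch below. It is an illustration under assumptions, not the attached
source: the function name load_and_run and its return codes are made up, and
it assumes run is exported with C linkage and takes no arguments.

```cpp
#include <cassert>
#include <cstdio>
#include <dlfcn.h>

// Hypothetical helper: dlopen a library with RTLD_GLOBAL, look up a
// function named "run" and call it.
// Returns 0 on success, 1 if the library cannot be opened,
// 2 if it has no "run" symbol.
int load_and_run(const char* path) {
    assert(path != nullptr);

    void* handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }

    void* sym = dlsym(handle, "run");
    if (!sym) {
        std::fprintf(stderr, "no run() in %s\n", path);
        dlclose(handle);
        return 2;
    }

    // Assumed signature: extern "C" void run();
    void (*run)() = reinterpret_cast<void (*)()>(sym);
    run();

    // The handle is deliberately left open so that the library's
    // typeinfo objects stay mapped for the rest of the process.
    return 0;
}
```

Under these assumptions the main app would call load_and_run("./liba.so"), and
liba's run() would do the same for libb.so inside a try block.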

To show default correct behavior with dlopen called only from inside main():

% ./main
As expected:
&typeid(ex):            0x2b594ce6e600
&typeid(Exception):     0x2b594ce6e600
typeid(ex).name:        9Exception
typeid(Exception).name: 9Exception
typeid(Exception)==typeid(ex): true

To see what happens when dlopen is called from liba init code:

% TEST_GLOBAL_INIT=1 ./main
*** Not expected:
&typeid(ex):            0x2b4532ad2050
&typeid(Exception):     0x2b45328d0600
typeid(ex).name:        9Exception
typeid(Exception).name: 9Exception
typeid(Exception)==typeid(ex): false

In this case the exception cannot be caught with its real type (it is caught as
std::exception), so RTTI is totally broken. The name in the exception typeinfo
is still correct, but the addresses of the typeinfo in liba and libb are
different.
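
A check along the following lines produces output in the shape shown above and
also flags the "same name, different address" case explicitly. It is a sketch
only: the function names are assumptions, and the Exception class here is a
stand-in assumed to derive from std::exception. Whether operator== on
std::type_info compares addresses or names depends on the libstdc++ version and
configuration, so the sketch tests both properties separately.

```cpp
#include <cassert>
#include <cstring>
#include <exception>
#include <sstream>
#include <string>
#include <typeinfo>

// Stand-in for the Exception class of the example (assumed to derive
// from std::exception, since the report says it is caught as one).
class Exception : public std::exception {};

// True when two type_info objects carry the same mangled name but are
// distinct objects, i.e. the typeinfo was duplicated across DSOs.
bool typeinfo_is_split(const std::type_info& a, const std::type_info& b) {
    return &a != &b && std::strcmp(a.name(), b.name()) == 0;
}

// Build a report comparing the dynamic type of ex with Exception.
std::string describe(const std::exception& ex) {
    const std::type_info& thrown = typeid(ex);  // dynamic type of ex
    const std::type_info& expected = typeid(Exception);

    std::ostringstream out;
    out << std::boolalpha
        << "&typeid(ex):            "
        << static_cast<const void*>(&thrown) << '\n'
        << "&typeid(Exception):     "
        << static_cast<const void*>(&expected) << '\n'
        << "typeid(ex).name:        " << thrown.name() << '\n'
        << "typeid(Exception).name: " << expected.name() << '\n'
        << "typeid(Exception)==typeid(ex): " << (expected == thrown) << '\n';
    if (typeinfo_is_split(thrown, expected)) {
        out << "*** duplicate typeinfo objects for " << expected.name() << '\n';
    }
    return out.str();
}

// Debug-build tripwire: abort as soon as a duplicated typeinfo is seen.
void assert_single_typeinfo(const std::exception& ex) {
    assert(!typeinfo_is_split(typeid(ex), typeid(Exception)) &&
           "duplicate typeinfo for Exception");
}
```

Calling describe() from a catch (const std::exception&) handler in liba would
print the two addresses side by side; within a single binary the two addresses
should always agree.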

From what I gather, the C++ code in the example should be legal; global object
initialization should not have restrictions on what functions it can call. But
it seems that the implementation of RTTI in gcc relies on features that do not
always work.

Is there any way to fix the situation or at least to produce some kind of
diagnostics when this situation happens?

Regards,
Andy
