Number Cruncher wrote:
Many thanks for the informative explanation, Jeff. I appreciate this issue has been the cause of some deliberation!


This was the changeset where we did the ABI fixes -- ensuring that if you compile/link against Open MPI vA.B.C, you will be able to just change your LD_LIBRARY_PATH or replace libmpi.so when upgrading to Open MPI vA.B.x.
But -- to make a long
story short -- the linker will embed the length of the ompi_communicator_t in the hello executable. Among other reasons, it's so that pointer math can be computed properly.

The goal is admirable and a stalwart of good open source practice which avoids "DLL-Hell". However, I simply don't understand how my compiler can *ever* know the size of your opaque ompi_communicator_t?

And in this case the client code doesn't need to know the size of the struct, we are only concerned with the address of the opaque handle so it can be passed on.
mpi.h declares MPI_Comm as a pointer to as-yet-undefined "struct ompi_communicator_t", ompi_mpi_comm_world is an external global of as-yet-undefined "struct ompi_predefined_communicator_t". Then the compiler must try and work out whether the "comm_predefined_t *" aliases the "comm_t *". This isn't possible without full type information. In general, only (void *) and (char *) can alias arbitrary types.

See http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html for an excellent discussion on aliasing.

However, let's assume the user upgrades to OMPI vA.B.(C+1) in place, meaning that libmpi.so is replaced with a new version. If we've changed the back-end ompi_communicator_t struct (e.g., added a member), then the size that is included in the hello executable no longer matches the size that is in libmpi.so -- and Bad Things can (and do) happen.

I don't get how the linker can affect the actual object code generated for my hello.o, except via relocation of symbols. If none of my code makes use of, or even *knows* about your internal struct definitions, and only holds pointers to them, then I can't do any pointer arithmetic or anything that requires internal information; I get: "invalid use of undefined type ‘struct XXXX'" or similar.

I can hold a simple pointer to your private object, but this should maintain ABI compatibility; isn't that what the PIMPL C++ idiom is all about and widely used (e.g. Qt Toolkit). The pointer-to-global gets relocated at shared library load time.

In C++ you can do the above however that is not valid C code when trying to initialize a global pointer. The initializer must be a compile-time constant. It was this artifact that made us go down the path of padded structures.


Hence, the solution of r20627: make a "dummy" type that is guaranteed to be larger than ompi_communicator_t -- ompi_predefined_communicator_t. It's actually a struct that *contains* ompi_communicator_t and then a fixed amount of padding at the end. Since MPI_COMM_WORLD will always be this ompi_predefined_communicator_t, we can ensure that its size stays constant, even if the size of ompi_communicator_t changes. Hence, the size in the hello executable and libmpi.so will be the same. Happiness.

With respect, this feels a little bit like a hack. Who's to say your future communicator won't need to be even bigger than the current padding allows? - and then the assumptions made when linking against the old predefined_t would no longer apply leading to corruption.
However, we've have enough room to store a pointer to an extension to the structure. So, if we reach to the point that we are near the end of our pad the last thing we put in a structure should be a pointer to an extension. Gross but workable. I am actually not that convinced we will actually reach that point but who knows.


The compiler complains because ompi_mpi_comm_world is declared as an
"extern struct ompi_predefined_communicator_t" but the type is
incomplete, so it can't tell whether the cast is a permissible
almost-the-same type pun (e.g. an "int" can alias an "unsigned").

I think this is potentially a serious performance issue for anyone using OpenMPI in a C++ environment, and the profuse warnings preclude it's use
in our build system.


I agree that the warnings are Bad.

The question is -- why do they only show up in gcc 4.1? More specifically -- why do they *not* show up in other versions of gcc? Is it a gcc 4.1 compiler bug?

Older GCC weren't as strict about aliasing issues. Perhaps the latter ones recognise that in this context (a function call with pointer to non-local parameter), no optimisation would be possible anyway.


The bad news is that the only work around I have is to insert (void *)
casts between (MPI_TYPENAME) and the address operator, e.g.:
#define MPI_COMM_WORLD (((MPI_Comm)(void *)&(ompi_mpi_comm_world)))


Hmm.  I guess that's plausible, but ugly.

Actually, I think it's closer to what you've been arguing above: you're insisting that the compiler blindly interpret &ompi_mpi_comm_world as a pointer to some memory that really is equivalent to the other unknown type communicator_t. Without the intermediate (void *), you're suggesting the compiler could possibly do better optimization.

Ultimately, for internal use, the (void *) is bad, but from client code with no knowledge of your types, it should be mandatory and tells the compiler to make no assumptions about aliasing.


An alternative might be to make the full type definition available by
#including some of the internal developer headers such as
ompi/communicator/communicator.h



I'm not sure what the right solution is, but that is not attractive. :-)


Completely agree (on both points!).

In summary, I still don't see why holding pointers to opaque types is not ABI-compatible, and would suggest the (void *) compiler hint in the meantime.

Just to re-iterate what I said above holding pointers to opaque types would be ABI-compatible. However, in C you cannot initialize a global pointer with a non-compile-time constant. Unfortunately, the MPI specification requires one to be able to assign the opaque handles, like MPI_COMM_WORLD and such, to a global variable.

--td

Reply via email to