Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
> On Wed, 14 Sep 2005, Brian Barrett wrote:
>
> > I committed some code that should fix the timer problems on SPARC
> > linux. Can you either svn up and try again (or, if you are using
> > nightly builds) try tomorrow's tarball and see if it works? The test
> > tests/util/opal_timer.c should give an indication as to whether
> > everything is working ok or not.
> >
> > Thanks!
> >
> > Brian
>
> I'll try it tomorrow (the 15th). Thanks for the response.

Nightly tarball is missing sparcv9/timer.h

Current svn checkout will not compile -- fails:

  ../../../../../opal/include/sys/sparcv9/timer.h:44: error: `opal_timer_t' undeclared (first use in this function)

which is true, because it is commented out with '#if 0' brackets. If you define it, build fails with

  {standard input}: Assembler messages:
  {standard input}:61: Error: Illegal operands
  {standard input}:195: Error: Illegal operands
  {standard input}:292: Error: Illegal operands

from opal_progress.c --- I don't know why yet.

> Regards,

--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
Gah... The #if 0 and missing header are my bad - I'll fix those tonight. can you rerun the compiler on the file that errors out, but with the -S option to get the assembly file? It would be useful to know what is on lines 61, 195, and 292.

Thanks!

Brian

On Sep 15, 2005, at 8:36 AM, Ferris McCormick wrote:

> On Wed, 14 Sep 2005, Brian Barrett wrote:
>
> > I committed some code that should fix the timer problems on SPARC
> > linux. Can you either svn up and try again (or, if you are using
> > nightly builds) try tomorrow's tarball and see if it works? The test
> > tests/util/opal_timer.c should give an indication as to whether
> > everything is working ok or not.
> >
> > Thanks!
> >
> > Brian
>
> I'll try it tomorrow (the 15th). Thanks for the response.
>
> Nightly tarball is missing sparcv9/timer.h
>
> Current svn checkout will not compile -- fails:
>
>   ../../../../../opal/include/sys/sparcv9/timer.h:44: error: `opal_timer_t' undeclared (first use in this function)
>
> which is true, because it is commented out with '#if 0' brackets. If
> you define it, build fails with
>
>   {standard input}: Assembler messages:
>   {standard input}:61: Error: Illegal operands
>   {standard input}:195: Error: Illegal operands
>   {standard input}:292: Error: Illegal operands
>
> from opal_progress.c --- I don't know why yet.
>
> Regards,
> --
> Ferris McCormick (P44646, MI)
> Developer, Gentoo Linux (Sparc, Devrel)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [O-MPI devel] 64bit shared library problems
Followup for the list... a bit of explanation of Nathan's problem about shared libraries and unresolved symbols.

Short version:
--------------

It's an OMPI bug when built as a shared library (not an issue for static libraries). The fix is straightforward, but involves grunt work. I'll try to get a student to do it RSN.

Long version:
-------------

What's happening is that we are not linking OMPI components against the opal/orte/ompi libraries. As such, we are exploiting the fact that when they are dlopened by a standalone application (e.g., a.out), the Libtool portable version of dlopen() exports all the symbols from the parent process such that the child can find and use them at run-time to resolve any unknown symbols.

Here's an example (I'm leaving out some fine-grained details, and it's slightly different on different OS's, but this is "true enough" for the purposes of this thread):

- a.out, which was linked against libopal.so (and friends), launches
- the linker runs into an unresolved symbol
- the linker sees that the symbol is supposed to be in "libopal.so", and starts searching LD_LIBRARY_PATH for it
- the linker finds libopal.so, loads it, and is able to resolve the symbol

It gets interesting at this part:

- within MPI_Init()/orte_init()/opal_init() (i.e., however you initialized yourself to OMPI/ORTE/OPAL), we use the Libtool portable dlopen() to open our components
- the components will have unresolved symbols as well (i.e., the symbols in libopal, liborte, and libmpi)
- when the linker hits these, it tries to resolve them.
- first, the linker looks in the public namespace of the process, and if it finds the symbols there, it's done
- in this case, libopal (and friends) have already been loaded in the process, so the linker can find the symbols right away -- without loading any additional libraries

This is the scheme that we were relying on for libopal/orte/ompi symbols to be resolved in our components. And for standalone executables, it works fine.

But for an environment like Eclipse, it doesn't. I don't know anything about Eclipse, but I'm assuming that it does something similar to our component system -- it dlopen's them. However -- here's where my guess comes in -- it doesn't make all the symbols in the opened component be in the public namespace of the process (this is different than what OMPI does, for various reasons).

Hence, if you build an Eclipse component against OMPI, the Eclipse component will be dynamically linked against libopal (etc.). So when Eclipse loads in your component, similar to the standalone executable example above, the linker will realize that it has unresolved symbols and will use the normal mechanism to resolve them (e.g., look for libopal.so in LD_LIBRARY_PATH).

The problem comes in when we dlopen OMPI/ORTE/OPAL components. Our scheme assumed that we'd be able to find the opal/orte/ompi symbols in the public namespace of the parent process. But they're not -- Eclipse loaded the component in a private namespace, and therefore all the opal/orte/ompi symbols are in that private namespace. And therefore the OMPI/ORTE/OPAL components can't find the symbols, and the linker barfs.

The solution is to change our scheme in OMPI a bit. We just need to add a few lines to all the component Makefile.am's to, in the dynamic case, link the components against their relevant libraries (opal components linked against libopal, orte components linked against liborte and libopal, etc.). This does not make the components significantly larger -- it just adds an entry into the dynamic linker section of the component's resulting .so file indicating "if you have unresolved symbols, go look in libopal.so" (etc.).

This allows the components themselves to pull in shared libraries when they are dlopened -- if they need to. If the symbols can be resolved in the parent process' public symbol namespace, they still will be (as in the standalone executable example, above). But if they can't be resolved that way, this gives the ability to explicitly find and pull in a shared library and resolve the symbols that way (as in the Eclipse plugin example, above).

Aren't computers fun? :-)

On Sep 14, 2005, at 12:47 PM, Nathan DeBardeleben wrote:

> Let me explain what I'm doing real quickly. I have a piece of Java
> code which is calling OMPI calls. It's doing this through JNI (java
> native interface). Don't worry, you don't have to understand Java to
> try and help me here. The JNI code is C with some funky macros in it
> provided by Java. I have to compile the JNI C code into a shared
> library and then the Java code will load it dynamically when that
> class is instantiated.
>
> So - here's my compile line:
>
> [sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I /usr/java/jdk1.5.0_04/include/linux -c ptp_ompi_jni.c -fPIC
> [sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.
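The public-versus-private namespace distinction described in the message above can be probed with plain POSIX dlopen(). The following is only a minimal sketch, not code from Open MPI or Libtool: the helper names `visible_globally` and `visible_via_private_handle` are hypothetical, and it assumes a platform where `dlopen(NULL, ...)` returns a handle that searches the main program, its startup dependencies, and anything later opened with RTLD_GLOBAL.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `symbol` can be found through the
 * process-wide (public) namespace, 0 otherwise. A NULL result from
 * dlsym() is treated as "not found", which is good enough for a probe. */
int visible_globally(const char *symbol)
{
    void *self = dlopen(NULL, RTLD_NOW);
    if (self == NULL) {
        return 0;
    }
    int found = (dlsym(self, symbol) != NULL);
    dlclose(self);
    return found;
}

/* Hypothetical helper: opens `path` privately (RTLD_LOCAL), the way a
 * plugin host might, and looks `symbol` up through that handle only.
 * Returns 1 if found, 0 if not, -1 if the library cannot be opened.
 * The handle is deliberately left open so the library stays loaded. */
int visible_via_private_handle(const char *path, const char *symbol)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        return -1;
    }
    return dlsym(handle, symbol) != NULL;
}
```

Calling `visible_via_private_handle()` on a library and then `visible_globally()` on one of its symbols shows whether a later-loaded component could rely on the public namespace alone. The answer depends on what the process already had loaded, which is exactly the fragility the message describes.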
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:

> Gah... The #if 0 and missing header are my bad - I'll fix those
> tonight. can you rerun the compiler on the file that errors out, but
> with the -S option to get the assembly file? It would be useful to
> know what is on lines 61, 195, and 292.
>
> Thanks!
>
> Brian

Yes, I can. I tried compiling a dummy program with just the time.h and

  val = opal_sys_timer_get_cycles();

At first glance, it looks like

  mov %tick, %o4

is generating the error. I've been fighting other things all day, so I can't provide much more than that right now. I'll verify with the actual failure tomorrow, if the problem persists. (I am using the svn pull right now.)

> Regards,

--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
Re: [O-MPI devel] 64bit shared library problems
On Sep 15, 2005, at 4:32 PM, Jeff Squyres wrote:

> This allows the components themselves to pull in shared libraries when
> they are dlopened -- if they need to. If the symbols can be resolved
> in the parent process' public symbol namespace, they still will be (as
> in the standalone executable example, above). But if they can't be
> resolved that way, this gives the ability to explicitly find and pull
> in a shared library and resolve the symbols that way (as in the Eclipse
> plugin example, above).

I forgot to include the simple example that shows this. Here's how our components are today (this is the paffinity linux component, but they're all this way):

[15:15] odin:~/svn/ompi/opal/mca/paffinity/linux % ldd .libs/mca_paffinity_linux.so
        libm.so.6 => /lib/libm.so.6 (0x002a9566b000)
        libutil.so.1 => /lib/libutil.so.1 (0x002a957f1000)
        libnsl.so.1 => /lib/libnsl.so.1 (0x002a958f4000)
        libc.so.6 => /lib/libc.so.6 (0x002a95a0b000)
        /lib64/ld-linux-x86-64.so.2 (0x00552000)

You can see that there's no mention of libopal, even though the paffinity linux component makes libopal function calls. Here's how they are after I have fixed the Makefile.am and re-linked it:

[15:16] odin:~/svn/ompi/opal/mca/paffinity/linux % ldd .libs/mca_paffinity_linux.so
        libopal.so.0 => /u/jsquyres/bogus/lib/libopal.so.0 (0x002a9565a000)
        libm.so.6 => /lib/libm.so.6 (0x002a957c8000)
        libutil.so.1 => /lib/libutil.so.1 (0x002a9594e000)
        libnsl.so.1 => /lib/libnsl.so.1 (0x002a95a52000)
        libc.so.6 => /lib/libc.so.6 (0x002a95b68000)
        libdl.so.2 => /lib/libdl.so.2 (0x002a95d8d000)
        /lib64/ld-linux-x86-64.so.2 (0x00552000)

Notice the explicit mention of libopal.so. This is what allows the component to resolve symbols independent of the parent process, if necessary.

Hope that all makes sense! And if it doesn't, what do you care? :-)

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
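One way to see from code what the two ldd listings imply is to load a component with immediate symbol resolution and print the loader's complaint. This is a hedged sketch using only POSIX dlopen()/dlerror(), not the Libtool wrapper the thread refers to; the function name `try_load_component` is hypothetical. With RTLD_NOW, a component that neither lists libopal as a dependency nor finds its symbols in the public namespace would be expected to fail here, not at first call.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to load `path` privately, resolving every
 * symbol immediately (RTLD_NOW). Returns 0 on success. On failure it
 * returns -1 and copies the loader's message into `err`, which is always
 * NUL-terminated when errlen > 0. */
int try_load_component(const char *path, char *err, size_t errlen)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle != NULL) {
        dlclose(handle);
        if (errlen > 0) {
            err[0] = '\0';
        }
        return 0;
    }

    /* dlerror() describes the most recent failure, e.g. a missing file
     * or an undefined symbol. */
    const char *msg = dlerror();
    if (errlen > 0) {
        snprintf(err, errlen, "%s", msg != NULL ? msg : "unknown dlopen failure");
    }
    return -1;
}
```

Pointing this at a component path such as the `.libs/mca_paffinity_linux.so` shown above, before and after the Makefile.am change, would be one way to check the fix from inside a host that loads plugins privately. The exact message text is loader-specific.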
Re: [O-MPI devel] 64bit shared library problems
Jeff and everyone else I contacted about this: thanks for helping track down the problem. I've been beating my head on this for a few days and don't have the library experience to have caught these nuances. Thanks again!

-- Nathan

Correspondence
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
---------------------------------------------------------------------

Jeff Squyres wrote:

> Followup for the list... a bit of explanation of Nathan's problem about
> shared libraries and unresolved symbols.
> [...]
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Thu, 15 Sep 2005, Ferris McCormick wrote:

> On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:
>
> > Gah... The #if 0 and missing header are my bad - I'll fix those
> > tonight. can you rerun the compiler on the file that errors out, but
> > with the -S option to get the assembly file? It would be useful to
> > know what is on lines 61, 195, and 292.
> >
> > Thanks!
> >
> > Brian
>
> Yes, I can. I tried compiling a dummy program with just the time.h and
>
>   val = opal_sys_timer_get_cycles();
>
> At first glance, it looks like
>
>   mov %tick, %o4
>
> is generating the error. I've been fighting other things all day, so I
> can't provide much more than that right now. I'll verify with the
> actual failure tomorrow, if the problem persists. (I am using the svn
> pull right now.)

A little experimentation suggests that instead of "mov %tick, ..." we need "rd %tick,%o4". I'll verify tomorrow when I am on a system with a build on it, but at least "rd %tick,%o4" assembles properly but "mov %tick,%o4" gives an error.

Regards,
Ferris

--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (sparc, devrel)
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Sep 15, 2005, at 8:17 PM, Ferris McCormick wrote:

> On Thu, 15 Sep 2005, Ferris McCormick wrote:
>
> > On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:
> >
> > > Gah... The #if 0 and missing header are my bad - I'll fix those
> > > tonight. can you rerun the compiler on the file that errors out,
> > > but with the -S option to get the assembly file? It would be
> > > useful to know what is on lines 61, 195, and 292.
> > >
> > > Thanks!
> > >
> > > Brian
> >
> > Yes, I can. I tried compiling a dummy program with just the time.h
> > and
> >
> >   val = opal_sys_timer_get_cycles();
> >
> > At first glance, it looks like
> >
> >   mov %tick, %o4
> >
> > is generating the error. I've been fighting other things all day, so
> > I can't provide much more than that right now. I'll verify with the
> > actual failure tomorrow, if the problem persists. (I am using the
> > svn pull right now.)
>
> A little experimentation suggests that instead of "mov %tick, ..." we
> need "rd %tick,%o4". I'll verify tomorrow when I am on a system with a
> build on it, but at least "rd %tick,%o4" assembles properly but
> "mov %tick,%o4" gives an error.

Yeah, ok, that makes sense. Damn Solaris for letting me be lazy ;). Let me know if that works and I'll commit the change. I already committed the fixes so that the type wouldn't be #if 0'ed and the header would be in the dist tarball.

Thanks!

Brian