Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)

2005-09-15 Thread Ferris McCormick

> On Wed, 14 Sep 2005, Brian Barrett wrote:
> 
> > I committed some code that should fix the timer problems on SPARC
> > linux.  Can you either svn up and try again (or, if you are using
> > nightly builds) try tomorrow's tarball and see if it works?  The test
> > tests/util/opal_timer.c should give an indication as to whether
> > everything is working ok or not.
> >
> > Thanks!
> >
> > Brian
> >
> 
> I'll try it tomorrow (the 15th).  Thanks for the response.

Nightly tarball is missing sparcv9/timer.h
Current svn checkout will not compile -- fails:
../../../../../opal/include/sys/sparcv9/timer.h:44: error:
`opal_timer_t' undeclared (first use in this function)
which is true, because it is commented out with '#if 0' brackets.

If you define it, build fails with 
{standard input}: Assembler messages:
{standard input}:61: Error: Illegal operands
{standard input}:195: Error: Illegal operands
{standard input}:292: Error: Illegal operands
from opal_progress.c --- I don't know why yet.

> 
Regards,
-- 
Ferris McCormick (P44646, MI) 
Developer, Gentoo Linux (Sparc, Devrel)




Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)

2005-09-15 Thread Brian Barrett
Gah...  The #if 0 and missing header are my bad - I'll fix those
tonight.  Can you rerun the compiler on the file that errors out, but
with the -S option to get the assembly file?  It would be useful to
know what is on lines 61, 195, and 292.


Thanks!

Brian


On Sep 15, 2005, at 8:36 AM, Ferris McCormick wrote:

> > On Wed, 14 Sep 2005, Brian Barrett wrote:
> >
> > > I committed some code that should fix the timer problems on SPARC
> > > linux.  Can you either svn up and try again (or, if you are using
> > > nightly builds) try tomorrow's tarball and see if it works?  The test
> > > tests/util/opal_timer.c should give an indication as to whether
> > > everything is working ok or not.
> > >
> > > Thanks!
> > >
> > > Brian
> >
> > I'll try it tomorrow (the 15th).  Thanks for the response.
>
> Nightly tarball is missing sparcv9/timer.h
> Current svn checkout will not compile -- fails:
> ../../../../../opal/include/sys/sparcv9/timer.h:44: error:
> `opal_timer_t' undeclared (first use in this function)
> which is true, because it is commented out with '#if 0' brackets.
>
> If you define it, build fails with
> {standard input}: Assembler messages:
> {standard input}:61: Error: Illegal operands
> {standard input}:195: Error: Illegal operands
> {standard input}:292: Error: Illegal operands
> from opal_progress.c --- I don't know why yet.
>
> Regards,
> --
> Ferris McCormick (P44646, MI)
> Developer, Gentoo Linux (Sparc, Devrel)




Re: [O-MPI devel] 64bit shared library problems

2005-09-15 Thread Jeff Squyres
Followup for the list... a bit of explanation of Nathan's problem about 
shared libraries and unresolved symbols.


Short version:
--------------

It's an OMPI bug when built as a shared library (not an issue for 
static libraries).  The fix is straightforward, but involves grunt 
work.  I'll try to get a student to do it RSN.


Long version:
-------------

What's happening is that we are not linking OMPI components against the 
opal/orte/ompi libraries.  As such, we are exploiting the fact that 
when they are dlopened by a standalone application (e.g., a.out), the 
Libtool portable version of dlopen() exports all the symbols from the 
parent process such that the child can find and use them at run-time to 
resolve any unknown symbols.  Here's an example (I'm leaving out some 
fine-grained details, and it's slightly different on different OS's, 
but this is "true enough" for the purposes of this thread):


- a.out, which was linked against libopal.so (and friends), launches
- the linker runs into an unresolved symbol
- the linker sees that the symbol is supposed to be in "libopal.so", 
and starts searching LD_LIBRARY_PATH for it
- the linker finds libopal.so, loads it, and is able to resolve the 
symbol


It gets interesting at this part:

- within MPI_Init()/orte_init()/opal_init() (i.e., however you 
initialized yourself to OMPI/ORTE/OPAL), we use the Libtool portable 
dlopen() to open our components
- the components will have unresolved symbols as well (i.e., the 
symbols in libopal, liborte, and libmpi)

- when the linker hits these, it tries to resolve them.
- first, the linker looks in the public namespace of the process, and 
if it finds the symbols there, it's done
- in this case, libopal (and friends) have already been loaded in the 
process, so the linker can find the symbols right away -- without 
loading any additional libraries


This is the scheme that we were relying on for libopal/orte/ompi 
symbols to be resolved in our components.  And for standalone 
executables, it works fine.
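
A minimal sketch of the "public namespace" lookup described above, assuming a POSIX system and plain dlopen()/dlsym() rather than the Libtool wrapper mentioned in this thread. The helper name in_public_namespace is hypothetical. It only asks whether a symbol can be found among the objects already loaded into the process, without loading anything new.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper for illustration: returns 1 if `name` can be
 * resolved from the process's global symbol scope (the main program,
 * the libraries loaded at startup, and anything later opened with
 * RTLD_GLOBAL), and 0 otherwise. */
int in_public_namespace(const char *name)
{
    assert(name != NULL);

    /* A NULL path asks for a handle on the global scope itself. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL) {
        return 0;
    }

    /* Nothing is loaded here; the lookup only searches what is
     * already present in the process. */
    int found = (dlsym(self, name) != NULL);

    dlclose(self);
    return found;
}
```

In a dynamically linked program, in_public_namespace("printf") would be expected to succeed because libc is already loaded. On older glibc versions the program may need to be linked with -ldl.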


But for an environment like Eclipse, it doesn't.

I don't know anything about Eclipse, but I'm assuming that it does 
something similar to our component system -- it dlopen's them.  However 
-- here's where my guess comes in -- it doesn't make all the symbols in 
the opened component be in the public namespace of the process (this is 
different than what OMPI does, for various reasons).  Hence, if you 
build an Eclipse component against OMPI, the Eclipse component will be 
dynamically linked against libopal (etc.).  So when Eclipse loads in 
your component, similar to the standalone executable example above, the 
linker will realize that it has unresolved symbols and will use the 
normal mechanism to resolve them (e.g., look for libopal.so in 
LD_LIBRARY_PATH).


The problem comes in when we dlopen OMPI/ORTE/OPAL components.

Our scheme assumed that we'd be able to find the opal/orte/ompi symbols 
in the public namespace of the parent process.  But they're not -- 
Eclipse loaded the component in a private namespace, and therefore all 
the opal/orte/ompi symbols are in that private namespace.  And 
therefore the OMPI/ORTE/OPAL components can't find the symbols, and the 
linker barfs.
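
The public/private distinction above corresponds to the RTLD_GLOBAL and RTLD_LOCAL flags of POSIX dlopen(). The sketch below uses plain dlopen() instead of the Libtool wrapper, and the names component_flags and load_component are hypothetical, not Open MPI or Eclipse APIs.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: choose dlopen() flags for a plugin.
 * RTLD_GLOBAL makes the plugin's symbols available for resolving
 * objects that are loaded later.  RTLD_LOCAL keeps them private to
 * the returned handle. */
int component_flags(int make_public)
{
    return RTLD_NOW | (make_public ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Hypothetical helper: open a component file, returning NULL on failure.
 * With make_public == 0, objects opened later cannot resolve their
 * undefined symbols against this one. */
void *load_component(const char *path, int make_public)
{
    assert(path != NULL);
    return dlopen(path, component_flags(make_public));
}
```

If the host application behaves like make_public == 0, as guessed above, a component opened afterwards has to name its own library dependencies in order to resolve its symbols.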


The solution is to change our scheme in OMPI a bit.  We just need to 
add a few lines to all the component Makefile.am's to, in the dynamic 
case, link the components against their relevant libraries (opal 
components linked against libopal, orte components linked against 
liborte and libopal, etc.).  This does not make the components 
significantly larger -- it just adds an entry into the dynamic linker 
section of the component's resulting .so file indicating "if you have 
unresolved symbols, go look in libopal.so" (etc.).


This allows the components themselves to pull in shared libraries when 
they are dlopened -- if they need to.  If the symbols can be resolved 
in the parent process' public symbol namespace, they still will be (as 
in the standalone executable example, above).  But if they can't be 
resolved that way, this gives the ability to explicitly find and pull 
in a shared library and resolve the symbols that way (as in the Eclipse 
plugin example, above).


Aren't computers fun?  :-)


On Sep 14, 2005, at 12:47 PM, Nathan DeBardeleben wrote:


> Let me explain what I'm doing real quickly.
>
> I have a piece of Java code which is calling OMPI calls.  It's doing
> this through JNI (java native interface).  Don't worry, you don't have
> to understand Java to try and help me here.  The JNI code is C with
> some funky macros in it provided by Java.
>
> I have to compile the JNI C code into a shared library and then the
> Java code will load it dynamically when that class is instantiated.
>
> So - here's my compile line:
>
> [sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.0_04/include -I /usr/java/jdk1.5.0_04/include/linux -c ptp_ompi_jni.c -fPIC
> [sparkplug]~/<2>ompi > mpicc -I /usr/java/jdk1.5.

Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)

2005-09-15 Thread Ferris McCormick
On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:
> Gah...  The #if 0 and missing header are my bad - I'll fix those  
> tonight.  can you rerun the compiler on the file that errors out, but  
> with the -S option to get the assembly file?  It would be useful to  
> know what is on lines 61, 195, and 292.
> 
> Thanks!
> 
> Brian
> 
Yes, I can.  I tried compiling a dummy program with just the time.h and
 val = opal_sys_timer_get_cycles();

At first glance, it looks like
mov %tick, %o4
is generating the error.  I've been fighting other things all day, so I
can't provide much more than that right now.  I'll verify with the
actual failure tomorrow, if the problem persists.  (I am using the svn
pull right now.)

> 
Regards,

-- 
Ferris McCormick (P44646, MI) 
Developer, Gentoo Linux (Sparc, Devrel)




Re: [O-MPI devel] 64bit shared library problems

2005-09-15 Thread Jeff Squyres

On Sep 15, 2005, at 4:32 PM, Jeff Squyres wrote:

> This allows the components themselves to pull in shared libraries when
> they are dlopened -- if they need to.  If the symbols can be resolved
> in the parent process' public symbol namespace, they still will be (as
> in the standalone executable example, above).  But if they can't be
> resolved that way, this gives the ability to explicitly find and pull
> in a shared library and resolve the symbols that way (as in the Eclipse
> plugin example, above).


I forgot to include the simple example that shows this.  Here's how our 
components are today (this is the paffinity linux component, but 
they're all this way):


[15:15] odin:~/svn/ompi/opal/mca/paffinity/linux % ldd .libs/mca_paffinity_linux.so
libm.so.6 => /lib/libm.so.6 (0x002a9566b000)
libutil.so.1 => /lib/libutil.so.1 (0x002a957f1000)
libnsl.so.1 => /lib/libnsl.so.1 (0x002a958f4000)
libc.so.6 => /lib/libc.so.6 (0x002a95a0b000)
/lib64/ld-linux-x86-64.so.2 (0x00552000)

You can see that there's no mention of libopal, even though the 
paffinity linux component makes libopal function calls.


Here's how they are after I have fixed the Makefile.am and re-linked it:

[15:16] odin:~/svn/ompi/opal/mca/paffinity/linux % ldd .libs/mca_paffinity_linux.so
libopal.so.0 => /u/jsquyres/bogus/lib/libopal.so.0 (0x002a9565a000)
libm.so.6 => /lib/libm.so.6 (0x002a957c8000)
libutil.so.1 => /lib/libutil.so.1 (0x002a9594e000)
libnsl.so.1 => /lib/libnsl.so.1 (0x002a95a52000)
libc.so.6 => /lib/libc.so.6 (0x002a95b68000)
libdl.so.2 => /lib/libdl.so.2 (0x002a95d8d000)
/lib64/ld-linux-x86-64.so.2 (0x00552000)

Notice the explicit mention of libopal.so.  This is what allows the 
component to resolve symbols independent of the parent process, if 
necessary.


Hope that all makes sense!  And if it doesn't, what do you care?  :-)

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI devel] 64bit shared library problems

2005-09-15 Thread Nathan DeBardeleben
Jeff and everyone else I contacted about this: thanks for helping track 
down the problem.  I've been beating my head on this for a few days and 
don't have the library experience to have caught these nuances.  Thanks 
again!


-- Nathan
Correspondence
-
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
Parallel Tools Team
High Performance Computing Environments
phone: 505-667-3428
email: ndeb...@lanl.gov
-




Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)

2005-09-15 Thread Ferris McCormick


On Thu, 15 Sep 2005, Ferris McCormick wrote:

> On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:
> > Gah...  The #if 0 and missing header are my bad - I'll fix those
> > tonight.  can you rerun the compiler on the file that errors out, but
> > with the -S option to get the assembly file?  It would be useful to
> > know what is on lines 61, 195, and 292.
> >
> > Thanks!
> >
> > Brian
>
> Yes, I can.  I tried compiling a dummy program with just the time.h and
>  val = opal_sys_timer_get_cycles();
>
> At first glance, it looks like
>    mov %tick, %o4
> is generating the error.  I've been fighting other things all day, so I
> can't provide much more than that right now.  I'll verify with the
> actual failure tomorrow, if the problem persists.  (I am using the svn
> pull right now.)



A little experimentation suggests that instead of "mov %tick, ..." we need
"rd %tick,%o4".  I'll verify tomorrow when I am on a system with a build 
on it, but at least "rd %tick,%o4" assembles properly, whereas
"mov %tick,%o4" gives an error.
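
A sketch of what a corrected accessor might look like with GCC-style inline assembly. On SPARC V9, %tick is a state register that is read with the rd instruction, and the GNU assembler rejected it as an operand of mov, as reported above. The uint64_t typedef, the 64-bit-only guard, and the clock() fallback for other architectures are assumptions made so the example builds anywhere. They are not taken from the Open MPI source.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: a 64-bit cycle counter type.  The real definition lives
 * in the header discussed in this thread. */
typedef uint64_t opal_timer_t;

static inline opal_timer_t opal_sys_timer_get_cycles(void)
{
#if defined(__sparc__) && defined(__arch64__)
    opal_timer_t ret;
    /* "rd" reads the tick register into a general-purpose register;
     * the "mov" form is what the GNU assembler refused. */
    __asm__ __volatile__("rd %%tick, %0" : "=r"(ret));
    return ret;
#else
    /* Fallback for non-SPARC builds of this sketch only. */
    clock_t now = clock();
    assert(now != (clock_t) -1);
    return (opal_timer_t) now;
#endif
}
```

Two successive calls should return non-decreasing values, which is roughly the kind of sanity check a timer test can make.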


Regards,
Ferris

--
Ferris McCormick (P44646, MI) 
Developer, Gentoo Linux (sparc, devrel)


Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)

2005-09-15 Thread Brian Barrett

On Sep 15, 2005, at 8:17 PM, Ferris McCormick wrote:

> On Thu, 15 Sep 2005, Ferris McCormick wrote:
>
> > On Thu, 2005-09-15 at 15:26 -0500, Brian Barrett wrote:
> > > Gah...  The #if 0 and missing header are my bad - I'll fix those
> > > tonight.  can you rerun the compiler on the file that errors out, but
> > > with the -S option to get the assembly file?  It would be useful to
> > > know what is on lines 61, 195, and 292.
> > >
> > > Thanks!
> > >
> > > Brian
> >
> > Yes, I can.  I tried compiling a dummy program with just the time.h and
> > val = opal_sys_timer_get_cycles();
> >
> > At first glance, it looks like
> >    mov %tick, %o4
> > is generating the error.  I've been fighting other things all day, so I
> > can't provide much more than that right now.  I'll verify with the
> > actual failure tomorrow, if the problem persists.  (I am using the svn
> > pull right now.)
>
> A little experimentation suggests that instead of "mov %tick, ..." we need
> "rd %tick,%o4".  I'll verify tomorrow when I am on a system with a build
> on it, but at least "rd %tick,%o4" assembles properly but "mov %tick,%o4"
> gives an error.



Yeah, ok, that makes sense.  Damn Solaris for letting me be lazy ;).

Let me know if that works and I'll commit the change.  I already 
committed the fixes so that the type wouldn't be #if 0'ed and the 
header would be in the dist tarball.


Thanks!

Brian