Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2006-01-04 Thread Jeff Squyres
On Dec 30, 2005, at 4:15 AM, Graziano Giuliani wrote: #0 0xb7ca2599 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c: 716 716 if (mca_pls_rsh_component.debug) { which means we have a memory corruption somewhere else... Agreed. Investigating from outside on what

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-30 Thread Graziano Giuliani
Ok Brian, for the build part, attached is my config.log. About stacktrace, I have with my compile options from gdb: #0 0xb7d105b9 in orte_pls_rsh_launch () from /home/cluster/openmpi/lib/openmpi/mca_pls_rsh.so and recompiling with -g #0 0xb7ca2599 in orte_pls_rsh_launch (jobid=1) at pls_

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-28 Thread Brian Barrett
On Dec 28, 2005, at 4:50 AM, Graziano Giuliani wrote: Hi all, can confirm this bug also on Linux Debian testing with kernel 2.6.14 and gcc (GCC) 4.0.3 20051201 (prerelease) (Debian 4.0.2-5) running WRF atmospheric model compiled with portland pgf90. For who cares about this, it needs just

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-28 Thread Graziano Giuliani
Hi all, can confirm this bug also on Linux Debian testing with kernel 2.6.14 and gcc (GCC) 4.0.3 20051201 (prerelease) (Debian 4.0.2-5) running WRF atmospheric model compiled with portland pgf90. For who cares about this, it needs just a little patch in the RSL layer of the model to convert fort

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-22 Thread Greg Watson
Yes it appears to be the exact same error. Greg On Dec 22, 2005, at 5:25 AM, Jeff Squyres wrote: Blast! Is it still a segv in the rsh component? (I should be able to try this myself on an FC4 machine in a day or two) On Dec 21, 2005, at 2:11 PM, Greg Watson wrote: I just tried 1.0.2a1

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-22 Thread Jeff Squyres
Blast! Is it still a segv in the rsh component? (I should be able to try this myself on an FC4 machine in a day or two) On Dec 21, 2005, at 2:11 PM, Greg Watson wrote: I just tried 1.0.2a1r8580 but the problem is still there... Greg On Dec 20, 2005, at 5:02 PM, Jeff Squyres wrote: I thin

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-21 Thread Greg Watson
I just tried 1.0.2a1r8580 but the problem is still there... Greg On Dec 20, 2005, at 5:02 PM, Jeff Squyres wrote: I think we found the problem and committed a fix this afternoon to both the trunk and v1.0 branch. Anything after r8564 should have the fix. Greg -- could you try again? On Dec

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-20 Thread Jeff Squyres
I think we found the problem and committed a fix this afternoon to both the trunk and v1.0 branch. Anything after r8564 should have the fix. Greg -- could you try again? On Dec 19, 2005, at 4:59 PM, Paul H. Hargrove wrote: Jeff, I have an FC4 x86 w/ OSCAR bits on it :-). Let me know

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-19 Thread Paul H. Hargrove
Jeff, I have an FC4 x86 w/ OSCAR bits on it :-). Let me know if you want access. -Paul Jeff Squyres wrote: Yoinks. Let me try to scrounge up an FC4 box to reproduce this on. If it really is an -O problem, this segv may just be the symptom, not the cause (seems likely, because mca_rsh

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-19 Thread Jeff Squyres
Yoinks. Let me try to scrounge up an FC4 box to reproduce this on. If it really is an -O problem, this segv may just be the symptom, not the cause (seems likely, because mca_rsh_pls_component is a statically-defined variable -- accessing a member on it should definitely not cause a segv).

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-18 Thread Greg Watson
Sure seems like it: (gdb) p *mca_pls_rsh_component.argv@4 $12 = {0x90e0428 "ssh", 0x90e0438 "-x", 0x0, 0x11 <Address 0x11 out of bounds>} (gdb) p mca_pls_rsh_component.argc $13 = 2 (gdb) p local_exec_index $14 = 3 Greg On Dec 18, 2005, at 4:56 AM, Rainer Keller wrote: Hello Greg, I don't know, whether it's s
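The gdb output above shows argc = 2 and local_exec_index = 3, so the index points past both the populated slots and the NULL terminator of argv. A minimal sketch of a guard against that kind of out-of-range write follows. The helper names argv_index_is_valid and argv_set_checked are hypothetical and are not taken from the Open MPI sources; this illustrates the failure mode only and is not the fix that was committed.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Open MPI): returns 1 when `index`
 * refers to one of the `argc` populated slots of a NULL-terminated argv,
 * and 0 otherwise. */
int argv_index_is_valid(char *const *argv, int argc, int index)
{
    if (argv == NULL || argc < 0) {
        return 0;
    }
    /* Short-circuit evaluation keeps argv[index] from being read
     * unless the index has already been range-checked. */
    return index >= 0 && index < argc && argv[index] != NULL;
}

/* Hypothetical helper: overwrites argv[index] only when the index is in
 * range. Returns 0 on success, -1 when the write would land outside the
 * populated slots (for example argc == 2, index == 3). */
int argv_set_checked(char **argv, int argc, int index, char *value)
{
    if (!argv_index_is_valid(argv, argc, index)) {
        return -1;
    }
    argv[index] = value;
    return 0;
}
```

With the values from the debugger session (argc 2, index 3), argv_set_checked would return -1 instead of writing past the end of the array.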

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-18 Thread Rainer Keller
Hello Greg, I don't know, whether it's segfaulting at that particular line, but could You please print the argv, since I guess, that might be the local_exec_index into the argv being wrong? Thanks, Rainer On Saturday 17 December 2005 19:16, Greg Watson wrote: > Here's the stacktrace: > > #0 0

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-17 Thread Greg Watson
Here's the stacktrace: #0 0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714 714 if (mca_pls_rsh_component.debug) { (gdb) where #0 0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714 #1 0x00a29642 in orte_rmgr_urm_spawn () from /usr/l

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-16 Thread Jeff Squyres
On Dec 16, 2005, at 10:47 AM, Greg Watson wrote: I finally worked out why I couldn't reproduce the problem. You're not going to like it though. You're right -- this kind of buglet is among the most un-fun. :-( Here's the stacktrace from the core file: #0 0x00e93fe8 in orte_pls_rsh_launch (

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-16 Thread Greg Watson
Jeff, I finally worked out why I couldn't reproduce the problem. You're not going to like it though. As before, this is running on FC4 and I'm using 1.0.1r8453 (the 1.0.1 release version). First test: $ ./configure --with-devel-headers --prefix=/usr/local/ompi $ make $ make install $ mpi

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-01 Thread Jeff Squyres
On Dec 1, 2005, at 10:58 AM, Greg Watson wrote: @#$%^& it! I can't get the problem to manifest for either branch now. Well, that's good for me. :-) FWIW, the problem existed on systems that could/would return different addresses in different processes from mmap() for memory that was common
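For readers unfamiliar with the mmap() issue mentioned above: when the same shared segment can be mapped at different addresses in different processes, a raw pointer stored inside the segment is only meaningful in the process that wrote it. One common way around this is to store offsets from the segment base instead. The sketch below illustrates that general technique under stated assumptions: shm_node_t, shm_ptr_to_offset, and shm_offset_to_ptr are hypothetical names, and nothing here is claimed to be what the sm btl does or how the fix was made.

```c
#include <stddef.h>

/* Hypothetical node stored inside a shared segment. The link is kept as
 * a byte offset from the segment base rather than as a pointer, so it
 * stays valid whatever address each process maps the segment at.
 * Offset 0 is reserved as "no node", so slot 0 must not hold a node. */
typedef struct {
    size_t next_offset;
    int    value;
} shm_node_t;

/* Convert an address inside the segment to an offset from its base. */
size_t shm_ptr_to_offset(const void *base, const void *ptr)
{
    if (ptr == NULL) {
        return 0;
    }
    return (size_t)((const char *)ptr - (const char *)base);
}

/* Convert an offset back to an address using this process's own base. */
void *shm_offset_to_ptr(void *base, size_t offset)
{
    if (offset == 0) {
        return NULL;
    }
    return (char *)base + offset;
}
```

Each process calls shm_offset_to_ptr with the base address its own mmap() call returned, so two processes can follow the same link even though their base addresses differ.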

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-12-01 Thread Greg Watson
@#$%^& it! I can't get the problem to manifest for either branch now. Greg On 30/11/2005, at 2:12 PM, Jeff Squyres wrote: On Nov 30, 2005, at 2:12 PM, Greg Watson wrote: Fedora Core 4 on x86. I installed the overnight snapshot from trunk first and immediately got the error. Then I tried 1.0.

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-11-30 Thread Jeff Squyres
On Nov 30, 2005, at 2:12 PM, Greg Watson wrote: Fedora Core 4 on x86. I installed the overnight snapshot from trunk first and immediately got the error. Then I tried 1.0.x and it worked. Want debugging info? Blah! Yes, please send any debugging info that you have -- SVN r number and a backtr

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-11-30 Thread Greg Watson
Fedora Core 4 on x86. I installed the overnight snapshot from trunk first and immediately got the error. Then I tried 1.0.x and it worked. Want debugging info? Greg On 30/11/2005, at 10:14 AM, Jeff Squyres wrote: No, I was not aware of this -- I migrated all the changes from the trunk to t

Re: [O-MPI devel] sm btl/signal 11 problem on Linux

2005-11-30 Thread Jeff Squyres
No, I was not aware of this -- I migrated all the changes from the trunk to the v1.0 branch (not the other way around). What kind of systems are you running into this on? On Nov 30, 2005, at 10:37 AM, Greg Watson wrote: You probably already know this, but this problem is still present in the

[O-MPI devel] sm btl/signal 11 problem on Linux

2005-11-30 Thread Greg Watson
You probably already know this, but this problem is still present in the trunk. It appears to be fixed in the 1.0.x tree. Greg