Jeff,

I have an FC4 x86 w/ OSCAR bits on it :-). Let me know if you want access.

-Paul

Jeff Squyres wrote:
Yoinks. Let me try to scrounge up an FC4 box to reproduce this on. If it really is an -O problem, this segv may just be the symptom, not the cause (seems likely, because mca_pls_rsh_component is a statically-defined variable -- accessing a member of it should definitely not cause a segv). :-(


On Dec 18, 2005, at 12:11 PM, Greg Watson wrote:

Sure seems like it:

(gdb) p *mca_pls_rsh_component.argv@4
$12 = {0x90e0428 "ssh", 0x90e0438 "-x", 0x0, 0x11 <Address 0x11 out of bounds>}
(gdb) p mca_pls_rsh_component.argc
$13 = 2
(gdb) p local_exec_index
$14 = 3


Greg

On Dec 18, 2005, at 4:56 AM, Rainer Keller wrote:

Hello Greg,
I don't know whether it's segfaulting at that particular line, but could you please print the argv? I suspect the local_exec_index into the argv might be wrong.

Thanks,
Rainer

On Saturday 17 December 2005 19:16, Greg Watson wrote:
Here's the stacktrace:

#0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
714                         if (mca_pls_rsh_component.debug) {
(gdb) where
#0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
#1  0x00a29642 in orte_rmgr_urm_spawn ()
    from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
#2  0x0804a0d4 in orterun (argc=4, argv=0xbff88594) at orterun.c:373
#3  0x08049b16 in main (argc=4, argv=0xbff88594) at main.c:13

And the contents of mca_pls_rsh_component:

(gdb) p mca_pls_rsh_component
$2 = {super = {pls_version = {mca_major_version = 1, mca_minor_version = 0,
        mca_release_version = 0, mca_type_name = "pls", '\0' <repeats 28 times>,
        mca_type_major_version = 1, mca_type_minor_version = 0,
        mca_type_release_version = 0,
        mca_component_name = "rsh", '\0' <repeats 60 times>,
        mca_component_major_version = 1, mca_component_minor_version = 0,
        mca_component_release_version = 1,
        mca_open_component = 0xae0a80 <orte_pls_rsh_component_open>,
        mca_close_component = 0xae09a0 <orte_pls_rsh_component_close>},
      pls_data = {mca_is_checkpointable = true},
      pls_init = 0xae093c <orte_pls_rsh_component_init>}, debug = false,
    reap = true, assume_same_shell = true, delay = 1, priority = 10,
    argv = 0x90e0418, argc = 2, orted = 0x90de438 "orted",
    path = 0x90e0960 "/usr/bin/ssh", num_children = 0, num_concurrent = 128,
    lock = {super = {obj_class = 0x804ec38, obj_reference_count = 1},
      m_lock_pthread = {__data = {__lock = 0, __count = 0, __owner = 0,
          __kind = 0, __nusers = 0, __spins = 0},
        __size = '\0' <repeats 23 times>, __align = 0}, m_lock_atomic = {u = {
          lock = 0, sparc_lock = 0 '\0', padding = "\000\000\000"}}}, cond = {
      super = {obj_class = 0x804ec18, obj_reference_count = 1}, c_waiting = 0,
      c_signaled = 0, c_cond = {__data = {__lock = 0, __futex = 0,
          __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
          __nwaiters = 0, __broadcast_seq = 0},
        __size = '\0' <repeats 47 times>, __align = 0}}}

I can't see why it is segfaulting at this particular line.

Greg

On Dec 16, 2005, at 5:55 PM, Jeff Squyres wrote:
On Dec 16, 2005, at 10:47 AM, Greg Watson wrote:
I finally worked out why I couldn't reproduce the problem. You're not going to like it though.
You're right -- this kind of buglet is among the most un-fun.  :-(

Here's the stack trace from the core file:

#0  0x00e93fe8 in orte_pls_rsh_launch ()
    from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
#1  0x0023c642 in orte_rmgr_urm_spawn ()
    from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
#2  0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
#3  0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
Can you recompile this one file with -g?  Specifically, cd into the
orte/mca/pls/rsh dir and "make clean".  Then "make".  Then cut-n-paste
the compile line for that one file to a shell prompt, and add a -g.

Then either re-install that component (it looks like you're doing a
dynamic build with separate components, so you can do "make install"
right from the rsh dir) or re-link liborte, re-install it, and re-run.
The corefile might give something a little more meaningful in this
case...?

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller       email: kel...@hlrs.de
  High Performance Computing     Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)        Fax: ++49 (0)711-678 7626
  POSTAL:Nobelstrasse 19             http://www.hlrs.de/people/keller
  ACTUAL:Allmandring 30, R. O.030
  70550 Stuttgart


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/





--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
