See below
On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>
>> Hi Michael,
>>
>> You may have tried to send some debug information to the list, but it
>> appears to have been blocked. Compressed text output of the backtrace text
>> is sufficient.
>
> Odd, I thought I sent it to you directly. In any case, here is the backtrace
> and some information from gdb:
>
> $ salloc -n16 gdb -args mpirun mpi
> (gdb) run
> Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
> [Thread debugging using libthread_db enabled]
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) bt
> #0  0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> #1  0x00007ffff78a7338 in event_process_active (base=0x615240) at event.c:651
> #2  0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
> #3  0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
> #4  0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
> #5  0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
> #6  0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
> #7  0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at orterun.c:754
> #8  0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
> (gdb) print pdatorted
> $1 = (orte_proc_t **) 0x67c610
> (gdb) print mev
> $2 = (orte_message_event_t *) 0x681550
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813,
> obj_class = 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 0x7ffff7bb9a78 "base/plm_base_launch_support.c",
> cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening... You could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

> That vpid looks suspiciously like -1.
>
> Further debugging:
> Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffffffe170, buffer=0x7ffff7b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
> 411         {
> (gdb) print sender
> $2 = (orte_process_name_t *) 0x7fffffffe170
> (gdb) print *sender
> $3 = {jobid = 6822016, vpid = 0}
> (gdb) continue
> Continuing.
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender
> $4 = {jobid = 1778450432, vpid = 4294967295}
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users