I would personally suggest not reconfiguring your system simply to support a 
particular version of OMPI. The only difference between the 1.4 and 1.5 series 
wrt slurm is that we changed a few things to support a more recent version of 
slurm. It is relatively easy to backport that code to the 1.4 series, and it 
should be (mostly) backward compatible.

OMPI is agnostic wrt resource managers. We try to support all platforms, with 
our effort reflective of the needs of our developers and their organizations, 
and our perception of the relative size of the user community for a particular 
platform. Slurm is a fairly small community, mostly centered in the three DOE 
weapons labs, so our support for that platform tends to focus on their usage.

So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?

I have created a ticket to upgrade the 1.4.4 release (due out any time now) 
with the 1.5.1 slurm support. Any interested parties can follow it here:

https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph


On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:

> 
> On 09/02/2011, at 9:16 AM, Ralph Castain wrote:
> 
>> See below
>> 
>> 
>> On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
>> 
>>> 
>>> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>>> 
>>>> Hi Michael,
>>>> 
>>>> You may have tried to send some debug information to the list, but it 
>>>> appears to have been blocked.  Compressed text output of the backtrace is 
>>>> sufficient.
>>> 
>>> 
>>> Odd, I thought I sent it to you directly.  In any case, here is the 
>>> backtrace and some information from gdb:
>>> 
>>> $ salloc -n16 gdb -args mpirun mpi
>>> (gdb) run
>>> Starting program: /mnt/f1/michael/openmpi/bin/mpirun 
>>> /mnt/f1/michael/home/ServerAdmin/mpi
>>> [Thread debugging using libthread_db enabled]
>>> 
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>>> data=0x681170) at base/plm_base_launch_support.c:342
>>> 342     pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
>>> (gdb) bt
>>> #0  0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>>> data=0x681170) at base/plm_base_launch_support.c:342
>>> #1  0x00007ffff78a7338 in event_process_active (base=0x615240) at 
>>> event.c:651
>>> #2  0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at 
>>> event.c:823
>>> #3  0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
>>> #4  0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
>>> #5  0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at 
>>> base/plm_base_launch_support.c:459
>>> #6  0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at 
>>> plm_slurm_module.c:360
>>> #7  0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at 
>>> orterun.c:754
>>> #8  0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
>>> (gdb) print pdatorted
>>> $1 = (orte_proc_t **) 0x67c610
>>> (gdb) print mev
>>> $2 = (orte_message_event_t *) 0x681550
>>> (gdb) print mev->sender.vpid
>>> $3 = 4294967295
>>> (gdb) print mev->sender
>>> $4 = {jobid = 1721696256, vpid = 4294967295}
>>> (gdb) print *mev
>>> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 
>>> 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 
>>> 0x7ffff7bb9a78 "base/plm_base_launch_support.c", 
>>> cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 
>>> 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 
>>> "rml_oob_component.c", line = 279}
>> 
>> The jobid and vpid look like the defined INVALID values, indicating that 
>> something is quite wrong. This would quite likely lead to the segfault.
>> 
>> From this, it would indeed appear that you are getting some kind of library 
>> confusion - the most likely cause of such an error is a daemon from a 
>> different version trying to respond, and so the returned message isn't 
>> correct.
>> 
>> Not sure why else it would be happening...you could try setting -mca 
>> plm_base_verbose 5 to get more debug output displayed on your screen, 
>> assuming you built OMPI with --enable-debug.
>> 
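For reference, the vpid printed by gdb above (4294967295) is UINT32_MAX -- the 
"defined INVALID value" mentioned above -- so pdatorted[mev->sender.vpid] 
indexes the daemon array far past its end. Below is a minimal standalone 
sketch of that failure mode plus the kind of defensive check that would catch 
it; lookup_daemon, daemons and EXAMPLE_VPID_INVALID are illustrative names, 
not actual Open MPI symbols.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define EXAMPLE_VPID_INVALID UINT32_MAX  /* the 4294967295 shown by gdb */

    /* Illustrative lookup that validates the index before dereferencing the
     * per-daemon array, analogous to the pdatorted[mev->sender.vpid] access
     * in the backtrace. */
    static void *lookup_daemon(void **daemons, uint32_t num_daemons,
                               uint32_t vpid)
    {
        if (EXAMPLE_VPID_INVALID == vpid || vpid >= num_daemons) {
            fprintf(stderr, "refusing bogus vpid %u (only %u daemons)\n",
                    (unsigned)vpid, (unsigned)num_daemons);
            return NULL;   /* unguarded, this would read far out of bounds */
        }
        return daemons[vpid];
    }

    int main(void)
    {
        void *daemons[2] = { "hnp", "orted-vpid-1" };
        /* A report whose sender was never filled in carries the sentinel: */
        void *d = lookup_daemon(daemons, 2, EXAMPLE_VPID_INVALID);
        return (NULL == d) ? EXIT_SUCCESS : EXIT_FAILURE;
    }
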
> 
> Found the problem.... It is a site configuration issue, which I'll need to 
> find a workaround for.
> 
> [bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component [slurm] set 
> priority to 75
> [bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component [slurm]
> [bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
> [bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename 
> hash 1936089714
> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
>       srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca 
> ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 2 --hnp-uri 
> "2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca 
> plm_base_verbose 20
> 
> I then inserted some printf's into the ess_slurm_module (rough and ready, I 
> know, but I was in a hurry).
> 
> Just after initialisation: (at around line 345)
> orte_ess_slurm: jobid 2056716288 vpid 1
> So it gets that...
> I narrowed it down to the get_slurm_nodename function, as execution didn't 
> proceed past that point...
> 
> line 401:
>    tmp = strdup(orte_process_info.nodename);
>    printf( "Our node name == %s\n", tmp );
> line 409:
>    for (i=0; NULL !=  names[i]; i++) {
>      printf( "Checking %s\n", names[ i ]);
> 
> Result:
> Our node name == eng-ipc3.{FQDN}
> Checking ipc3
> 
> So it's down to the mismatch between the SLURM node name and the system 
> hostname.  SLURM really encourages you not to use fully qualified hostnames, 
> and I'd prefer not to have to reconfigure the whole system to use short 
> names as hostnames.  However, noting that 1.5.1 worked, I backported some of 
> its code -- it uses getenv( "SLURM_NODE_ID" ) to get the node number, which 
> doesn't rely on an exact string match.  Patching this in makes things 
> partially work, but failures still occur during wire-up with more than one 
> node. 
> 
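For anyone hitting the same mismatch, here is a minimal standalone sketch of 
the environment-based lookup described above (it is not the actual 1.5.1 
ess/slurm code): the node's index within the allocation is taken from the 
launcher's environment, so no hostname string matching is needed. The 
variable spelling is an assumption carried over from the post; verify it with 
something like "srun printenv | grep NODE" on your own cluster.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Spelling as given in the post above; some installations export
         * SLURM_NODEID instead, so both are tried. */
        const char *id = getenv("SLURM_NODE_ID");
        if (NULL == id) {
            id = getenv("SLURM_NODEID");
        }
        if (NULL == id) {
            fprintf(stderr, "no SLURM node-id variable found "
                            "(not launched by srun?)\n");
            return EXIT_FAILURE;
        }
        printf("allocation-relative node index: %ld\n",
               strtol(id, NULL, 10));
        return EXIT_SUCCESS;
    }
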
> I think the solution will have to be to change the hostnames on the system to 
> match what SLURM + Open MPI need.  (Doing this temporarily makes everything 
> work with an unpatched 1.4.3, and the wire-up completes successfully.)  
> Perhaps a note about system hostnames could be added somewhere in the 
> Open MPI / SLURM documentation? 
> 
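Since the eventual fix was renaming the hosts, the small self-contained 
diagnostic below (an illustration only, not anything from Open MPI) shows why 
no comparison could succeed here: gethostname() returns eng-ipc3.{FQDN} while 
SLURM hands out ipc3, so even the portion before the first dot differs. 
Compile it and run it on the node in question with the SLURM node name as the 
argument, e.g. ./hostcheck ipc3 (hostcheck being whatever the binary is 
named).

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char host[256] = "";
        if (gethostname(host, sizeof(host) - 1) != 0) {
            perror("gethostname");
            return 1;
        }
        /* Length of the hostname up to (not including) the first dot. */
        size_t shortlen = strcspn(host, ".");

        printf("gethostname()    : %s\n", host);
        printf("short hostname   : %.*s\n", (int)shortlen, host);

        if (argc > 1) {
            const char *slurm_name = argv[1];  /* e.g. "ipc3" */
            int exact  = (0 == strcmp(host, slurm_name));
            int shortm = (strlen(slurm_name) == shortlen &&
                          0 == strncmp(host, slurm_name, shortlen));
            printf("exact match      : %s\n", exact  ? "yes" : "no");
            printf("short-name match : %s\n", shortm ? "yes" : "no");
        }
        return 0;
    }
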
> Thank you Ralph & Sam for your help.  
> 
> Cheers,
> Michael
> 