Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:

> I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. It is relatively easy to backport that code to the 1.4 series, and it should be (mostly) backward compatible.
>
> OMPI is agnostic wrt resource managers. We try to support all platforms, with our effort reflective of the needs of our developers and their organizations, and our perception of the relative size of the user community for a particular platform. Slurm is a fairly small community, mostly centered in the three DOE weapons labs, so our support for that platform tends to focus on their usage.
>
> So, with that understanding... Sam: can you confirm that 1.5.1 works on your TLCC machines?

Open MPI 1.5.1 works as expected on our TLCC machines. Open MPI 1.4.3 with your SLURM update has also been tested.

> I have created a ticket to upgrade the 1.4.4 release (due out any time now) with the 1.5.1 slurm support. Any interested parties can follow it here:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2717
>
> Ralph

Thanks Ralph!

Sam

On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:

On 09/02/2011, at 9:16 AM, Ralph Castain wrote:

See below

On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

Hi Michael,

You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault. From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.
Not sure why else it would be happening... you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

Found the problem. It is a site configuration issue, which I'll need to find a workaround for.

[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523]
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 9:16 AM, Ralph Castain wrote:

> The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.
>
> From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.
>
> Not sure why else it would be happening... you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

Found the problem. It is a site configuration issue, which I'll need to find a workaround for.

[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca plm_base_verbose 20

I then inserted some printf's into the ess_slurm_module (rough and ready, I know, but I was in a hurry).

Just after initialisation (at around line 345):

orte_ess_slurm: jobid 2056716288 vpid 1

So it gets that... I narrowed it down to the get_slurm_nodename function, as the method didn't proceed past that point.

line 401:
tmp = strdup(orte_process_info.nodename);
printf( "Our node name == %s\n", tmp );

line 409:
for (i=0; NULL != names[i]; i++) {
printf( "Checking %s\n", names[ i ]);

Result:

Our node name == eng-ipc3.{FQDN}
Checking ipc3

So it's down to the mismatch of the slurm name and the hostname. slurm really encourages you not to use the fully qualified hostname, and I'd prefer not to have to reconfigure the
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
See below.

On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:
>
> [...]
>
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening... you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

> That vpid looks suspiciously like -1.
>
> Further debugging:
>
> Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
> 411 {
> (gdb) print sender
> $2 = (orte_process_name_t *) 0x7fffe170
> (gdb) print *sender
> $3 = {jobid = 6822016, vpid = 0}
> (gdb) continue
> Continuing.
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
> 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender
> $4 = {jobid = 1778450432, vpid = 4294967295}
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

That vpid looks suspiciously like -1.

Further debugging:

Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
411 {
(gdb) print sender
$2 = (orte_process_name_t *) 0x7fffe170
(gdb) print *sender
$3 = {jobid = 6822016, vpid = 0}
(gdb) continue
Continuing.
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) print mev->sender
$4 = {jobid = 1778450432, vpid = 4294967295}

The daemon probably died as I spent too long thinking about my gdb input ;)
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:38 AM, Ralph Castain wrote:

> Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

It's installed as a system package, and the software set on all machines is managed by a configuration tool, so the machines should be identical. However, it may be worth checking the dependency versions, and I'll double check that the OMPI versions really do match.
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on the local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> [...]
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi,

A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:

> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>
> Unfortunately, the result is the same:
>
> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
> salloc: Granted job allocation 145
>
> ========================   JOB MAP   ========================
>
> Data for node: Name: eng-ipc4.{FQDN}   Num procs: 8
>         Process OMPI jobid: [6932,1] Process rank: 0
>         Process OMPI jobid: [6932,1] Process rank: 1
>         Process OMPI jobid: [6932,1] Process rank: 2
>         Process OMPI jobid: [6932,1] Process rank: 3
>         Process OMPI jobid: [6932,1] Process rank: 4
>         Process OMPI jobid: [6932,1] Process rank: 5
>         Process OMPI jobid: [6932,1] Process rank: 6
>         Process OMPI jobid: [6932,1] Process rank: 7
>
> Data for node: Name: ipc3   Num procs: 8
>         Process OMPI jobid: [6932,1] Process rank: 8
>         Process OMPI jobid: [6932,1] Process rank: 9
>         Process OMPI jobid: [6932,1] Process rank: 10
>         Process OMPI jobid: [6932,1] Process rank: 11
>         Process OMPI jobid: [6932,1] Process rank: 12
>         Process OMPI jobid: [6932,1] Process rank: 13
>         Process OMPI jobid: [6932,1] Process rank: 14
>         Process OMPI jobid: [6932,1] Process rank: 15
>
> =============================================================
> [eng-ipc4:31754] *** Process received signal ***
> [eng-ipc4:31754] Signal: Segmentation fault (11)
> [eng-ipc4:31754] Signal code: Address not mapped (1)
> [eng-ipc4:31754] Failing at address: 0x8012eb748
> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
> [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
> [eng-ipc4:31754] *** End of error message ***
> salloc: Relinquishing job allocation 145
> salloc: Job allocation 145 has been revoked.
> zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>
> I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file.
>
> Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not?
>
> Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
The 1.4 series is regularly tested on slurm machines after every modification, and has been running at LANL (and other slurm installations) for quite some time, so I doubt that's the core issue. Likewise, nothing in the system depends upon the FQDN (or anything regarding hostname) - it's just used to print diagnostics. Not sure of the issue, and I don't have an ability to test/debug slurm any more, so I'll have to let Sam continue to look into this for you. It's probably some trivial difference in setup, unfortunately. I don't know if you said before, but it might help to know what slurm version you are using. Slurm tends to change a lot between versions (even minor releases), and it is one of the more finicky platforms we support.

On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
>
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>
>> Hi,
>>
>>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>>
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with:
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>>
>> Unfortunately, the result is the same:
>
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1. This works fine:
>
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
>
> JOB MAP
>
> Data for node: ipc3 Num procs: 8
> Process OMPI jobid: [3365,1] Process rank: 0
> Process OMPI jobid: [3365,1] Process rank: 1
> Process OMPI jobid: [3365,1] Process rank: 2
> Process OMPI jobid: [3365,1] Process rank: 3
> Process OMPI jobid: [3365,1] Process rank: 4
> Process OMPI jobid: [3365,1] Process rank: 5
> Process OMPI jobid: [3365,1] Process rank: 6
> Process OMPI jobid: [3365,1] Process rank: 7
>
> Data for node: ipc4 Num procs: 8
> Process OMPI jobid: [3365,1] Process rank: 8
> Process OMPI jobid: [3365,1] Process rank: 9
> Process OMPI jobid: [3365,1] Process rank: 10
> Process OMPI jobid: [3365,1] Process rank: 11
> Process OMPI jobid: [3365,1] Process rank: 12
> Process OMPI jobid: [3365,1] Process rank: 13
> Process OMPI jobid: [3365,1] Process rank: 14
> Process OMPI jobid: [3365,1] Process rank: 15
>
> =
>
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
>
> salloc: Relinquishing job allocation 151
>
> It does seem very much like there is a bug of some sort in 1.4.3?
>
> Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 07/02/2011, at 12:36 PM, Michael Curtis wrote:

> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>
> Hi,
>
>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with:
>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>
> Unfortunately, the result is the same:

To reply to my own post again (sorry!), I tried OpenMPI 1.5.1. This works fine:

salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
salloc: Granted job allocation 151

JOB MAP

Data for node: ipc3 Num procs: 8
 Process OMPI jobid: [3365,1] Process rank: 0
 Process OMPI jobid: [3365,1] Process rank: 1
 Process OMPI jobid: [3365,1] Process rank: 2
 Process OMPI jobid: [3365,1] Process rank: 3
 Process OMPI jobid: [3365,1] Process rank: 4
 Process OMPI jobid: [3365,1] Process rank: 5
 Process OMPI jobid: [3365,1] Process rank: 6
 Process OMPI jobid: [3365,1] Process rank: 7

Data for node: ipc4 Num procs: 8
 Process OMPI jobid: [3365,1] Process rank: 8
 Process OMPI jobid: [3365,1] Process rank: 9
 Process OMPI jobid: [3365,1] Process rank: 10
 Process OMPI jobid: [3365,1] Process rank: 11
 Process OMPI jobid: [3365,1] Process rank: 12
 Process OMPI jobid: [3365,1] Process rank: 13
 Process OMPI jobid: [3365,1] Process rank: 14
 Process OMPI jobid: [3365,1] Process rank: 15

=

Process 2 on eng-ipc3.{FQDN} out of 16
Process 4 on eng-ipc3.{FQDN} out of 16
Process 5 on eng-ipc3.{FQDN} out of 16
Process 0 on eng-ipc3.{FQDN} out of 16
Process 1 on eng-ipc3.{FQDN} out of 16
Process 6 on eng-ipc3.{FQDN} out of 16
Process 3 on eng-ipc3.{FQDN} out of 16
Process 7 on eng-ipc3.{FQDN} out of 16
Process 8 on eng-ipc4.{FQDN} out of 16
Process 11 on eng-ipc4.{FQDN} out of 16
Process 12 on eng-ipc4.{FQDN} out of 16
Process 14 on eng-ipc4.{FQDN} out of 16
Process 15 on eng-ipc4.{FQDN} out of 16
Process 10 on eng-ipc4.{FQDN} out of 16
Process 9 on eng-ipc4.{FQDN} out of 16
Process 13 on eng-ipc4.{FQDN} out of 16
salloc: Relinquishing job allocation 151

It does seem very much like there is a bug of some sort in 1.4.3?

Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

> I just tried to reproduce the problem that you are experiencing and was unable to.
>
> SLURM 2.1.15
> Open MPI 1.4.3 configured with:
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp). Unfortunately, the result is the same:

salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

JOB MAP

Data for node: Name: eng-ipc4.{FQDN} Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 0
 Process OMPI jobid: [6932,1] Process rank: 1
 Process OMPI jobid: [6932,1] Process rank: 2
 Process OMPI jobid: [6932,1] Process rank: 3
 Process OMPI jobid: [6932,1] Process rank: 4
 Process OMPI jobid: [6932,1] Process rank: 5
 Process OMPI jobid: [6932,1] Process rank: 6
 Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3 Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 8
 Process OMPI jobid: [6932,1] Process rank: 9
 Process OMPI jobid: [6932,1] Process rank: 10
 Process OMPI jobid: [6932,1] Process rank: 11
 Process OMPI jobid: [6932,1] Process rank: 12
 Process OMPI jobid: [6932,1] Process rank: 13
 Process OMPI jobid: [6932,1] Process rank: 14
 Process OMPI jobid: [6932,1] Process rank: 15

=

[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi

I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file. Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not?

Michael
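Given the observation above that the launch node reports its FQDN while the other node appears by short name, and that the domain lives only in /etc/hosts, a quick per-node sanity check is to ask the resolver what canonical name it returns for the local hostname. This is a generic diagnostic sketch, not part of Open MPI; it just uses plain POSIX gethostname/getaddrinfo:

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Print this node's own hostname and the canonical name the resolver
 * (DNS or /etc/hosts, per nsswitch.conf) returns for it. Running this
 * on each node shows whether short names and FQDNs resolve
 * consistently, mirroring the eng-ipc4.{FQDN} vs ipc3 asymmetry seen
 * in the job map above. */
int main(void)
{
    char name[256];
    if (gethostname(name, sizeof(name)) != 0) {
        perror("gethostname");
        return 1;
    }
    printf("hostname: %s\n", name);

    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_flags = AI_CANONNAME;   /* ask for the canonical name */

    int rc = getaddrinfo(name, NULL, &hints, &res);
    if (rc != 0) {
        /* A failure here (on a node where mpirun works locally) is
         * itself a useful data point. */
        printf("resolution failed: %s\n", gai_strerror(rc));
        return 0;
    }
    printf("canonical: %s\n", res->ai_canonname ? res->ai_canonname : "(none)");
    freeaddrinfo(res);
    return 0;
}
```

If the canonical name differs between nodes (FQDN on one, short name on another), that at least confirms the asymmetry is coming from the resolver configuration rather than from Open MPI.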
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

> I just tried to reproduce the problem that you are experiencing and was unable to.
>
> SLURM 2.1.15
> Open MPI 1.4.3 configured with:
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I'll dig a bit further.

Interesting. I'll try a local, vanilla (ie, non-debian) build and report back.

Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, I just tried to reproduce the problem that you are experiencing and was unable to.

[samuel@lo1-fe ~]$ salloc -n32 mpirun --display-map ./mpi_app
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 138319
salloc: job 138319 queued and waiting for resources
salloc: job 138319 has been allocated resources
salloc: Granted job allocation 138319

JOB MAP

Data for node: Name: lob083 Num procs: 16
 Process OMPI jobid: [26464,1] Process rank: 0
 Process OMPI jobid: [26464,1] Process rank: 1
 Process OMPI jobid: [26464,1] Process rank: 2
 Process OMPI jobid: [26464,1] Process rank: 3
 Process OMPI jobid: [26464,1] Process rank: 4
 Process OMPI jobid: [26464,1] Process rank: 5
 Process OMPI jobid: [26464,1] Process rank: 6
 Process OMPI jobid: [26464,1] Process rank: 7
 Process OMPI jobid: [26464,1] Process rank: 8
 Process OMPI jobid: [26464,1] Process rank: 9
 Process OMPI jobid: [26464,1] Process rank: 10
 Process OMPI jobid: [26464,1] Process rank: 11
 Process OMPI jobid: [26464,1] Process rank: 12
 Process OMPI jobid: [26464,1] Process rank: 13
 Process OMPI jobid: [26464,1] Process rank: 14
 Process OMPI jobid: [26464,1] Process rank: 15

Data for node: Name: lob084 Num procs: 16
 Process OMPI jobid: [26464,1] Process rank: 16
 Process OMPI jobid: [26464,1] Process rank: 17
 Process OMPI jobid: [26464,1] Process rank: 18
 Process OMPI jobid: [26464,1] Process rank: 19
 Process OMPI jobid: [26464,1] Process rank: 20
 Process OMPI jobid: [26464,1] Process rank: 21
 Process OMPI jobid: [26464,1] Process rank: 22
 Process OMPI jobid: [26464,1] Process rank: 23
 Process OMPI jobid: [26464,1] Process rank: 24
 Process OMPI jobid: [26464,1] Process rank: 25
 Process OMPI jobid: [26464,1] Process rank: 26
 Process OMPI jobid: [26464,1] Process rank: 27
 Process OMPI jobid: [26464,1] Process rank: 28
 Process OMPI jobid: [26464,1] Process rank: 29
 Process OMPI jobid: [26464,1] Process rank: 30
 Process OMPI jobid: [26464,1] Process rank: 31

SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I'll dig a bit further.

Sam

On Feb 2, 2011, at 9:53 AM, Samuel K. Gutierrez wrote: Hi, We'll try to reproduce the problem. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote: On 28/01/2011, at 8:16 PM, Michael Curtis wrote: On 27/01/2011, at 4:51 PM, Michael Curtis wrote: Some more debugging information: Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant to deploy this as it would require a recompile of some rather large pieces of software. Should I re-post to the -devel lists? Regards,
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, We'll try to reproduce the problem. Thanks, -- Samuel K. Gutierrez Los Alamos National Laboratory On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote: On 28/01/2011, at 8:16 PM, Michael Curtis wrote: On 27/01/2011, at 4:51 PM, Michael Curtis wrote: Some more debugging information: Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant to deploy this as it would require a recompile of some rather large pieces of software. Should I re-post to the -devel lists? Regards,
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 28/01/2011, at 8:16 PM, Michael Curtis wrote: > > On 27/01/2011, at 4:51 PM, Michael Curtis wrote: > > Some more debugging information: Is anyone able to help with this problem? As far as I can tell it's a stock-standard recently installed SLURM installation. I can try 1.5.1 but hesitant to deploy this as it would require a recompile of some rather large pieces of software. Should I re-post to the -devel lists? Regards,
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 27/01/2011, at 4:51 PM, Michael Curtis wrote: Some more debugging information:

> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
> JOB MAP

Backtrace with debugging symbols:

#0 0x77bb5c1e in ?? () from /usr/lib/libopen-rte.so.0
#1 0x7792e23f in ?? () from /usr/lib/libopen-pal.so.0
#2 0x77920679 in opal_progress () from /usr/lib/libopen-pal.so.0
#3 0x77bb6e5d in orte_plm_base_daemon_callback () from /usr/lib/libopen-rte.so.0
#4 0x762b67e7 in plm_slurm_launch_job (jdata=) at ../../../../../../orte/mca/plm/slurm/plm_slurm_module.c:360
#5 0x004041c8 in orterun (argc=4, argv=0x7fffe7d8) at ../../../../../orte/tools/orterun/orterun.c:754
#6 0x00403234 in main (argc=4, argv=0x7fffe7d8) at ../../../../../orte/tools/orterun/main.c:13

Trace output with -d100 and --enable-trace:

[:10821] progressed_wait: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 459
[:10821] defining message event: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 423

I'm guessing from this that it's crashing in the event loop, maybe at:

static void process_orted_launch_report(int fd, short event, void *data)

strace:

poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 1000) = 1 ([{fd=13, revents=POLLIN}])
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\4\0\0\0\232"..., 36}], 1) = 36
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\n\0\0\0\1\0\0\0u1390"..., 154}], 1) = 154
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 0) = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

OK, I matched the disassemblies and confirmed that the crash originates in process_orted_launch_report, and therefore matched up the source code line with where gdb reckons the program counter was at that point:

/* update state */
pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;

Hopefully all this information helps a little!
[OMPI users] Segmentation fault with SLURM and non-local nodes
Hi, I'm not sure whether this problem is with SLURM or OpenMPI, but the stack traces (below) point to an issue within OpenMPI. Whenever I try to launch an MPI job within SLURM, mpirun immediately segmentation faults -- but only if the machine that SLURM allocated to MPI is different from the one on which I launched the MPI job. However, if I force SLURM to allocate only the local node (ie, the one on which salloc was called), everything works fine.

Failing case:

michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi

JOB MAP

Data for node: Name: ipc4 Num procs: 8
 Process OMPI jobid: [21326,1] Process rank: 0
 Process OMPI jobid: [21326,1] Process rank: 1
 Process OMPI jobid: [21326,1] Process rank: 2
 Process OMPI jobid: [21326,1] Process rank: 3
 Process OMPI jobid: [21326,1] Process rank: 4
 Process OMPI jobid: [21326,1] Process rank: 5
 Process OMPI jobid: [21326,1] Process rank: 6
 Process OMPI jobid: [21326,1] Process rank: 7

=

[ipc:16986] *** Process received signal ***
[ipc:16986] Signal: Segmentation fault (11)
[ipc:16986] Signal code: Address not mapped (1)
[ipc:16986] Failing at address: 0x801328268
[ipc:16986] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7ff85c7638f0]
[ipc:16986] [ 1] /usr/lib/libopen-rte.so.0(+0x3459a) [0x7ff85d4a059a]
[ipc:16986] [ 2] /usr/lib/libopen-pal.so.0(+0x1eeb8) [0x7ff85d233eb8]
[ipc:16986] [ 3] /usr/lib/libopen-pal.so.0(opal_progress+0x99) [0x7ff85d228439]
[ipc:16986] [ 4] /usr/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x9d) [0x7ff85d4a002d]
[ipc:16986] [ 5] /usr/lib/openmpi/lib/openmpi/mca_plm_slurm.so(+0x211a) [0x7ff85bbc311a]
[ipc:16986] [ 6] mpirun() [0x403c1f]
[ipc:16986] [ 7] mpirun() [0x403014]
[ipc:16986] [ 8] /lib/libc.so.6(__libc_start_main+0xfd) [0x7ff85c3efc4d]
[ipc:16986] [ 9] mpirun() [0x402f39]
[ipc:16986] *** End of error message ***

Non-failing case:

michael@eng-ipc4 ~ $ salloc -n8 -w ipc4 mpirun --display-map ./mpi

JOB MAP

Data for node: Name: eng-ipc4.FQDN Num procs: 8
 Process OMPI jobid: [12467,1] Process rank: 0
 Process OMPI jobid: [12467,1] Process rank: 1
 Process OMPI jobid: [12467,1] Process rank: 2
 Process OMPI jobid: [12467,1] Process rank: 3
 Process OMPI jobid: [12467,1] Process rank: 4
 Process OMPI jobid: [12467,1] Process rank: 5
 Process OMPI jobid: [12467,1] Process rank: 6
 Process OMPI jobid: [12467,1] Process rank: 7

=

Process 1 on eng-ipc4.FQDN out of 8
Process 3 on eng-ipc4.FQDN out of 8
Process 4 on eng-ipc4.FQDN out of 8
Process 6 on eng-ipc4.FQDN out of 8
Process 7 on eng-ipc4.FQDN out of 8
Process 0 on eng-ipc4.FQDN out of 8
Process 2 on eng-ipc4.FQDN out of 8
Process 5 on eng-ipc4.FQDN out of 8

Using mpi directly is fine: eg, mpirun -H 'ipc3,ipc4' -np 8 ./mpi works as expected.

This is a (small) homogeneous cluster, all Xeon class machines with plenty of RAM and a shared filesystem over NFS, running 64-bit Ubuntu server. I was running stock OpenMPI (1.4.1) and SLURM (2.1.1); I have since upgraded to the latest stable OpenMPI (1.4.3) and SLURM (2.2.0), with no effect. (The newer binaries were compiled from the respective upstream Debian packages.) strace (not shown) shows that the job is launched via srun and a connection is received back from the child process over TCP/IP. Soon after this, mpirun crashes. Nodes communicate over a semi-dedicated TCP/IP GigE connection. Is this a known bug? What is going wrong?

Regards, Michael Curtis