See below
On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>
>> Hi Michael,
>>
>> You may have tried to send some debug information to the list, but it
>> appears to have been blocked. Compressed text output of the backtrace text
>> is sufficient.
>
> Odd, I thought I sent it to you directly. In any case, here is the backtrace
> and some information from gdb:
>
> $ salloc -n16 gdb -args mpirun mpi
> (gdb) run
> Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
> [Thread debugging using libthread_db enabled]
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) bt
> #0  0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> #1  0x00007ffff78a7338 in event_process_active (base=0x615240) at event.c:651
> #2  0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
> #3  0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
> #4  0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
> #5  0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
> #6  0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
> #7  0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at orterun.c:754
> #8  0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
> (gdb) print pdatorted
> $1 = (orte_proc_t **) 0x67c610
> (gdb) print mev
> $2 = (orte_message_event_t *) 0x681550
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813,
> obj_class = 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 0x7ffff7bb9a78 "base/plm_base_launch_support.c",
> cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening... You could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

> That vpid looks suspiciously like -1.
>
> Further debugging:
> Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffffffe170, buffer=0x7ffff7b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
> 411         {
> (gdb) print sender
> $2 = (orte_process_name_t *) 0x7fffffffe170
> (gdb) print *sender
> $3 = {jobid = 6822016, vpid = 0}
> (gdb) continue
> Continuing.
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender
> $4 = {jobid = 1778450432, vpid = 4294967295}
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users