Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-09 Thread Samuel K. Gutierrez

On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:

I would personally suggest not reconfiguring your system simply to  
support a particular version of OMPI. The only difference between  
the 1.4 and 1.5 series wrt slurm is that we changed a few things to  
support a more recent version of slurm. It is relatively easy to  
backport that code to the 1.4 series, and it should be (mostly)  
backward compatible.


OMPI is agnostic wrt resource managers. We try to support all  
platforms, with our effort reflective of the needs of our developers  
and their organizations, and our perception of the relative size of  
the user community for a particular platform. Slurm is a fairly  
small community, mostly centered in the three DOE weapons labs, so  
our support for that platform tends to focus on their usage.


So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?


Open MPI 1.5.1 works as expected on our TLCC machines.  Open MPI 1.4.3
with your SLURM update was also tested.




I have created a ticket to upgrade the 1.4.4 release (due out any  
time now) with the 1.5.1 slurm support. Any interested parties can  
follow it here:


Thanks Ralph!

Sam



https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph


On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:



On 09/02/2011, at 9:16 AM, Ralph Castain wrote:


See below


On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:



On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:


Hi Michael,

You may have tried to send some debug information to the list,  
but it appears to have been blocked.  Compressed text output of  
the backtrace text is sufficient.



Odd, I thought I sent it to you directly.  In any case, here is  
the backtrace and some information from gdb:


$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/ 
michael/home/ServerAdmin/mpi

[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1,  
opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342	pdatorted[mev->sender.vpid]->state =  
ORTE_PROC_STATE_RUNNING;

(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1,  
opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at  
event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240,  
flags=1) at event.c:823

#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/ 
opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback  
(num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560)  
at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8)  
at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at  
main.c:13

(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class =  
0x77dd4f40, obj_reference_count = 1, cls_init_file_name =  
0x77bb9a78 "base/plm_base_launch_support.c",
cls_init_lineno = 423}, ev = 0x680850, sender = {jobid =  
1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file  
= 0x680640 "rml_oob_component.c", line = 279}


The jobid and vpid look like the defined INVALID values,  
indicating that something is quite wrong. This would quite likely  
lead to the segfault.


From this, it would indeed appear that you are getting some kind  
of library confusion - the most likely cause of such an error is  
a daemon from a different version trying to respond, and so the  
returned message isn't correct.


Not sure why else it would be happening...you could try setting - 
mca plm_base_verbose 5 to get more debug output displayed on your  
screen, assuming you built OMPI with --enable-debug.




Found the problem.  It is a site configuration issue, which I'll
need to find a workaround for.


[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component  
[slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component  
[slurm]

[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523  
nodename hash 1936089714

[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job  
[31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job  
[31383,1]
[bio-ipc.{FQDN}:27523] 

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
I would personally suggest not reconfiguring your system simply to support a 
particular version of OMPI. The only difference between the 1.4 and 1.5 series 
wrt slurm is that we changed a few things to support a more recent version of 
slurm. It is relatively easy to backport that code to the 1.4 series, and it 
should be (mostly) backward compatible.

OMPI is agnostic wrt resource managers. We try to support all platforms, with 
our effort reflective of the needs of our developers and their organizations, 
and our perception of the relative size of the user community for a particular 
platform. Slurm is a fairly small community, mostly centered in the three DOE 
weapons labs, so our support for that platform tends to focus on their usage.

So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?

I have created a ticket to upgrade the 1.4.4 release (due out any time now) 
with the 1.5.1 slurm support. Any interested parties can follow it here:

https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph


On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:

> 
> On 09/02/2011, at 9:16 AM, Ralph Castain wrote:
> 
>> See below
>> 
>> 
>> On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
>> 
>>> 
>>> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>>> 
>>>> Hi Michael,
>>>> 
>>>> You may have tried to send some debug information to the list, but it 
>>>> appears to have been blocked.  Compressed text output of the backtrace 
>>>> text is sufficient.
>>> 
>>> 
>>> Odd, I thought I sent it to you directly.  In any case, here is the 
>>> backtrace and some information from gdb:
>>> 
>>> $ salloc -n16 gdb -args mpirun mpi
>>> (gdb) run
>>> Starting program: /mnt/f1/michael/openmpi/bin/mpirun 
>>> /mnt/f1/michael/home/ServerAdmin/mpi
>>> [Thread debugging using libthread_db enabled]
>>> 
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>>> data=0x681170) at base/plm_base_launch_support.c:342
>>> 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
>>> (gdb) bt
>>> #0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>>> data=0x681170) at base/plm_base_launch_support.c:342
>>> #1  0x778a7338 in event_process_active (base=0x615240) at 
>>> event.c:651
>>> #2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at 
>>> event.c:823
>>> #3  0x778a756f in opal_event_loop (flags=1) at event.c:730
>>> #4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
>>> #5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at 
>>> base/plm_base_launch_support.c:459
>>> #6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at 
>>> plm_slurm_module.c:360
>>> #7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at 
>>> orterun.c:754
>>> #8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
>>> (gdb) print pdatorted
>>> $1 = (orte_proc_t **) 0x67c610
>>> (gdb) print mev
>>> $2 = (orte_message_event_t *) 0x681550
>>> (gdb) print mev->sender.vpid
>>> $3 = 4294967295
>>> (gdb) print mev->sender
>>> $4 = {jobid = 1721696256, vpid = 4294967295}
>>> (gdb) print *mev
>>> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 
>>> 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 
>>> 0x77bb9a78 "base/plm_base_launch_support.c", 
>>> cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 
>>> 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 
>>> "rml_oob_component.c", line = 279}
>> 
>> The jobid and vpid look like the defined INVALID values, indicating that 
>> something is quite wrong. This would quite likely lead to the segfault.
>> 
>> From this, it would indeed appear that you are getting some kind of library 
>> confusion - the most likely cause of such an error is a daemon from a 
>> different version trying to respond, and so the returned message isn't 
>> correct.
>> 
>> Not sure why else it would be happening...you could try setting -mca 
>> plm_base_verbose 5 to get more debug output displayed on your screen, 
>> assuming you built OMPI with --enable-debug.
>> 
> 
> Found the problem.  It is a site configuration issue, which I'll need to 
> find a workaround for.
> 
> [bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component [slurm] set 
> priority to 75
> [bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component [slurm]
> [bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
> [bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename 
> hash 1936089714
> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
> [bio-ipc.{FQDN}:27523] [[31383,0],0] 

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis

On 09/02/2011, at 9:16 AM, Ralph Castain wrote:

> See below
> 
> 
> On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
> 
>> 
>> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>> 
>>> Hi Michael,
>>> 
>>> You may have tried to send some debug information to the list, but it 
>>> appears to have been blocked.  Compressed text output of the backtrace text 
>>> is sufficient.
>> 
>> 
>> Odd, I thought I sent it to you directly.  In any case, here is the 
>> backtrace and some information from gdb:
>> 
>> $ salloc -n16 gdb -args mpirun mpi
>> (gdb) run
>> Starting program: /mnt/f1/michael/openmpi/bin/mpirun 
>> /mnt/f1/michael/home/ServerAdmin/mpi
>> [Thread debugging using libthread_db enabled]
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>> data=0x681170) at base/plm_base_launch_support.c:342
>> 342  pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
>> (gdb) bt
>> #0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
>> data=0x681170) at base/plm_base_launch_support.c:342
>> #1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
>> #2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at 
>> event.c:823
>> #3  0x778a756f in opal_event_loop (flags=1) at event.c:730
>> #4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
>> #5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at 
>> base/plm_base_launch_support.c:459
>> #6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at 
>> plm_slurm_module.c:360
>> #7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at 
>> orterun.c:754
>> #8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
>> (gdb) print pdatorted
>> $1 = (orte_proc_t **) 0x67c610
>> (gdb) print mev
>> $2 = (orte_message_event_t *) 0x681550
>> (gdb) print mev->sender.vpid
>> $3 = 4294967295
>> (gdb) print mev->sender
>> $4 = {jobid = 1721696256, vpid = 4294967295}
>> (gdb) print *mev
>> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 
>> 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 
>> "base/plm_base_launch_support.c", 
>>  cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 
>> 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 
>> "rml_oob_component.c", line = 279}
> 
> The jobid and vpid look like the defined INVALID values, indicating that 
> something is quite wrong. This would quite likely lead to the segfault.
> 
> From this, it would indeed appear that you are getting some kind of library 
> confusion - the most likely cause of such an error is a daemon from a 
> different version trying to respond, and so the returned message isn't 
> correct.
> 
> Not sure why else it would be happening...you could try setting -mca 
> plm_base_verbose 5 to get more debug output displayed on your screen, 
> assuming you built OMPI with --enable-debug.
> 

Found the problem.  It is a site configuration issue, which I'll need to find 
a workaround for.

[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Query of component [slurm] set 
priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:(  plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 
1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca 
ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 2 --hnp-uri 
"2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca 
plm_base_verbose 20
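
A quick cross-check (illustrative C, not Open MPI source): the orte_ess_jobid in that srun line is consistent with the "final jobfam 31383" printed above, assuming ORTE packs the 16-bit job family into the upper half of the 32-bit jobid.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t jobfam = 31383;               /* from "plm:base:set_hnp_name: final jobfam 31383" */
    uint32_t daemon_jobid = jobfam << 16;  /* local job 0, i.e. the daemon job [31383,0] */
    printf("%u\n", daemon_jobid);          /* prints 2056716288, matching -mca orte_ess_jobid */
    return 0;
}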

I then inserted some printf's into the ess_slurm_module (rough and ready, I 
know, but I was in a hurry).

Just after initialisation: (at around line 345)
orte_ess_slurm: jobid 2056716288 vpid 1
So it gets that...
I narrowed it down to the get_slurm_nodename function, as the method didn't 
proceed past that point.

line 401:
tmp = strdup(orte_process_info.nodename);
printf( "Our node name == %s\n", tmp );
line 409:
for (i=0; NULL !=  names[i]; i++) {
  printf( "Checking %s\n", names[ i ]);

Result:
Our node name == eng-ipc3.{FQDN}
Checking ipc3

So it's down to the mismatch of the slurm name and the hostname.  slurm really 
encourages you not to use the fully qualified hostname, and I'd prefer not to 
have to reconfigure the 
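
A minimal sketch of the mismatch (hypothetical code and a stand-in domain, not the actual get_slurm_nodename source): SLURM hands back the short names from its nodelist ("ipc3"), while orte_process_info.nodename holds the machine's own name ("eng-ipc3.{FQDN}"), so neither an exact comparison nor one truncated at the first dot can match -- the names also differ by the "eng-" prefix, which is why this is a site-configuration problem rather than a one-line code fix.

#include <stdio.h>
#include <string.h>

/* Hypothetical helper; "example.com" stands in for the anonymised {FQDN}. */
static int node_name_matches(const char *slurm_name, const char *local_name)
{
    size_t host_len = strcspn(local_name, ".");   /* length of the part before the first dot */
    if (0 == strcmp(slurm_name, local_name)) {
        return 1;                                 /* exact match */
    }
    return strlen(slurm_name) == host_len &&
           0 == strncmp(slurm_name, local_name, host_len);   /* short-name match */
}

int main(void)
{
    /* prints 0: "ipc3" matches neither "eng-ipc3.example.com" nor "eng-ipc3" */
    printf("%d\n", node_name_matches("ipc3", "eng-ipc3.example.com"));
    return 0;
}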

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
See below


On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> 
> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
> 
>> Hi Michael,
>> 
>> You may have tried to send some debug information to the list, but it 
>> appears to have been blocked.  Compressed text output of the backtrace text 
>> is sufficient.
> 
> 
> Odd, I thought I sent it to you directly.  In any case, here is the backtrace 
> and some information from gdb:
> 
> $ salloc -n16 gdb -args mpirun mpi
> (gdb) run
> Starting program: /mnt/f1/michael/openmpi/bin/mpirun 
> /mnt/f1/michael/home/ServerAdmin/mpi
> [Thread debugging using libthread_db enabled]
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
> data=0x681170) at base/plm_base_launch_support.c:342
> 342   pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) bt
> #0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
> data=0x681170) at base/plm_base_launch_support.c:342
> #1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
> #2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at 
> event.c:823
> #3  0x778a756f in opal_event_loop (flags=1) at event.c:730
> #4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
> #5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at 
> base/plm_base_launch_support.c:459
> #6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at 
> plm_slurm_module.c:360
> #7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at 
> orterun.c:754
> #8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
> (gdb) print pdatorted
> $1 = (orte_proc_t **) 0x67c610
> (gdb) print mev
> $2 = (orte_message_event_t *) 0x681550
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 
> 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 
> "base/plm_base_launch_support.c", 
>   cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 
> 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 
> "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that 
something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library 
confusion - the most likely cause of such an error is a daemon from a 
different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening...you could try setting -mca 
plm_base_verbose 5 to get more debug output displayed on your screen, assuming 
you built OMPI with --enable-debug.


> 
> That vpid looks suspiciously like -1.
> 
> Further debugging:
> Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, 
> buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at 
> base/plm_base_launch_support.c:411
> 411   {
> (gdb) print sender
> $2 = (orte_process_name_t *) 0x7fffe170
> (gdb) print *sender
> $3 = {jobid = 6822016, vpid = 0}
> (gdb) continue
> Continuing.
> --
> A daemon (pid unknown) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
> data=0x681550) at base/plm_base_launch_support.c:342
> 342   pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender
> $4 = {jobid = 1778450432, vpid = 4294967295}
> 
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it 
won't care how long it takes to initialize. I'm guessing this is another 
indicator of a library issue.


> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis

On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
> 
> You may have tried to send some debug information to the list, but it appears 
> to have been blocked.  Compressed text output of the backtrace text is 
> sufficient.


Odd, I thought I sent it to you directly.  In any case, here is the backtrace 
and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun 
/mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
data=0x681170) at base/plm_base_launch_support.c:342
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at 
event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at 
base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at 
plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, 
obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 
"base/plm_base_launch_support.c", 
   cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 
4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 
"rml_oob_component.c", line = 279}

That vpid looks suspiciously like -1.
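
A tiny illustration (not Open MPI source) of why that value is fatal: the vpid is an unsigned 32-bit field, so 4294967295 is (uint32_t)-1 -- the all-ones pattern typically used as an "invalid vpid" sentinel -- and using it as an index into pdatorted reaches tens of gigabytes past the array on a 64-bit build, consistent with the "Address not mapped" segfault.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vpid = (uint32_t)-1;
    printf("vpid = %u\n", vpid);                  /* 4294967295, as printed by gdb */
    printf("offset = %llu bytes\n",               /* ~32 GiB past pdatorted on LP64 */
           (unsigned long long)vpid * sizeof(void *));
    return 0;
}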

Further debugging:
Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, 
buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at 
base/plm_base_launch_support.c:411
411 {
(gdb) print sender
$2 = (orte_process_name_t *) 0x7fffe170
(gdb) print *sender
$3 = {jobid = 6822016, vpid = 0}
(gdb) continue
Continuing.
--
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, 
data=0x681550) at base/plm_base_launch_support.c:342
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) print mev->sender
$4 = {jobid = 1778450432, vpid = 4294967295}

The daemon probably died as I spent too long thinking about my gdb input ;)





Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Michael Curtis

On 09/02/2011, at 2:38 AM, Ralph Castain wrote:

> Another possibility to check - are you sure you are getting the same OMPI 
> version on the backend nodes? When I see it work on local node, but fail 
> multi-node, the most common problem is that you are picking up a different 
> OMPI version due to path differences on the backend nodes.

It's installed as a system package, and the software set on all machines is 
managed by a configuration tool, so the machines should be identical.  However, 
it may be worth checking the dependency versions and I'll double check that the 
OMPI versions really do match.





Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
Another possibility to check - are you sure you are getting the same OMPI 
version on the backend nodes? When I see it work on local node, but fail 
multi-node, the most common problem is that you are picking up a different OMPI 
version due to path differences on the backend nodes.


On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
> 
> You may have tried to send some debug information to the list, but it appears 
> to have been blocked.  Compressed text output of the backtrace text is 
> sufficient.
> 
> Thanks,
> 
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> 
> On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
> 
>> Hi,
>> 
>> A detailed backtrace from a core dump may help us debug this.  Would you be 
>> willing to provide that information for us?
>> 
>> Thanks,
>> 
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>> 
>> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>> 
>>> 
>>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>> 
>>> Hi,
>>> 
>>>> I just tried to reproduce the problem that you are experiencing and was 
>>>> unable to.
>>>> 
>>>> SLURM 2.1.15
>>>> Open MPI 1.4.3 configured with: 
>>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>> 
>>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same 
>>> platform file (the only change was to re-enable btl-tcp).
>>> 
>>> Unfortunately, the result is the same:
>>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>> salloc: Granted job allocation 145
>>> 
>>>    JOB MAP   
>>> 
>>> Data for node: Name: eng-ipc4.{FQDN}Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 0
>>> Process OMPI jobid: [6932,1] Process rank: 1
>>> Process OMPI jobid: [6932,1] Process rank: 2
>>> Process OMPI jobid: [6932,1] Process rank: 3
>>> Process OMPI jobid: [6932,1] Process rank: 4
>>> Process OMPI jobid: [6932,1] Process rank: 5
>>> Process OMPI jobid: [6932,1] Process rank: 6
>>> Process OMPI jobid: [6932,1] Process rank: 7
>>> 
>>> Data for node: Name: ipc3   Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 8
>>> Process OMPI jobid: [6932,1] Process rank: 9
>>> Process OMPI jobid: [6932,1] Process rank: 10
>>> Process OMPI jobid: [6932,1] Process rank: 11
>>> Process OMPI jobid: [6932,1] Process rank: 12
>>> Process OMPI jobid: [6932,1] Process rank: 13
>>> Process OMPI jobid: [6932,1] Process rank: 14
>>> Process OMPI jobid: [6932,1] Process rank: 15
>>> 
>>> =
>>> [eng-ipc4:31754] *** Process received signal ***
>>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) 
>>> [0x7f81cf262869]
>>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) 
>>> [0x7f81cef93338]
>>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) 
>>> [0x7f81cef9397e]
>>> [eng-ipc4:31754] [ 4] 
>>> ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) 
>>> [0x7f81cef87916]
>>> [eng-ipc4:31754] [ 6] 
>>> ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) 
>>> [0x7f81cf262e20]
>>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) 
>>> [0x7f81cf267ed7]
>>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) 
>>> [0x7f81ce14bc4d]
>>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>>> [eng-ipc4:31754] *** End of error message ***
>>> salloc: Relinquishing job allocation 145
>>> salloc: Job allocation 145 has been revoked.
>>> zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map 
>>> ~/ServerAdmin/mpi
>>> 
>>> I've anonymised the paths and domain, otherwise pasted verbatim.  The only 
>>> odd thing I notice is that the launching machine uses its full domain name, 
>>> whereas the other machine is referred to by the short name.  Despite the 
>>> FQDN, the domain does not exist in the DNS (for historical reasons), but 
>>> does exist in the /etc/hosts file.
>>> 
>>> Any further clues would be appreciated.  In case it may be relevant, core 
>>> system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One other point 
>>> of difference may be that our environment is tcp (ethernet) based whereas 
>>> the LANL test environment is not?
>>> 
>>> Michael
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> 

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Samuel K. Gutierrez

Hi Michael,

You may have tried to send some debug information to the list, but it  
appears to have been blocked.  Compressed text output of the backtrace  
text is sufficient.


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:


Hi,

A detailed backtrace from a core dump may help us debug this.  Would  
you be willing to provide that information for us?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:



On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing  
and was unable to.


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ 
lanl/tlcc/debug-nopanasas


I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the  
same platform file (the only change was to re-enable btl-tcp).


Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

   JOB MAP   

Data for node: Name: eng-ipc4.{FQDN}Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869)  
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338)  
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e)  
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 
0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so. 
0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 
0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7)  
[0x7f81cf267ed7]

[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd)  
[0x7f81ce14bc4d]

[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ 
ServerAdmin/mpi


I've anonymised the paths and domain, otherwise pasted verbatim.   
The only odd thing I notice is that the launching machine uses its  
full domain name, whereas the other machine is referred to by the  
short name.  Despite the FQDN, the domain does not exist in the DNS  
(for historical reasons), but does exist in the /etc/hosts file.


Any further clues would be appreciated.  In case it may be  
relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel  
2.6.32.  One other point of difference may be that our environment  
is tcp (ethernet) based whereas the LANL test environment is not?


Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Samuel K. Gutierrez

Hi,

A detailed backtrace from a core dump may help us debug this.  Would  
you be willing to provide that information for us?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:



On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing and  
was unable to.


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ 
lanl/tlcc/debug-nopanasas


I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same  
platform file (the only change was to re-enable btl-tcp).


Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

   JOB MAP   

Data for node: Name: eng-ipc4.{FQDN}Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869)  
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338)  
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e)  
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so. 
0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress 
+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so. 
0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7)  
[0x7f81cf267ed7]

[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd)  
[0x7f81ce14bc4d]

[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ 
ServerAdmin/mpi


I've anonymised the paths and domain, otherwise pasted verbatim.   
The only odd thing I notice is that the launching machine uses its  
full domain name, whereas the other machine is referred to by the  
short name.  Despite the FQDN, the domain does not exist in the DNS  
(for historical reasons), but does exist in the /etc/hosts file.


Any further clues would be appreciated.  In case it may be relevant,  
core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One  
other point of difference may be that our environment is tcp  
(ethernet) based whereas the LANL test environment is not?


Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Ralph Castain
The 1.4 series is regularly tested on slurm machines after every modification, 
and has been running at LANL (and other slurm installations) for quite some 
time, so I doubt that's the core issue. Likewise, nothing in the system depends 
upon the FQDN (or anything regarding hostname) - it's just used to print 
diagnostics.

Not sure of the issue, and I don't have an ability to test/debug slurm any 
more, so I'll have to let Sam continue to look into this for you. It's probably 
some trivial difference in setup, unfortunately. I don't know if you said 
before, but it might help to know what slurm version you are using. Slurm tends 
to change a lot between versions (even minor releases), and it is one of the 
more finicky platforms we support.


On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

> 
> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
> 
>> 
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>> 
>> Hi,
>> 
>>> I just tried to reproduce the problem that you are experiencing and was 
>>> unable to.
>>> 
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with: 
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>> 
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same 
>> platform file (the only change was to re-enable btl-tcp).
>> 
>> Unfortunately, the result is the same:
> 
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1.  This works 
> fine:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
> 
>    JOB MAP   
> 
> Data for node: ipc3   Num procs: 8
>   Process OMPI jobid: [3365,1] Process rank: 0
>   Process OMPI jobid: [3365,1] Process rank: 1
>   Process OMPI jobid: [3365,1] Process rank: 2
>   Process OMPI jobid: [3365,1] Process rank: 3
>   Process OMPI jobid: [3365,1] Process rank: 4
>   Process OMPI jobid: [3365,1] Process rank: 5
>   Process OMPI jobid: [3365,1] Process rank: 6
>   Process OMPI jobid: [3365,1] Process rank: 7
> 
> Data for node: ipc4   Num procs: 8
>   Process OMPI jobid: [3365,1] Process rank: 8
>   Process OMPI jobid: [3365,1] Process rank: 9
>   Process OMPI jobid: [3365,1] Process rank: 10
>   Process OMPI jobid: [3365,1] Process rank: 11
>   Process OMPI jobid: [3365,1] Process rank: 12
>   Process OMPI jobid: [3365,1] Process rank: 13
>   Process OMPI jobid: [3365,1] Process rank: 14
>   Process OMPI jobid: [3365,1] Process rank: 15
> 
> =
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
> salloc: Relinquishing job allocation 151
> 
> It does seem very much like there is a bug of some sort in 1.4.3?
> 
> Michael
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis

On 07/02/2011, at 12:36 PM, Michael Curtis wrote:

> 
> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
> 
> Hi,
> 
>> I just tried to reproduce the problem that you are experiencing and was 
>> unable to.
>> 
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with: 
>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
> 
> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform 
> file (the only change was to re-enable btl-tcp).
> 
> Unfortunately, the result is the same:

To reply to my own post again (sorry!), I tried OpenMPI 1.5.1.  This works fine:
salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
salloc: Granted job allocation 151

    JOB MAP   

 Data for node: ipc3Num procs: 8
Process OMPI jobid: [3365,1] Process rank: 0
Process OMPI jobid: [3365,1] Process rank: 1
Process OMPI jobid: [3365,1] Process rank: 2
Process OMPI jobid: [3365,1] Process rank: 3
Process OMPI jobid: [3365,1] Process rank: 4
Process OMPI jobid: [3365,1] Process rank: 5
Process OMPI jobid: [3365,1] Process rank: 6
Process OMPI jobid: [3365,1] Process rank: 7

 Data for node: ipc4Num procs: 8
Process OMPI jobid: [3365,1] Process rank: 8
Process OMPI jobid: [3365,1] Process rank: 9
Process OMPI jobid: [3365,1] Process rank: 10
Process OMPI jobid: [3365,1] Process rank: 11
Process OMPI jobid: [3365,1] Process rank: 12
Process OMPI jobid: [3365,1] Process rank: 13
Process OMPI jobid: [3365,1] Process rank: 14
Process OMPI jobid: [3365,1] Process rank: 15

 =
Process 2 on eng-ipc3.{FQDN} out of 16
Process 4 on eng-ipc3.{FQDN} out of 16
Process 5 on eng-ipc3.{FQDN} out of 16
Process 0 on eng-ipc3.{FQDN} out of 16
Process 1 on eng-ipc3.{FQDN} out of 16
Process 6 on eng-ipc3.{FQDN} out of 16
Process 3 on eng-ipc3.{FQDN} out of 16
Process 7 on eng-ipc3.{FQDN} out of 16
Process 8 on eng-ipc4.{FQDN} out of 16
Process 11 on eng-ipc4.{FQDN} out of 16
Process 12 on eng-ipc4.{FQDN} out of 16
Process 14 on eng-ipc4.{FQDN} out of 16
Process 15 on eng-ipc4.{FQDN} out of 16
Process 10 on eng-ipc4.{FQDN} out of 16
Process 9 on eng-ipc4.{FQDN} out of 16
Process 13 on eng-ipc4.{FQDN} out of 16
salloc: Relinquishing job allocation 151

It does seem very much like there is a bug of some sort in 1.4.3?

Michael




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis

On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

> I just tried to reproduce the problem that you are experiencing and was 
> unable to.
> 
> SLURM 2.1.15
> Open MPI 1.4.3 configured with: 
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform 
file (the only change was to re-enable btl-tcp).

Unfortunately, the result is the same:
salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

    JOB MAP   

 Data for node: Name: eng-ipc4.{FQDN}   Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 0
Process OMPI jobid: [6932,1] Process rank: 1
Process OMPI jobid: [6932,1] Process rank: 2
Process OMPI jobid: [6932,1] Process rank: 3
Process OMPI jobid: [6932,1] Process rank: 4
Process OMPI jobid: [6932,1] Process rank: 5
Process OMPI jobid: [6932,1] Process rank: 6
Process OMPI jobid: [6932,1] Process rank: 7

 Data for node: Name: ipc3  Num procs: 8
Process OMPI jobid: [6932,1] Process rank: 8
Process OMPI jobid: [6932,1] Process rank: 9
Process OMPI jobid: [6932,1] Process rank: 10
Process OMPI jobid: [6932,1] Process rank: 11
Process OMPI jobid: [6932,1] Process rank: 12
Process OMPI jobid: [6932,1] Process rank: 13
Process OMPI jobid: [6932,1] Process rank: 14
Process OMPI jobid: [6932,1] Process rank: 15

 =
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) 
[0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) 
[0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) 
[0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) 
[0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) 
[0x7f81cef87916]
[eng-ipc4:31754] [ 6] 
~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) 
[0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) 
[0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map 
~/ServerAdmin/mpi

I've anonymised the paths and domain, otherwise pasted verbatim.  The only odd 
thing I notice is that the launching machine uses its full domain name, whereas 
the other machine is referred to by the short name.  Despite the FQDN, the 
domain does not exist in the DNS (for historical reasons), but does exist in 
the /etc/hosts file.  
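
One quick way to see which name each machine will report (assuming, as is typical, that the runtime picks up its node name via gethostname(2)) is a trivial check like the one below; run it on the launch node and on a compute node and compare the output with the names in the SLURM nodelist.

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

#ifndef HOST_NAME_MAX
#define HOST_NAME_MAX 255
#endif

int main(void)
{
    char name[HOST_NAME_MAX + 1];
    if (0 != gethostname(name, sizeof(name))) {
        perror("gethostname");
        return 1;
    }
    name[sizeof(name) - 1] = '\0';   /* gethostname may not NUL-terminate on truncation */
    printf("gethostname() -> %s\n", name);
    return 0;
}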

Any further clues would be appreciated.  In case it may be relevant, core 
system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One other point of 
difference may be that our environment is tcp (ethernet) based whereas the LANL 
test environment is not?

Michael




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-06 Thread Michael Curtis

On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

> I just tried to reproduce the problem that you are experiencing and was 
> unable to.
> 
> 
> SLURM 2.1.15
> Open MPI 1.4.3 configured with: 
> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
> 
> I'll dig a bit further.

Interesting.  I'll try a local, vanilla (ie, non-debian) build and report back. 
 

Michael




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-03 Thread Samuel K. Gutierrez

Hi,

I just tried to reproduce the problem that you are experiencing and  
was unable to.


[samuel@lo1-fe ~]$ salloc -n32 mpirun --display-map ./mpi_app
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 138319
salloc: job 138319 queued and waiting for resources
salloc: job 138319 has been allocated resources
salloc: Granted job allocation 138319

    JOB MAP   

 Data for node: Name: lob083Num procs: 16
Process OMPI jobid: [26464,1] Process rank: 0
Process OMPI jobid: [26464,1] Process rank: 1
Process OMPI jobid: [26464,1] Process rank: 2
Process OMPI jobid: [26464,1] Process rank: 3
Process OMPI jobid: [26464,1] Process rank: 4
Process OMPI jobid: [26464,1] Process rank: 5
Process OMPI jobid: [26464,1] Process rank: 6
Process OMPI jobid: [26464,1] Process rank: 7
Process OMPI jobid: [26464,1] Process rank: 8
Process OMPI jobid: [26464,1] Process rank: 9
Process OMPI jobid: [26464,1] Process rank: 10
Process OMPI jobid: [26464,1] Process rank: 11
Process OMPI jobid: [26464,1] Process rank: 12
Process OMPI jobid: [26464,1] Process rank: 13
Process OMPI jobid: [26464,1] Process rank: 14
Process OMPI jobid: [26464,1] Process rank: 15

 Data for node: Name: lob084Num procs: 16
Process OMPI jobid: [26464,1] Process rank: 16
Process OMPI jobid: [26464,1] Process rank: 17
Process OMPI jobid: [26464,1] Process rank: 18
Process OMPI jobid: [26464,1] Process rank: 19
Process OMPI jobid: [26464,1] Process rank: 20
Process OMPI jobid: [26464,1] Process rank: 21
Process OMPI jobid: [26464,1] Process rank: 22
Process OMPI jobid: [26464,1] Process rank: 23
Process OMPI jobid: [26464,1] Process rank: 24
Process OMPI jobid: [26464,1] Process rank: 25
Process OMPI jobid: [26464,1] Process rank: 26
Process OMPI jobid: [26464,1] Process rank: 27
Process OMPI jobid: [26464,1] Process rank: 28
Process OMPI jobid: [26464,1] Process rank: 29
Process OMPI jobid: [26464,1] Process rank: 30
Process OMPI jobid: [26464,1] Process rank: 31


SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/ 
lanl/tlcc/debug-nopanasas


I'll dig a bit further.

Sam

On Feb 2, 2011, at 9:53 AM, Samuel K. Gutierrez wrote:


Hi,

We'll try to reproduce the problem.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote:



On 28/01/2011, at 8:16 PM, Michael Curtis wrote:



On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:
Is anyone able to help with this problem?  As far as I can tell  
it's a stock-standard recently installed SLURM installation.


I can try 1.5.1 but hesitant to deploy this as it would require a  
recompile of some rather large pieces of software.  Should I re- 
post to the -devel lists?


Regards,


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-02 Thread Samuel K. Gutierrez

Hi,

We'll try to reproduce the problem.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


On Feb 2, 2011, at 2:55 AM, Michael Curtis wrote:



On 28/01/2011, at 8:16 PM, Michael Curtis wrote:



On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:
Is anyone able to help with this problem?  As far as I can tell it's  
a stock-standard recently installed SLURM installation.


I can try 1.5.1 but hesitant to deploy this as it would require a  
recompile of some rather large pieces of software.  Should I re-post  
to the -devel lists?


Regards,


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-02 Thread Michael Curtis

On 28/01/2011, at 8:16 PM, Michael Curtis wrote:

> 
> On 27/01/2011, at 4:51 PM, Michael Curtis wrote:
> 
> Some more debugging information:
Is anyone able to help with this problem?  As far as I can tell it's a 
stock-standard, recently installed SLURM installation.

I can try 1.5.1, but I'm hesitant to deploy it as it would require a recompile of 
some rather large pieces of software.  Should I re-post to the -devel lists?

Regards,




Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-01-28 Thread Michael Curtis

On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:

> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
>    JOB MAP   

Backtrace with debugging symbols
#0  0x77bb5c1e in ?? () from /usr/lib/libopen-rte.so.0
#1  0x7792e23f in ?? () from /usr/lib/libopen-pal.so.0
#2  0x77920679 in opal_progress () from /usr/lib/libopen-pal.so.0
#3  0x77bb6e5d in orte_plm_base_daemon_callback () from 
/usr/lib/libopen-rte.so.0
#4  0x762b67e7 in plm_slurm_launch_job (jdata=) at 
../../../../../../orte/mca/plm/slurm/plm_slurm_module.c:360
#5  0x004041c8 in orterun (argc=4, argv=0x7fffe7d8) at 
../../../../../orte/tools/orterun/orterun.c:754
#6  0x00403234 in main (argc=4, argv=0x7fffe7d8) at 
../../../../../orte/tools/orterun/main.c:13

Trace output with -d100 and --enable-trace:
[:10821] progressed_wait: 
../../../../../orte/mca/plm/base/plm_base_launch_support.c 459
[:10821] defining message event: 
../../../../../orte/mca/plm/base/plm_base_launch_support.c 423

I'm guessing from this that it's crashing in the event loop, maybe at :
static void process_orted_launch_report(int fd, short event, void *data)

strace:
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 
1000) = 1 ([{fd=13, revents=POLLIN}])
readv(13, 
[{"R\333\0\0\377\377\377\377R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\4\0\0\0\232"...,
 36}], 1) = 36
readv(13, 
[{"R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\n\0\0\0\1\0\0\0u1390"..., 
154}], 1) = 154
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 0) = 
0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---


OK, I matched the disassemblies and confirmed that the crash originates in 
process_orted_launch_report, and therefore matched up the source code line with 
where gdb reckons the program counter was at that point:

/* update state */
pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;

Hopefully all this information helps a little!
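
For completeness, a self-contained sketch of the failure mode and of the kind of range check that would turn it into a clean error rather than a segfault (illustrative only -- the types and the guard are made up, this is neither the Open MPI source nor the eventual fix):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int state; } proc_t;   /* stand-in for orte_proc_t */

int main(void)
{
    uint32_t num_daemons = 2;           /* matches orte_plm_base_daemon_callback (num_daemons=2) */
    proc_t **pdatorted = calloc(num_daemons, sizeof(*pdatorted));
    for (uint32_t i = 0; i < num_daemons; i++) {
        pdatorted[i] = calloc(1, sizeof(proc_t));
    }

    uint32_t sender_vpid = UINT32_MAX;  /* the bogus sender vpid reported in the crash */

    if (sender_vpid >= num_daemons || NULL == pdatorted[sender_vpid]) {
        fprintf(stderr, "bogus sender vpid %u -- refusing to index\n", sender_vpid);
        return 1;                       /* without this check the next line would segfault */
    }
    pdatorted[sender_vpid]->state = 1;  /* stand-in for ORTE_PROC_STATE_RUNNING */
    return 0;
}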





[OMPI users] Segmentation fault with SLURM and non-local nodes

2011-01-27 Thread Michael Curtis
Hi,

I'm not sure whether this problem is with SLURM or OpenMPI, but the stack 
traces (below) point to an issue within OpenMPI.

Whenever I try to launch an MPI job within SLURM, mpirun immediately 
segmentation faults -- but only if the machine that SLURM allocated to MPI is 
different to the one from which I launched the MPI job.

However, if I force SLURM to allocate only the local node (ie, the one on which 
salloc was called), everything works fine.

Failing case:
michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
    JOB MAP   

 Data for node: Name: ipc4  Num procs: 8
Process OMPI jobid: [21326,1] Process rank: 0
Process OMPI jobid: [21326,1] Process rank: 1
Process OMPI jobid: [21326,1] Process rank: 2
Process OMPI jobid: [21326,1] Process rank: 3
Process OMPI jobid: [21326,1] Process rank: 4
Process OMPI jobid: [21326,1] Process rank: 5
Process OMPI jobid: [21326,1] Process rank: 6
Process OMPI jobid: [21326,1] Process rank: 7

 =
[ipc:16986] *** Process received signal ***
[ipc:16986] Signal: Segmentation fault (11)
[ipc:16986] Signal code: Address not mapped (1)
[ipc:16986] Failing at address: 0x801328268
[ipc:16986] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7ff85c7638f0]
[ipc:16986] [ 1] /usr/lib/libopen-rte.so.0(+0x3459a) [0x7ff85d4a059a]
[ipc:16986] [ 2] /usr/lib/libopen-pal.so.0(+0x1eeb8) [0x7ff85d233eb8]
[ipc:16986] [ 3] /usr/lib/libopen-pal.so.0(opal_progress+0x99) [0x7ff85d228439]
[ipc:16986] [ 4] /usr/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x9d) 
[0x7ff85d4a002d]
[ipc:16986] [ 5] /usr/lib/openmpi/lib/openmpi/mca_plm_slurm.so(+0x211a) 
[0x7ff85bbc311a]
[ipc:16986] [ 6] mpirun() [0x403c1f]
[ipc:16986] [ 7] mpirun() [0x403014]
[ipc:16986] [ 8] /lib/libc.so.6(__libc_start_main+0xfd) [0x7ff85c3efc4d]
[ipc:16986] [ 9] mpirun() [0x402f39]
[ipc:16986] *** End of error message ***

Non-failing case:
michael@eng-ipc4 ~ $ salloc -n8 -w ipc4 mpirun --display-map ./mpi
    JOB MAP   

 Data for node: Name: eng-ipc4.FQDN Num procs: 8
Process OMPI jobid: [12467,1] Process rank: 0
Process OMPI jobid: [12467,1] Process rank: 1
Process OMPI jobid: [12467,1] Process rank: 2
Process OMPI jobid: [12467,1] Process rank: 3
Process OMPI jobid: [12467,1] Process rank: 4
Process OMPI jobid: [12467,1] Process rank: 5
Process OMPI jobid: [12467,1] Process rank: 6
Process OMPI jobid: [12467,1] Process rank: 7

 =
Process 1 on eng-ipc4.FQDN out of 8
Process 3 on eng-ipc4.FQDN out of 8
Process 4 on eng-ipc4.FQDN out of 8
Process 6 on eng-ipc4.FQDN out of 8
Process 7 on eng-ipc4.FQDN out of 8
Process 0 on eng-ipc4.FQDN out of 8
Process 2 on eng-ipc4.FQDN out of 8
Process 5 on eng-ipc4.FQDN out of 8

Using mpirun directly is fine:
e.g. mpirun -H 'ipc3,ipc4' -np 8 ./mpi
works as expected.

This is a (small) homogeneous cluster: all Xeon-class machines with plenty of 
RAM and a shared filesystem over NFS, running 64-bit Ubuntu server.  I was 
running stock OpenMPI (1.4.1) and SLURM (2.1.1); I have since upgraded to the 
latest stable OpenMPI (1.4.3) and SLURM (2.2.0), with no effect (the newer 
binaries were compiled from the respective upstream Debian packages).

strace (not shown) shows that the job is launched via srun and a connection is 
received back from the child process over TCP/IP. Soon after this, mpirun 
crashes. Nodes communicate over a semi-dedicated TCP/IP GigE connection.

Is this a known bug? What is going wrong?

Regards,
Michael Curtis