Hmmm... well, according to this, it looks like the process ranks are being
incorrectly assigned: in the -npernode map, both nodes list the same proc,
[[51942,1],0], where the second node should have gotten [[51942,1],1]. It
shouldn't have anything to do with what environment we are in (slurm, rsh, etc.).
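
Since it shouldn't be environment-specific, the same failure ought to be
reproducible outside of slurm with a plain hostfile, e.g. something like this
(the "hosts" file and its contents are just an illustration, using the two
node names from your output):
---
sh-3.1$ cat hosts
cut1n7
cut1n8
sh-3.1$ mpirun -npernode 1 --hostfile hosts --display-devel-map mpi_hello
---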

I'll look into it - thanks!
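
For reference, I'm assuming mpi_hello is just the usual MPI hello world,
something along these lines (a sketch only; the actual source isn't included
in this thread):
---
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank in MPI_COMM_WORLD */
    MPI_Get_processor_name(name, &len);     /* hostname of this process */
    printf("Hello, I am node %s with rank %d\n", name, rank);
    MPI_Finalize();
    return 0;
}
---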

On Mon, May 17, 2010 at 4:25 PM, Christopher Maestas <cdmaes...@gmail.com> wrote:

> OK.  The -np only run:
> ---
> sh-3.1$ mpirun -np 2 --display-allocation --display-devel-map mpi_hello
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51868,0],0]   Daemon launched: True
>         Num slots: 1    Slots in use: 0
>         Num slots allocated: 1  Max slots: 0
>         Username on node: NULL
>         Num procs: 0    Next node_rank: 0
>  Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: Not defined     Daemon launched: False
>         Num slots: 0    Slots in use: 0
>         Num slots allocated: 0  Max slots: 0
>         Username on node: NULL
>         Num procs: 0    Next node_rank: 0
>
> =================================================================
>
>  Map generated by mapping policy: 0400
>         Npernode: 0     Oversubscribe allowed: TRUE     CPU Lists: FALSE
>         Num new daemons: 1      New daemon starting vpid 1
>         Num nodes: 2
>
>  Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51868,0],0]   Daemon launched: True
>         Num slots: 1    Slots in use: 1
>         Num slots allocated: 1  Max slots: 0
>         Username on node: NULL
>         Num procs: 1    Next node_rank: 1
>         Data for proc: [[51868,1],0]
>                 Pid: 0  Local rank: 0   Node rank: 0
>                 State: 0        App_context: 0  Slot list: NULL
>
>  Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51868,0],1]   Daemon launched: False
>         Num slots: 0    Slots in use: 1
>         Num slots allocated: 0  Max slots: 0
>         Username on node: NULL
>         Num procs: 1    Next node_rank: 1
>         Data for proc: [[51868,1],1]
>                 Pid: 0  Local rank: 0   Node rank: 0
>                 State: 0        App_context: 0  Slot list: NULL
> Hello, I am node cut1n8 with rank 1
> Hello, I am node cut1n7 with rank 0
>
> ---
>
> Before the segfault I got (using -npernode):
> ---
> sh-3.1$ mpirun -npernode 1 --display-allocation --display-devel-map mpi_hello
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51942,0],0]   Daemon launched: True
>         Num slots: 1    Slots in use: 0
>         Num slots allocated: 1  Max slots: 0
>         Username on node: NULL
>         Num procs: 0    Next node_rank: 0
>  Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: Not defined     Daemon launched: False
>         Num slots: 0    Slots in use: 0
>         Num slots allocated: 0  Max slots: 0
>         Username on node: NULL
>         Num procs: 0    Next node_rank: 0
> =================================================================
>
>  Map generated by mapping policy: 0400
>         Npernode: 1     Oversubscribe allowed: TRUE     CPU Lists: FALSE
>         Num new daemons: 1      New daemon starting vpid 1
>         Num nodes: 2
>
>  Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51942,0],0]   Daemon launched: True
>         Num slots: 1    Slots in use: 1
>         Num slots allocated: 1  Max slots: 0
>         Username on node: NULL
>         Num procs: 1    Next node_rank: 1
>         Data for proc: [[51942,1],0]
>                 Pid: 0  Local rank: 0   Node rank: 0
>                 State: 0        App_context: 0  Slot list: NULL
>
>  Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
>         Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
>         Daemon: [[51942,0],1]   Daemon launched: False
>         Num slots: 0    Slots in use: 1
>         Num slots allocated: 0  Max slots: 0
>         Username on node: NULL
>         Num procs: 1    Next node_rank: 1
>         Data for proc: [[51942,1],0]
>                 Pid: 0  Local rank: 0   Node rank: 0
>                 State: 0        App_context: 0  Slot list: NULL
> [cut1n7:19375] *** Process received signal ***
> [cut1n7:19375] Signal: Segmentation fault (11)
> [cut1n7:19375] Signal code: Address not mapped (1)
> [cut1n7:19375] Failing at address: 0x50
> [cut1n7:19375] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
> [cut1n7:19375] [ 1]
> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb)
> [0x2aed0f93af8b]
> [cut1n7:19375] [ 2]
> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655)
> [0x2aed0f9462f5]
> [cut1n7:19375] [ 3]
> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b)
> [0x2aed0f94d31b]
> [cut1n7:19375] [ 4]
> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so
> [0x2aed107f6ecf]
> [cut1n7:19375] [ 5] mpirun [0x40335a]
> [cut1n7:19375] [ 6] mpirun [0x4029f3]
> [cut1n7:19375] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x37bce1d8b4]
> [cut1n7:19375] [ 8] mpirun [0x402929]
> [cut1n7:19375] *** End of error message ***
> Segmentation fault
> ---
>
> I'll look into a slurm version update.  Previously, SLURM 1.0.30 and Open
> MPI 1.3.2 were working together fine.  Just curious what was giving me
> heartache here ...
>
> On Mon, May 17, 2010 at 4:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> That's a pretty old version of slurm - I don't have access to anything
>> that old to test against. You could try running it with --display-allocation
>> --display-devel-map to see what ORTE thinks the allocation is and how it
>> mapped the procs. It sounds like something may be having a problem there...
>>
>>
>> On Mon, May 17, 2010 at 11:08 AM, Christopher Maestas <cdmaes...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I've been having some troubles with OpenMPI 1.4.X and slurm recently.  I
>>> seem to be able to run jobs this way ok:
>>> ---
>>> sh-3.1$ mpirun -np 2 mpi_hello
>>> Hello, I am node cut1n7 with rank 0
>>> Hello, I am node cut1n8 with rank 1
>>> ---
>>>
>>> However, if I try to use the -npernode option I get:
>>> ---
>>> sh-3.1$ mpirun -npernode 1 mpi_hello
>>> [cut1n7:16368] *** Process received signal ***
>>> [cut1n7:16368] Signal: Segmentation fault (11)
>>> [cut1n7:16368] Signal code: Address not mapped (1)
>>> [cut1n7:16368] Failing at address: 0x50
>>> [cut1n7:16368] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
>>> [cut1n7:16368] [ 1]
>>> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb)
>>> [0x2b73eb84df8b]
>>> [cut1n7:16368] [ 2]
>>> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655)
>>> [0x2b73eb8592f5]
>>> [cut1n7:16368] [ 3]
>>> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b)
>>> [0x2b73eb86031b]
>>> [cut1n7:16368] [ 4]
>>> /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so
>>> [0x2b73ec709ecf]
>>> [cut1n7:16368] [ 5] mpirun [0x40335a]
>>> [cut1n7:16368] [ 6] mpirun [0x4029f3]
>>> [cut1n7:16368] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
>>> [0x37bce1d8b4]
>>> [cut1n7:16368] [ 8] mpirun [0x402929]
>>> [cut1n7:16368] *** End of error message ***
>>> Segmentation fault
>>> ---
>>>
>>> This is ompi 1.4.2, gcc 4.1.1, and slurm 2.0.9 ... I'm sure it's a rather
>>> silly detail on my end, but I figure I should start this thread for any
>>> insights, and I'm happy to provide whatever feedback is needed to resolve this.
>>>
>>> Thanks,
>>> -cdm
>>>
