see also: https://pastebin.com/s5tjaUkF

(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

1. This command now runs correctly with your openmpi-gitclone-pr11096.tar.bz2 build:
(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime


2. But this command gets stuck. It seems to be the MPI program itself, rather than the launcher, that hangs.
test.py:
import mpi4py
from mpi4py import MPI  # MPI_Init is called during this import by default

(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm
components
[computer01:47982] pmix:mca: base: components_register: registering
framework plm components
[computer01:47982] pmix:mca: base: components_register: found loaded
component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component ssh
[computer01:47982] pmix:mca: base: components_register: component ssh
register function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open
function successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function
successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:(  plm) Querying component [slurm]
[computer01:47982] mca:base:select:(  plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:(  plm) Query of component [ssh] set
priority to 10
[computer01:47982] mca:base:select:(  plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent
ssh : rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start
comm
[computer01:47982] mca: base: component_find: searching NULL for ras
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras
components
[computer01:47982] pmix:mca: base: components_register: registering
framework ras components
[computer01:47982] pmix:mca: base: components_register: found loaded
component simulator
[computer01:47982] pmix:mca: base: components_register: component simulator
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component pbs
[computer01:47982] pmix:mca: base: components_register: component pbs
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm
register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component
simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function
successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open
function successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:(  ras) Querying component [simulator]
[computer01:47982] mca:base:select:(  ras) Querying component [pbs]
[computer01:47982] mca:base:select:(  ras) Querying component [slurm]
[computer01:47982] mca:base:select:(  ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps
components
[computer01:47982] pmix:mca: base: components_register: registering
framework rmaps components
[computer01:47982] pmix:mca: base: components_register: found loaded
component ppr
[computer01:47982] pmix:mca: base: components_register: component ppr
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component rank_file
[computer01:47982] pmix:mca: base: components_register: component rank_file
has no register or open function
[computer01:47982] pmix:mca: base: components_register: found loaded
component round_robin
[computer01:47982] pmix:mca: base: components_register: component
round_robin register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component seq
[computer01:47982] pmix:mca: base: components_register: component seq
register function successful
[computer01:47982] mca: base: components_open: opening rmaps components
[computer01:47982] mca: base: components_open: found loaded component ppr
[computer01:47982] mca: base: components_open: component ppr open function
successful
[computer01:47982] mca: base: components_open: found loaded component
rank_file
[computer01:47982] mca: base: components_open: found loaded component
round_robin
[computer01:47982] mca: base: components_open: component round_robin open
function successful
[computer01:47982] mca: base: components_open: found loaded component seq
[computer01:47982] mca: base: components_open: component seq open function
successful
[computer01:47982] mca:rmaps:select: checking available component ppr
[computer01:47982] mca:rmaps:select: Querying component [ppr]
[computer01:47982] mca:rmaps:select: checking available component rank_file
[computer01:47982] mca:rmaps:select: Querying component [rank_file]
[computer01:47982] mca:rmaps:select: checking available component
round_robin
[computer01:47982] mca:rmaps:select: Querying component [round_robin]
[computer01:47982] mca:rmaps:select: checking available component seq
[computer01:47982] mca:rmaps:select: Querying component [seq]
[computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
[computer01:47982]     Mapper: rank_file Priority: 100
[computer01:47982]     Mapper: ppr Priority: 90
[computer01:47982]     Mapper: seq Priority: 60
[computer01:47982]     Mapper: round_robin Priority: 10
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing
found in module - proceeding to hostfile
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding
hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking
hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert
inserting 2 nodes
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert
updating HNP [192.168.180.48] info to 1 slots
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node
192.168.60.203 slots 1

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
creating map
[computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working
unmanaged allocation
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking
hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] checking node
192.168.180.48
[computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
[computer01:47982] [prterun-computer01-47982@0,0] checking node
192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new
daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
assigning new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0
(bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same
remote shell as local shell
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0
(bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template
argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env"
--prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca
ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca
prte_hnp_uri
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
<prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
--prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100"
--prtemca ras_base_verbose "100" --prtemca pmix_session_server "1"
--prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
<prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0
not a child of mine
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node
192.168.60.203 to launch list
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating
launch event
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch
of daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing:
(/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203
PRTE_PREFIX=/usr/local/openmpi;export
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env"
--prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca
ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
<prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
--prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100"
--prtemca ras_base_verbose "100" --prtemca pmix_session_server "1"
--prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
<prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
]
[computer01:47982] [prterun-computer01-47982@0,0]
plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0]
plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on
node computer02
[computer01:47982] ALIASES FOR NODE computer02 (computer02)
[computer01:47982]     ALIAS: 192.168.60.203
[computer01:47982]     ALIAS: computer02
[computer01:47982]     ALIAS: 172.17.180.203
[computer01:47982]     ALIAS: 172.168.10.23
[computer01:47982]     ALIAS: 172.168.10.143
[computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG
2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
[computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING
SIGNATURE
[computer01:47982] [prterun-computer01-47982@0,0]
plm:base:orted_report_launch completed for daemon
[prterun-computer01-47982@0,1] at contact
prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
[computer01:47982] [prterun-computer01-47982@0,0]
plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2
reported daemons
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job
launch command from [prterun-computer01-47982@0,0]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding
hosts

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases:
192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling
spawn
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
allocation already read
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE
computer02 WAS NOT ADDED
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new
daemons required
[computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
[computer01:47982] mca:rmaps: setting mapping policies for job
prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
[computer01:47982] mca:rmaps[355] mapping not given - using bycore
[computer01:47982] setdefaultbinding[314] binding not given - using bycore
[computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using
rankfile policy
[computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using
ppr mapper PPR NULL policy PPR NOTSET
[computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job
prterun-computer01-47982@1
[computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using
seq mapper
[computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking
hostfile hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.180.48 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
192.168.60.203 is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
192.168.60.203 slots 1
[computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1
slots available
[computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1
slots available
[computer01:47982] AVAILABLE NODES FOR MAPPING:
[computer01:47982]     node: computer01 daemon: 0 slots_available: 1
[computer01:47982]     node: computer02 daemon: 1 slots_available: 1
[computer01:47982] mca:rmaps:rr: mapping by Core for job
prterun-computer01-47982@1 slots 2 num_procs 2
[computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node
computer01 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job
prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID]
with policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC
[prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
[computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node
computer02 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job
prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID]
with policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC
[prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
[computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job
prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for
job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg
for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local
launch complete command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
local launch complete for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
local launch complete for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
local launch complete for vpid 1 state RUNNING
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up
iof for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
processing msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
registered command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
registered for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
registered for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch
prterun-computer01-47982@1 registered
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd
sending prted_exit commands   #### (hung here; pressed Ctrl+C)
Abort is in progress...hit ctrl-c again to forcibly terminate
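
For what it's worth, a stdlib-only timeout wrapper (a sketch; `run_with_timeout` is a hypothetical helper, not part of Open MPI or mpi4py) can confirm that the command truly never completes rather than just being slow:

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (True, stdout) on completion, (False, "") on timeout.

    subprocess.run kills the child when the timeout expires, so a hanging
    mpirun does not linger after the check.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return True, proc.stdout
    except subprocess.TimeoutExpired:
        return False, ""

# A command that completes quickly:
ok, out = run_with_timeout([sys.executable, "-c", "print('alive')"],
                           timeout_s=30)
print(ok, out.strip())  # prints: True alive

# A command that never finishes (stand-in for the hanging mpirun line):
ok, _ = run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"],
                         timeout_s=2)
print(ok)  # prints: False
```

In a real check, cmd would be the full mpirun line above with a generous timeout.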
