Ok, this looks like the same type of output from running ring_c as from your Python MPI 
app -- good.  Using a C MPI program for testing just eliminates some possible 
variables / issues.

Ok, let's try running again, but add some more command line parameters:

mpirun -n 2 --machinefile hosts \
    --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 \
    --prtemca grpcomm_base_verbose 5 --prtemca state_base_verbose 5 \
    ./ring_c

And please send the output back here to the list.

--
Jeff Squyres
jsquy...@cisco.com
________________________________
From: timesir <mrlong...@gmail.com>
Sent: Tuesday, November 29, 2022 9:44 PM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Subject: Re: mpi program gets stuck


Do you think the information below is enough? If not, I will add more.


(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1



(py3.9) ➜  examples  mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100  ./ring_c

[computer01:74388] mca: base: component_find: searching NULL for plm components
[computer01:74388] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
plm components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:74388] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:74388] mca: base: components_open: opening plm components
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca: base: components_open: found loaded component ssh
[computer01:74388] mca: base: components_open: component ssh open function 
successful
[computer01:74388] mca:base:select: Auto-selecting plm components
[computer01:74388] mca:base:select:(  plm) Querying component [slurm]
[computer01:74388] mca:base:select:(  plm) Querying component [ssh]
[computer01:74388] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:74388] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:74388] mca:base:select:(  plm) Selected component [ssh]
[computer01:74388] mca: base: close: component slurm closed
[computer01:74388] mca: base: close: unloading component slurm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh_setup on agent ssh : 
rsh path NULL
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive start comm
[computer01:74388] mca: base: component_find: searching NULL for ras components
[computer01:74388] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
ras components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:74388] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:74388] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:74388] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:74388] mca: base: components_open: opening ras components
[computer01:74388] mca: base: components_open: found loaded component simulator
[computer01:74388] mca: base: components_open: found loaded component pbs
[computer01:74388] mca: base: components_open: component pbs open function 
successful
[computer01:74388] mca: base: components_open: found loaded component slurm
[computer01:74388] mca: base: components_open: component slurm open function 
successful
[computer01:74388] mca:base:select: Auto-selecting ras components
[computer01:74388] mca:base:select:(  ras) Querying component [simulator]
[computer01:74388] mca:base:select:(  ras) Querying component [pbs]
[computer01:74388] mca:base:select:(  ras) Querying component [slurm]
[computer01:74388] mca:base:select:(  ras) No component selected!
[computer01:74388] mca: base: component_find: searching NULL for rmaps 
components
[computer01:74388] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:74388] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:74388] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:74388] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:74388] pmix:mca: base: components_register: component rank_file has 
no register or open function
[computer01:74388] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:74388] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:74388] pmix:mca: base: components_register: found loaded component 
seq
[computer01:74388] pmix:mca: base: components_register: component seq register 
function successful
[computer01:74388] mca: base: components_open: opening rmaps components
[computer01:74388] mca: base: components_open: found loaded component ppr
[computer01:74388] mca: base: components_open: component ppr open function 
successful
[computer01:74388] mca: base: components_open: found loaded component rank_file
[computer01:74388] mca: base: components_open: found loaded component 
round_robin
[computer01:74388] mca: base: components_open: component round_robin open 
function successful
[computer01:74388] mca: base: components_open: found loaded component seq
[computer01:74388] mca: base: components_open: component seq open function 
successful
[computer01:74388] mca:rmaps:select: checking available component ppr
[computer01:74388] mca:rmaps:select: Querying component [ppr]
[computer01:74388] mca:rmaps:select: checking available component rank_file
[computer01:74388] mca:rmaps:select: Querying component [rank_file]
[computer01:74388] mca:rmaps:select: checking available component round_robin
[computer01:74388] mca:rmaps:select: Querying component [round_robin]
[computer01:74388] mca:rmaps:select: checking available component seq
[computer01:74388] mca:rmaps:select: Querying component [seq]
[computer01:74388] [prterun-computer01-74388@0,0]: Final mapper priorities
[computer01:74388]     Mapper: rank_file Priority: 100
[computer01:74388]     Mapper: ppr Priority: 90
[computer01:74388]     Mapper: seq Priority: 60
[computer01:74388]     Mapper: round_robin Priority: 10
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate nothing 
found in module - proceeding to hostfile
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate adding 
hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert 
inserting 2 nodes
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert updating 
HNP [192.168.180.48] info to 1 slots
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:node_insert node 
192.168.60.203 slots 1

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm creating map
[computer01:74388] [prterun-computer01-74388@0,0] setup:vm: working unmanaged 
allocation
[computer01:74388] [prterun-computer01-74388@0,0] using hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] checking node 192.168.180.48
[computer01:74388] [prterun-computer01-74388@0,0] ignoring myself
[computer01:74388] [prterun-computer01-74388@0,0] checking node 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm add new 
daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm assigning 
new daemon [prterun-computer01-74388@0,1] to node 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: launching vm
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: local shell: 0 (bash)
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: assuming same remote 
shell as local shell
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: remote shell: 0 
(bash)
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export 
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
 
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
 DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca 
ess_base_nspace "prterun-computer01-74388@0" --prtemca ess_base_vpid 
"<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri 
"prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"<mailto:prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24>
 --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca 
ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" 
--tree-spawn --prtemca prte_parent_uri 
"prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"<mailto:prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24>
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh:launch daemon 0 not a 
child of mine
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: adding node 
192.168.60.203 to launch list
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: activating launch 
event
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: recording launch of 
daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:ssh: executing: 
(/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 
PRTE_PREFIX=/usr/local/openmpi;export 
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
 
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
 DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca 
ess_base_nspace "prterun-computer01-74388@0" --prtemca ess_base_vpid 1 
--prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri 
"prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"<mailto:prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24>
 --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca 
ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" 
--tree-spawn --prtemca prte_parent_uri 
"prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24"<mailto:prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24>]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch 
from daemon [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch 
from daemon [prterun-computer01-74388@0,1] on node computer02
[computer01:74388] ALIASES FOR NODE computer02 (computer02)
[computer01:74388]     ALIAS: 192.168.60.203
[computer01:74388]     ALIAS: computer02
[computer01:74388]     ALIAS: 172.17.180.203
[computer01:74388]     ALIAS: 172.168.10.23
[computer01:74388]     ALIAS: 172.168.10.143
[computer01:74388] [prterun-computer01-74388@0,0] RECEIVED TOPOLOGY SIG 
2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
[computer01:74388] [prterun-computer01-74388@0,0] NEW TOPOLOGY - ADDING 
SIGNATURE
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch 
completed for daemon [prterun-computer01-74388@0,1] at contact 
prterun-computer01-74388@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59616:24,16,24,24,24,24
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:orted_report_launch 
job prterun-computer01-74388@0 recvd 2 of 2 reported daemons
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing 
msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive job launch 
command from [prterun-computer01-74388@0,0]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive adding hosts

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 
192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive calling spawn
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done 
processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_job
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate
[computer01:74388] [prterun-computer01-74388@0,0] ras:base:allocate allocation 
already read
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm
[computer01:74388] [prterun-computer01-74388@0,0] plm_base:setup_vm NODE 
computer02 WAS NOT ADDED
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:setup_vm no new 
daemons required
[computer01:74388] mca:rmaps: mapping job prterun-computer01-74388@1
[computer01:74388] mca:rmaps: setting mapping policies for job 
prterun-computer01-74388@1 inherit TRUE hwtcpus FALSE
[computer01:74388] mca:rmaps[355] mapping not given - using bycore
[computer01:74388] setdefaultbinding[314] binding not given - using bycore
[computer01:74388] mca:rmaps:rf: job prterun-computer01-74388@1 not using 
rankfile policy
[computer01:74388] mca:rmaps:ppr: job prterun-computer01-74388@1 not using ppr 
mapper PPR NULL policy PPR NOTSET
[computer01:74388] [prterun-computer01-74388@0,0] rmaps:seq called on job 
prterun-computer01-74388@1
[computer01:74388] mca:rmaps:seq: job prterun-computer01-74388@1 not using seq 
mapper
[computer01:74388] mca:rmaps:rr: mapping job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] using hostfile hosts
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:74388] [prterun-computer01-74388@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:74388] NODE computer01 DOESNT MATCH NODE 192.168.60.203
[computer01:74388] [prterun-computer01-74388@0,0] node computer01 has 1 slots 
available
[computer01:74388] [prterun-computer01-74388@0,0] node computer02 has 1 slots 
available
[computer01:74388] AVAILABLE NODES FOR MAPPING:
[computer01:74388]     node: computer01 daemon: 0 slots_available: 1
[computer01:74388]     node: computer02 daemon: 1 slots_available: 1
[computer01:74388] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-74388@1 slots 2 num_procs 2
[computer01:74388] mca:rmaps:rr: found 56 Core objects on node computer01
[computer01:74388] mca:rmaps:rr: assigning nprocs 1
[computer01:74388] mca:rmaps:rr: assigning proc to object 0
[computer01:74388] [prterun-computer01-74388@0,0] get_avail_ncpus: node 
computer01 has 0 procs on it
[computer01:74388] mca:rmaps: compute bindings for job 
prterun-computer01-74388@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:74388] mca:rmaps: bind [prterun-computer01-74388@1,INVALID] with 
policy CORE:IF-SUPPORTED
[computer01:74388] [prterun-computer01-74388@0,0] BOUND PROC 
[prterun-computer01-74388@1,INVALID][computer01] TO package[0][core:0]
[computer01:74388] mca:rmaps:rr: found 64 Core objects on node computer02
[computer01:74388] mca:rmaps:rr: assigning nprocs 1
[computer01:74388] mca:rmaps:rr: assigning proc to object 0
[computer01:74388] [prterun-computer01-74388@0,0] get_avail_ncpus: node 
computer02 has 0 procs on it
[computer01:74388] mca:rmaps: compute bindings for job 
prterun-computer01-74388@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:74388] mca:rmaps: bind [prterun-computer01-74388@1,INVALID] with 
policy CORE:IF-SUPPORTED
[computer01:74388] [prterun-computer01-74388@0,0] BOUND PROC 
[prterun-computer01-74388@1,INVALID][computer02] TO package[0][core:0]
[computer01:74388] [prterun-computer01-74388@0,0] complete_setup on job 
prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch_apps for job 
prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:send launch msg for 
job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing 
msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive local launch 
complete command from [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local 
launch complete for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local 
launch complete for vpid 1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got local 
launch complete for vpid 1 state RUNNING
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done 
processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch wiring up iof 
for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive processing 
msg
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive registered 
command from [prterun-computer01-74388@0,1]
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got 
registered for job prterun-computer01-74388@1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive got 
registered for vpid 1
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:receive done 
processing commands
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:launch 
prterun-computer01-74388@1 registered
[computer01:74388] [prterun-computer01-74388@0,0] plm:base:prted_cmd sending 
prted_exit commands
Abort is in progress...hit ctrl-c again to forcibly terminate





On 2022/11/30 00:08, Jeff Squyres (jsquyres) wrote:
(we've conversed a bit off-list; bringing this back to the list with a good 
subject to differentiate it from other digest threads)

I'm glad the tarball I provided (that included the PMIx fix) resolved running 
"uptime" for you.

Can you try running a plain C MPI program instead of a Python MPI program?  
That would just eliminate a few more variables from the troubleshooting process.

In the "examples" directory in the tarball I provided are trivial "hello world" 
and "ring" MPI programs.  A "make" should build them all.  Try running hello_c 
and ring_c.
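
For example, something along these lines should work (the extracted directory name is 
just a guess here -- substitute wherever you unpacked the tarball):

    cd openmpi-gitclone/examples
    make                                        # builds hello_c, ring_c, etc.
    mpirun -n 2 --machinefile hosts ./hello_c
    mpirun -n 2 --machinefile hosts ./ring_c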

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
________________________________
From: timesir <mrlong...@gmail.com><mailto:mrlong...@gmail.com>
Sent: Tuesday, November 29, 2022 10:42 AM
To: Jeff Squyres (jsquyres) <jsquy...@cisco.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: mpi program gets stuck


see also: https://pastebin.com/s5tjaUkF

(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

1.  This command now runs correctly using your openmpi-gitclone-pr11096.tar.bz2:
(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime


2. But this command gets stuck. It seems to be the MPI program itself that gets stuck.
test.py:
import mpi4py
from mpi4py import MPI   # importing mpi4py.MPI calls MPI_Init() by default

(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 
--mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm 
components
[computer01:47982] pmix:mca: base: components_register: registering framework 
plm components
[computer01:47982] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:47982] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
ssh
[computer01:47982] pmix:mca: base: components_register: component ssh register 
function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function 
successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function 
successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:(  plm) Querying component [slurm]
[computer01:47982] mca:base:select:(  plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:(  plm) Query of component [ssh] set 
priority to 10
[computer01:47982] mca:base:select:(  plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : 
rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
[computer01:47982] mca: base: component_find: searching NULL for ras components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:47982] pmix:mca: base: components_register: registering framework 
ras components
[computer01:47982] pmix:mca: base: components_register: found loaded component 
simulator
[computer01:47982] pmix:mca: base: components_register: component simulator 
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
pbs
[computer01:47982] pmix:mca: base: components_register: component pbs register 
function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
slurm
[computer01:47982] pmix:mca: base: components_register: component slurm 
register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function 
successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function 
successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:(  ras) Querying component [simulator]
[computer01:47982] mca:base:select:(  ras) Querying component [pbs]
[computer01:47982] mca:base:select:(  ras) Querying component [slurm]
[computer01:47982] mca:base:select:(  ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps 
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps 
components
[computer01:47982] pmix:mca: base: components_register: registering framework 
rmaps components
[computer01:47982] pmix:mca: base: components_register: found loaded component 
ppr
[computer01:47982] pmix:mca: base: components_register: component ppr register 
function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
rank_file
[computer01:47982] pmix:mca: base: components_register: component rank_file has 
no register or open function
[computer01:47982] pmix:mca: base: components_register: found loaded component 
round_robin
[computer01:47982] pmix:mca: base: components_register: component round_robin 
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component 
seq
[computer01:47982] pmix:mca: base: components_register: component seq register 
function successful
[computer01:47982] mca: base: components_open: opening rmaps components
[computer01:47982] mca: base: components_open: found loaded component ppr
[computer01:47982] mca: base: components_open: component ppr open function 
successful
[computer01:47982] mca: base: components_open: found loaded component rank_file
[computer01:47982] mca: base: components_open: found loaded component 
round_robin
[computer01:47982] mca: base: components_open: component round_robin open 
function successful
[computer01:47982] mca: base: components_open: found loaded component seq
[computer01:47982] mca: base: components_open: component seq open function 
successful
[computer01:47982] mca:rmaps:select: checking available component ppr
[computer01:47982] mca:rmaps:select: Querying component [ppr]
[computer01:47982] mca:rmaps:select: checking available component rank_file
[computer01:47982] mca:rmaps:select: Querying component [rank_file]
[computer01:47982] mca:rmaps:select: checking available component round_robin
[computer01:47982] mca:rmaps:select: Querying component [round_robin]
[computer01:47982] mca:rmaps:select: checking available component seq
[computer01:47982] mca:rmaps:select: Querying component [seq]
[computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
[computer01:47982]     Mapper: rank_file Priority: 100
[computer01:47982]     Mapper: ppr Priority: 90
[computer01:47982]     Mapper: seq Priority: 60
[computer01:47982]     Mapper: round_robin Priority: 10
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing 
found in module - proceeding to hostfile
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding 
hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert 
inserting 2 nodes
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert updating 
HNP [192.168.180.48] info to 1 slots
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node 
192.168.60.203 slots 1

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    192.168.60.203<http://192.168.60.203>: slots=1 max_slots=0 slots_inuse=0 
state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm creating map
[computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working unmanaged 
allocation
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.180.48
[computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
[computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new 
daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm assigning 
new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same remote 
shell as local shell
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 
(bash)
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export 
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
 
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
 DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca 
ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 
"<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri 
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"<mailto:prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
 --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca 
ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" 
--tree-spawn --prtemca prte_parent_uri 
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"<mailto:prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0 not a 
child of mine
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node 
192.168.60.203 to launch list
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating launch 
event
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch of 
daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing: 
(/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 
PRTE_PREFIX=/usr/local/openmpi;export 
PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
 
LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
 DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca 
ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 1 
--prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri 
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"<mailto:prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
 --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca 
ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" 
--tree-spawn --prtemca prte_parent_uri 
"prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"<mailto:prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch 
from daemon [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch 
from daemon [prterun-computer01-47982@0,1] on node computer02
[computer01:47982] ALIASES FOR NODE computer02 (computer02)
[computer01:47982]     ALIAS: 192.168.60.203
[computer01:47982]     ALIAS: computer02
[computer01:47982]     ALIAS: 172.17.180.203
[computer01:47982]     ALIAS: 172.168.10.23
[computer01:47982]     ALIAS: 172.168.10.143
[computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG 
2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
[computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING 
SIGNATURE
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch 
completed for daemon [prterun-computer01-47982@0,1] at contact 
prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch 
job prterun-computer01-47982@0 recvd 2 of 2 reported daemons
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing 
msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job launch 
command from [prterun-computer01-47982@0,0]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding hosts

======================   ALLOCATED NODES   ======================
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 
192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
=================================================================
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling spawn
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done 
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
[computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate allocation 
already read
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
[computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE 
computer02 WAS NOT ADDED
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new 
daemons required
[computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
[computer01:47982] mca:rmaps: setting mapping policies for job 
prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
[computer01:47982] mca:rmaps[355] mapping not given - using bycore
[computer01:47982] setdefaultbinding[314] binding not given - using bycore
[computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using 
rankfile policy
[computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using ppr 
mapper PPR NULL policy PPR NOTSET
[computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job 
prterun-computer01-47982@1
[computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using seq 
mapper
[computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile 
hosts for nodes
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 
is being included - keep all is FALSE
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.180.48 slots 1
[computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 
192.168.60.203 slots 1
[computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
[computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1 slots 
available
[computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1 slots 
available
[computer01:47982] AVAILABLE NODES FOR MAPPING:
[computer01:47982]     node: computer01 daemon: 0 slots_available: 1
[computer01:47982]     node: computer02 daemon: 1 slots_available: 1
[computer01:47982] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-47982@1 slots 2 num_procs 2
[computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node 
computer01 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job 
prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with 
policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC 
[prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
[computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
[computer01:47982] mca:rmaps:rr: assigning nprocs 1
[computer01:47982] mca:rmaps:rr: assigning proc to object 0
[computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node 
computer02 has 0 procs on it
[computer01:47982] mca:rmaps: compute bindings for job 
prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
[computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with 
policy CORE:IF-SUPPORTED
[computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC 
[prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
[computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job 
prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for job 
prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg for 
job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing 
msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local launch 
complete command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local 
launch complete for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local 
launch complete for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local 
launch complete for vpid 1 state RUNNING
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done 
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up iof 
for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing 
msg
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive registered 
command from [prterun-computer01-47982@0,1]
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got 
registered for job prterun-computer01-47982@1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got 
registered for vpid 1
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done 
processing commands
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch 
prterun-computer01-47982@1 registered
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd sending 
prted_exit commands    #### <-- pressed Ctrl-C here
Abort is in progress...hit ctrl-c again to forcibly terminate
