Did you receive this email?

On Wednesday, November 23, 2022, timesir <mrlong...@gmail.com> wrote:

>
> *1. This command now runs correctly:*
>
> *(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
> 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
>
>
>
> *2. But this command gets stuck. It seems to be the MPI program itself that hangs.*
>
> *[computer01:47982] mca: base: component_find: searching NULL for plm
> components [computer01:47982] mca: base: find_dyn_components: checking NULL
> for plm components [computer01:47982] pmix:mca: base: components_register:
> registering framework plm components [computer01:47982] pmix:mca: base:
> components_register: found loaded component slurm [computer01:47982]
> pmix:mca: base: components_register: component slurm register function
> successful [computer01:47982] pmix:mca: base: components_register: found
> loaded component ssh [computer01:47982] pmix:mca: base:
> components_register: component ssh register function successful
> [computer01:47982] mca: base: components_open: opening plm components
> [computer01:47982] mca: base: components_open: found loaded component slurm
> [computer01:47982] mca: base: components_open: component slurm open
> function successful [computer01:47982] mca: base: components_open: found
> loaded component ssh [computer01:47982] mca: base: components_open:
> component ssh open function successful [computer01:47982] mca:base:select:
> Auto-selecting plm components [computer01:47982] mca:base:select:(  plm)
> Querying component [slurm] [computer01:47982] mca:base:select:(  plm)
> Querying component [ssh] [computer01:47982] [[INVALID],0] plm:ssh_lookup on
> agent ssh : rsh path NULL [computer01:47982] mca:base:select:(  plm) Query
> of component [ssh] set priority to 10 [computer01:47982] mca:base:select:(
> plm) Selected component [ssh] [computer01:47982] mca: base: close:
> component slurm closed [computer01:47982] mca: base: close: unloading
> component slurm [computer01:47982] [prterun-computer01-47982@0,0]
> plm:ssh_setup on agent ssh : rsh path NULL [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive start comm
> [computer01:47982] mca: base: component_find: searching NULL for ras
> components [computer01:47982] mca: base: find_dyn_components: checking NULL
> for ras components [computer01:47982] pmix:mca: base: components_register:
> registering framework ras components [computer01:47982] pmix:mca: base:
> components_register: found loaded component simulator [computer01:47982]
> pmix:mca: base: components_register: component simulator register function
> successful [computer01:47982] pmix:mca: base: components_register: found
> loaded component pbs [computer01:47982] pmix:mca: base:
> components_register: component pbs register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded
> component slurm [computer01:47982] pmix:mca: base: components_register:
> component slurm register function successful [computer01:47982] mca: base:
> components_open: opening ras components [computer01:47982] mca: base:
> components_open: found loaded component simulator [computer01:47982] mca:
> base: components_open: found loaded component pbs [computer01:47982] mca:
> base: components_open: component pbs open function successful
> [computer01:47982] mca: base: components_open: found loaded component slurm
> [computer01:47982] mca: base: components_open: component slurm open
> function successful [computer01:47982] mca:base:select: Auto-selecting ras
> components [computer01:47982] mca:base:select:(  ras) Querying component
> [simulator] [computer01:47982] mca:base:select:(  ras) Querying component
> [pbs] [computer01:47982] mca:base:select:(  ras) Querying component [slurm]
> [computer01:47982] mca:base:select:(  ras) No component selected!
> [computer01:47982] mca: base: component_find: searching NULL for rmaps
> components [computer01:47982] mca: base: find_dyn_components: checking NULL
> for rmaps components [computer01:47982] pmix:mca: base:
> components_register: registering framework rmaps components
> [computer01:47982] pmix:mca: base: components_register: found loaded
> component ppr [computer01:47982] pmix:mca: base: components_register:
> component ppr register function successful [computer01:47982] pmix:mca:
> base: components_register: found loaded component rank_file
> [computer01:47982] pmix:mca: base: components_register: component rank_file
> has no register or open function [computer01:47982] pmix:mca: base:
> components_register: found loaded component round_robin [computer01:47982]
> pmix:mca: base: components_register: component round_robin register
> function successful [computer01:47982] pmix:mca: base: components_register:
> found loaded component seq [computer01:47982] pmix:mca: base:
> components_register: component seq register function successful
> [computer01:47982] mca: base: components_open: opening rmaps components
> [computer01:47982] mca: base: components_open: found loaded component ppr
> [computer01:47982] mca: base: components_open: component ppr open function
> successful [computer01:47982] mca: base: components_open: found loaded
> component rank_file [computer01:47982] mca: base: components_open: found
> loaded component round_robin [computer01:47982] mca: base: components_open:
> component round_robin open function successful [computer01:47982] mca:
> base: components_open: found loaded component seq [computer01:47982] mca:
> base: components_open: component seq open function successful
> [computer01:47982] mca:rmaps:select: checking available component ppr
> [computer01:47982] mca:rmaps:select: Querying component [ppr]
> [computer01:47982] mca:rmaps:select: checking available component rank_file
> [computer01:47982] mca:rmaps:select: Querying component [rank_file]
> [computer01:47982] mca:rmaps:select: checking available component
> round_robin [computer01:47982] mca:rmaps:select: Querying component
> [round_robin] [computer01:47982] mca:rmaps:select: checking available
> component seq [computer01:47982] mca:rmaps:select: Querying component [seq]
> [computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
> [computer01:47982]     Mapper: rank_file Priority: 100
> [computer01:47982]     Mapper: ppr Priority: 90
> [computer01:47982]     Mapper: seq Priority: 60
> [computer01:47982]     Mapper: round_robin Priority: 10
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing
> found in module - proceeding to hostfile [computer01:47982]
> [prterun-computer01-47982@0,0] ras:base:allocate adding hostfile hosts
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking
> hostfile hosts for nodes [computer01:47982] [prterun-computer01-47982@0,0]
> hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
> 192.168.60.203 is being included - keep all is FALSE [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
> 192.168.60.203 slots 1 [computer01:47982] [prterun-computer01-47982@0,0]
> ras:base:node_insert inserting 2 nodes [computer01:47982]
> [prterun-computer01-47982@0,0] ras:base:node_insert updating HNP
> [192.168.180.48] info to 1 slots [computer01:47982]
> [prterun-computer01-47982@0,0] ras:base:node_insert node 192.168.60.203
> slots 1
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         Flags: SLOTS_GIVEN
>         aliases: NONE
> =================================================================
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
> creating map [computer01:47982] [prterun-computer01-47982@0,0] setup:vm:
> working unmanaged allocation [computer01:47982]
> [prterun-computer01-47982@0,0] using hostfile hosts [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
> 192.168.180.48 is being included - keep all is FALSE [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being
> included - keep all is FALSE [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
> 192.168.60.203 slots 1 [computer01:47982] [prterun-computer01-47982@0,0]
> checking node 192.168.180.48 [computer01:47982]
> [prterun-computer01-47982@0,0] ignoring myself [computer01:47982]
> [prterun-computer01-47982@0,0] checking node 192.168.60.203
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new
> daemon [prterun-computer01-47982@0,1] [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:setup_vm assigning new daemon
> [prterun-computer01-47982@0,1] to node 192.168.60.203 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:ssh: launching vm [computer01:47982]
> [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same
> remote shell as local shell [computer01:47982]
> [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 (bash)
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template
> argv:     /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export
> PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
> LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
> DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env"
> --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca
> ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca
> prte_hnp_uri
> "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
> <prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
> --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100"
> --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1"
> --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri
> "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
> <prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0
> not a child of mine [computer01:47982] [prterun-computer01-47982@0,0]
> plm:ssh: adding node 192.168.60.203 to launch list [computer01:47982]
> [prterun-computer01-47982@0,0] plm:ssh: activating launch event
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch
> of daemon [prterun-computer01-47982@0,1] [computer01:47982]
> [prterun-computer01-47982@0,0] plm:ssh: executing: (/usr/bin/ssh)
> [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export
> PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export
> LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
> DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env"
> --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca
> ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri
> "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
> <prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
> --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100"
> --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1"
> --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri
> "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
> <prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>]
> [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on
> node computer02 [computer01:47982] ALIASES FOR NODE computer02 (computer02)
> [computer01:47982]     ALIAS: 192.168.60.203 [computer01:47982]     ALIAS:
> computer02 [computer01:47982]     ALIAS: 172.17.180.203 [computer01:47982]
>     ALIAS: 172.168.10.23 [computer01:47982]     ALIAS: 172.168.10.143
> [computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG
> 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
> [computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING
> SIGNATURE [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:orted_report_launch completed for daemon
> [prterun-computer01-47982@0,1] at contact
> prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
> <prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24>
> [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2
> reported daemons [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:receive processing msg [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive job launch command from
> [prterun-computer01-47982@0,0] [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive adding hosts
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling
> spawn [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
> done processing commands [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:setup_job [computer01:47982] [prterun-computer01-47982@0,0]
> ras:base:allocate [computer01:47982] [prterun-computer01-47982@0,0]
> ras:base:allocate allocation already read [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:setup_vm [computer01:47982]
> [prterun-computer01-47982@0,0] plm_base:setup_vm NODE computer02 WAS NOT
> ADDED [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
> no new daemons required [computer01:47982] mca:rmaps: mapping job
> prterun-computer01-47982@1 [computer01:47982] mca:rmaps: setting mapping
> policies for job prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
> [computer01:47982] mca:rmaps[355] mapping not given - using bycore
> [computer01:47982] setdefaultbinding[314] binding not given - using bycore
> [computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using
> rankfile policy [computer01:47982] mca:rmaps:ppr: job
> prterun-computer01-47982@1 not using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job
> prterun-computer01-47982@1 [computer01:47982] mca:rmaps:seq: job
> prterun-computer01-47982@1 not using seq mapper [computer01:47982]
> mca:rmaps:rr: mapping job prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] using hostfile hosts [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node
> 192.168.180.48 is being included - keep all is FALSE [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being
> included - keep all is FALSE [computer01:47982]
> [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node
> 192.168.60.203 slots 1 [computer01:47982] NODE computer01 DOESNT MATCH NODE
> 192.168.60.203 [computer01:47982] [prterun-computer01-47982@0,0] node
> computer01 has 1 slots available [computer01:47982]
> [prterun-computer01-47982@0,0] node computer02 has 1 slots available
> [computer01:47982] AVAILABLE NODES FOR MAPPING: [computer01:47982]
> node: computer01 daemon: 0 slots_available: 1 [computer01:47982]     node:
> computer02 daemon: 1 slots_available: 1 [computer01:47982] mca:rmaps:rr:
> mapping by Core for job prterun-computer01-47982@1 slots 2 num_procs 2
> [computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
> [computer01:47982] mca:rmaps:rr: assigning nprocs 1 [computer01:47982]
> mca:rmaps:rr: assigning proc to object 0 [computer01:47982]
> [prterun-computer01-47982@0,0] get_avail_ncpus: node computer01 has 0 procs
> on it [computer01:47982] mca:rmaps: compute bindings for job
> prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
> [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID]
> with policy CORE:IF-SUPPORTED [computer01:47982]
> [prterun-computer01-47982@0,0] BOUND PROC
> [prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
> [computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
> [computer01:47982] mca:rmaps:rr: assigning nprocs 1 [computer01:47982]
> mca:rmaps:rr: assigning proc to object 0 [computer01:47982]
> [prterun-computer01-47982@0,0] get_avail_ncpus: node computer02 has 0 procs
> on it [computer01:47982] mca:rmaps: compute bindings for job
> prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
> [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID]
> with policy CORE:IF-SUPPORTED [computer01:47982]
> [prterun-computer01-47982@0,0] BOUND PROC
> [prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
> [computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job
> prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:launch_apps for job
> prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:send launch msg for job
> prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive processing msg
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local
> launch complete command from [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
> local launch complete for job prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive got local launch complete
> for vpid 1 [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:receive got local launch complete for vpid 1 state RUNNING
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done
> processing commands [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:launch wiring up iof for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive
> processing msg [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:receive registered command from [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got
> registered for job prterun-computer01-47982@1 [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:receive got registered for vpid 1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done
> processing commands [computer01:47982] [prterun-computer01-47982@0,0]
> plm:base:launch prterun-computer01-47982@1 registered [computer01:47982]
> [prterun-computer01-47982@0,0] plm:base:prted_cmd sending prted_exit commands
>
> #### (Ctrl+C pressed here) Abort is in progress...hit ctrl-c again to forcibly terminate*
>
>
>
> On 2022/11/21 21:26, Jeff Squyres (jsquyres) wrote:
>
> Thanks for the output!  It looks like this is an actual bug in the 5.0rc9
> tarball.  It stems from a mis-handling of topology mismatches between your
> two computers (which *should* work just fine).
>
> I have filed a PR with the fixes: https://github.com/open-mpi/ompi/pull/11096
>
> That being said, I know the Open MPI v5.0.0 release managers were
> sidetracked for the past 2 weeks, and this week is the Thanksgiving holiday
> in the US, which generally results in some delays because people are taking
> time off.
>
> Bottom line: I don't know when this PR will get merged, so I made a new
> unofficial tarball based on that PR.  Can you give this a whirl and see if
> it fixes your problem?
>
> https://www-lb.open-mpi.org/~jsquyres/unofficial/openmpi-gitclone-pr11096.tar.bz2
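>
> (For reference, a typical build from that tarball might look like the
> following; this is only a rough sketch, the extracted directory name is an
> assumption, and the --prefix matches the /usr/local/openmpi install path
> that appears in your logs:)
>
>     wget https://www-lb.open-mpi.org/~jsquyres/unofficial/openmpi-gitclone-pr11096.tar.bz2
>     tar xjf openmpi-gitclone-pr11096.tar.bz2
>     cd openmpi-gitclone-pr11096    # assumed directory name
>     ./configure --prefix=/usr/local/openmpi
>     make -j && make install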
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* timesir <mrlong...@gmail.com>
> *Sent:* Friday, November 18, 2022 10:55 PM
> *To:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
>
>
>
> *1. Additional information: switching to Intel MPI works fine on all four machines.*
>
> *2. Attached are the config.log and the output of "ompi_info --all" for both machines.*
>
> *3. Here is the output of the command you asked for:*
>
>
>
> *(py3.9) ➜  /share   ompi_info --version*
> Open MPI v5.0.0rc9
>
> https://www.open-mpi.org/community/help/
>
> *(py3.9) ➜  /share  cat hosts*
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
>
> *(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
> 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
> [computer01:50653] mca: base: component_find: searching NULL for plm
> components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for plm
> components
> [computer01:50653] pmix:mca: base: components_register: registering
> framework plm components
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component slurm
> [computer01:50653] pmix:mca: base: components_register: component slurm
> register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component ssh
> [computer01:50653] pmix:mca: base: components_register: component ssh
> register function successful
> [computer01:50653] mca: base: components_open: opening plm components
> [computer01:50653] mca: base: components_open: found loaded component slurm
> [computer01:50653] mca: base: components_open: component slurm open
> function successful
> [computer01:50653] mca: base: components_open: found loaded component ssh
> [computer01:50653] mca: base: components_open: component ssh open function
> successful
> [computer01:50653] mca:base:select: Auto-selecting plm components
> [computer01:50653] mca:base:select:(  plm) Querying component [slurm]
> [computer01:50653] mca:base:select:(  plm) Querying component [ssh]
> [computer01:50653] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path
> NULL
> [computer01:50653] mca:base:select:(  plm) Query of component [ssh] set
> priority to 10
> [computer01:50653] mca:base:select:(  plm) Selected component [ssh]
> [computer01:50653] mca: base: close: component slurm closed
> [computer01:50653] mca: base: close: unloading component slurm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh_setup on agent
> ssh : rsh path NULL
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive start
> comm
> [computer01:50653] mca: base: component_find: searching NULL for ras
> components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for ras
> components
> [computer01:50653] pmix:mca: base: components_register: registering
> framework ras components
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component simulator
> [computer01:50653] pmix:mca: base: components_register: component
> simulator register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component pbs
> [computer01:50653] pmix:mca: base: components_register: component pbs
> register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component slurm
> [computer01:50653] pmix:mca: base: components_register: component slurm
> register function successful
> [computer01:50653] mca: base: components_open: opening ras components
> [computer01:50653] mca: base: components_open: found loaded component
> simulator
> [computer01:50653] mca: base: components_open: found loaded component pbs
> [computer01:50653] mca: base: components_open: component pbs open function
> successful
> [computer01:50653] mca: base: components_open: found loaded component slurm
> [computer01:50653] mca: base: components_open: component slurm open
> function successful
> [computer01:50653] mca:base:select: Auto-selecting ras components
> [computer01:50653] mca:base:select:(  ras) Querying component [simulator]
> [computer01:50653] mca:base:select:(  ras) Querying component [pbs]
> [computer01:50653] mca:base:select:(  ras) Querying component [slurm]
> [computer01:50653] mca:base:select:(  ras) No component selected!
> [computer01:50653] mca: base: component_find: searching NULL for rmaps
> components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for rmaps
> components
> [computer01:50653] pmix:mca: base: components_register: registering
> framework rmaps components
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component ppr
> [computer01:50653] pmix:mca: base: components_register: component ppr
> register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component rank_file
> [computer01:50653] pmix:mca: base: components_register: component
> rank_file has no register or open function
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component round_robin
> [computer01:50653] pmix:mca: base: components_register: component
> round_robin register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded
> component seq
> [computer01:50653] pmix:mca: base: components_register: component seq
> register function successful
> [computer01:50653] mca: base: components_open: opening rmaps components
> [computer01:50653] mca: base: components_open: found loaded component ppr
> [computer01:50653] mca: base: components_open: component ppr open function
> successful
> [computer01:50653] mca: base: components_open: found loaded component
> rank_file
> [computer01:50653] mca: base: components_open: found loaded component
> round_robin
> [computer01:50653] mca: base: components_open: component round_robin open
> function successful
> [computer01:50653] mca: base: components_open: found loaded component seq
> [computer01:50653] mca: base: components_open: component seq open function
> successful
> [computer01:50653] mca:rmaps:select: checking available component ppr
> [computer01:50653] mca:rmaps:select: Querying component [ppr]
> [computer01:50653] mca:rmaps:select: checking available component rank_file
> [computer01:50653] mca:rmaps:select: Querying component [rank_file]
> [computer01:50653] mca:rmaps:select: checking available component
> round_robin
> [computer01:50653] mca:rmaps:select: Querying component [round_robin]
> [computer01:50653] mca:rmaps:select: checking available component seq
> [computer01:50653] mca:rmaps:select: Querying component [seq]
> [computer01:50653] [prterun-computer01-50653@0,0]: Final mapper priorities
> [computer01:50653]      Mapper: ppr Priority: 90
> [computer01:50653]      Mapper: seq Priority: 60
> [computer01:50653]      Mapper: round_robin Priority: 10
> [computer01:50653]      Mapper: rank_file Priority: 0
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> nothing found in module - proceeding to hostfile
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> adding hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking
> hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.60.203 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert
> inserting 2 nodes
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert
> updating HNP [192.168.180.48] info to 1 slots
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert
> node 192.168.60.203 slots 1
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         Flags: SLOTS_GIVEN
>         aliases: NONE
> =================================================================
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> creating map
> [computer01:50653] [prterun-computer01-50653@0,0] setup:vm: working
> unmanaged allocation
> [computer01:50653] [prterun-computer01-50653@0,0] using hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking
> hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.60.203 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] checking node
> 192.168.180.48
> [computer01:50653] [prterun-computer01-50653@0,0] ignoring myself
> [computer01:50653] [prterun-computer01-50653@0,0] checking node
> 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm add
> new daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> assigning new daemon [prterun-computer01-50653@0,1] to node 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: launching vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: local shell: 0
> (bash)
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: assuming same
> remote shell as local shell
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: remote shell:
> 0 (bash)
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: final template
> argv:
>         /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export
> PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/
> local/openmpi/lib:$LD_LIBRARY_
> PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/
> usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export
> DYLD_LIBRARY_PATH;/usr/local/openmpi/b
> in/prted --prtemca ess "env" --prtemca ess_base_nspace
> "prterun-computer01-50653@0" --prtemca ess_base_vpid "<template>"
> --prtemca ess_base_num_procs "2" --
> prtemca prte_hnp_uri "prterun-computer01-50653@0.0;
> tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.
> 100.24,172.168.10.144,192.168.122.1:38155:24,16
> ,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca
> rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca
> pmix_session_server "1" --prtem
> ca plm "ssh" --tree-spawn --prtemca prte_parent_uri "
> prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.
> 180.205,172.168.10.24,172.168.100.24,172.168.1
> 0.144,192.168.122.1:38155:24,16,24,24,24,24"
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh:launch daemon 0
> not a child of mine
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: adding node
> 192.168.60.203 to launch list
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: activating
> launch event
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: recording
> launch of daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: executing:
> (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 
> PRTE_PREFIX=/usr/local/openmpi;export
> PRTE
> _PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/
> openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_
> PATH=/usr/local/openmpi/lib:/usr/
> local/openmpi/lib:$DYLD_LIBRARY_PATH;export 
> DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted
> --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01
> -50653@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2"
> --prtemca prte_hnp_uri "prterun-computer01-50653@0.0;
> tcp://192.168.180.48,172.17.180.20
> 5,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24"
> --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --p
> rtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca
> plm "ssh" --tree-spawn --prtemca prte_parent_uri "
> prterun-computer01-50653@0.0;tcp
> ://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.
> 24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24"]
> [computer01:50653] [prterun-computer01-50653@0,0]
> plm:base:orted_report_launch from daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0]
> plm:base:orted_report_launch from daemon [prterun-computer01-50653@0,1]
> on node computer02
> [computer01:50653] ALIASES FOR NODE computer02 (computer02)
> [computer01:50653]      ALIAS: 192.168.60.203
> [computer01:50653]      ALIAS: computer02
> [computer01:50653]      ALIAS: 172.17.180.203
> [computer01:50653]      ALIAS: 172.168.10.23
> [computer01:50653]      ALIAS: 172.168.10.143
> [computer01:50653] [prterun-computer01-50653@0,0] RECEIVED TOPOLOGY SIG
> 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
> [computer01:50653] [prterun-computer01-50653@0,0] NEW TOPOLOGY - ADDING
> SIGNATURE
> [computer01:50653] [prterun-computer01-50653@0,0]
> plm:base:orted_report_launch completed for daemon
> [prterun-computer01-50653@0,1] at contact prterun-comput
> er01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.
> 10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24
> [computer01:50653] [prterun-computer01-50653@0,0]
> plm:base:orted_report_launch job prterun-computer01-50653@0 recvd 2 of 2
> reported daemons
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive
> processing msg
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive job
> launch command from [prterun-computer01-50653@0,0]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive adding
> hosts
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive
> calling spawn
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive done
> processing commands
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_job
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> allocation already read
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm_base:setup_vm NODE
> computer02 WAS NOT ADDED
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm no
> new daemons required
> [computer01:50653] mca:rmaps: mapping job prterun-computer01-50653@1
> [computer01:50653] mca:rmaps: setting mapping policies for job
> prterun-computer01-50653@1 inherit TRUE hwtcpus FALSE
> [computer01:50653] mca:rmaps[358] mapping not given - using bycore
> [computer01:50653] setdefaultbinding[365] binding not given - using bycore
> [computer01:50653] mca:rmaps:ppr: job prterun-computer01-50653@1 not
> using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:50653] [prterun-computer01-50653@0,0] rmaps:seq called on job
> prterun-computer01-50653@1
> [computer01:50653] mca:rmaps:seq: job prterun-computer01-50653@1 not
> using seq mapper
> [computer01:50653] mca:rmaps:rr: mapping job prterun-computer01-50653@1
> [computer01:50653] [prterun-computer01-50653@0,0] using hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking
> hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node
> 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node
> 192.168.60.203 slots 1
> [computer01:50653] NODE computer01 DOESNT MATCH NODE 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] node computer01 has 1
> slots available
> [computer01:50653] [prterun-computer01-50653@0,0] node computer02 lacks
> topology
> [computer01:50653] AVAILABLE NODES FOR MAPPING:
> [computer01:50653]     node: computer01 daemon: 0 slots_available: 1
> [computer01:50653] mca:rmaps:rr: mapping by Core for job
> prterun-computer01-50653@1 slots 1 num_procs 2
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_cmd
> sending kill_local_procs cmds
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   uptime
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:prted_cmd
> sending prted_exit commands
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive stop
> comm
> [computer01:50653] mca: base: close: component ssh closed
> [computer01:50653] mca: base: close: unloading component ssh
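>
> (As an aside, the two workarounds that help message points at would look
> roughly like this; a sketch only, since they sidestep the mapping problem
> rather than fix the underlying bug:)
>
>     # either let PRRTE oversubscribe the slots it does see:
>     mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE uptime
>     # or advertise more slots per node in the hostfile, e.g.:
>     #   192.168.180.48 slots=2
>     #   192.168.60.203 slots=2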
>
>
>
> On 2022/11/19 01:01, Jeff Squyres (jsquyres) wrote:
>
> Actually, I guess I see a reason we're not getting all the output I
> expect: can you rebuild Open MPI with the --enable-debug configure command
> line option, and then re-run all of those commands again?  We should get
> more output this time.
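>
> (Concretely, something along these lines; a rough sketch that assumes you
> rebuild from the same source tree and install to the /usr/local/openmpi
> prefix seen in your logs:)
>
>     ./configure --prefix=/usr/local/openmpi --enable-debug
>     make -j && make install
>     # then re-run the mpirun commands with the same verbose flags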
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> *Sent:* Friday, November 18, 2022 11:52 AM
> *To:* timesir <mrlong...@gmail.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
> Ok, this is a good / consistent output.  That being said, I don't grok
> what is happening here: it says it finds 2 slots, but then it tells you it
> doesn't have enough slots.
>
> Let me dig deeper and get back to you...
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* timesir <mrlong...@gmail.com>
> *Sent:* Friday, November 18, 2022 10:20 AM
> *To:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
> *(py3.9) ➜  /share   ompi_info --version*
>
> Open MPI v5.0.0rc9
>
> https://www.open-mpi.org/community/help/
>
>
> *(py3.9) ➜  /share  cat hosts*
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
>
> *(py3.9) ➜  /share  mpirun -n 2 --machinefile hosts --mca
> plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose
> 100 uptime*
> [computer01:53933] mca: base: component_find: searching NULL for plm
> components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for plm
> components
> [computer01:53933] pmix:mca: base: components_register: registering
> framework plm components
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component slurm
> [computer01:53933] pmix:mca: base: components_register: component slurm
> register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component ssh
> [computer01:53933] pmix:mca: base: components_register: component ssh
> register function successful
> [computer01:53933] mca: base: components_open: opening plm components
> [computer01:53933] mca: base: components_open: found loaded component slurm
> [computer01:53933] mca: base: components_open: component slurm open
> function successful
> [computer01:53933] mca: base: components_open: found loaded component ssh
> [computer01:53933] mca: base: components_open: component ssh open function
> successful
> [computer01:53933] mca:base:select: Auto-selecting plm components
> [computer01:53933] mca:base:select:(  plm) Querying component [slurm]
> [computer01:53933] mca:base:select:(  plm) Querying component [ssh]
> [computer01:53933] mca:base:select:(  plm) Query of component [ssh] set
> priority to 10
> [computer01:53933] mca:base:select:(  plm) Selected component [ssh]
> [computer01:53933] mca: base: close: component slurm closed
> [computer01:53933] mca: base: close: unloading component slurm
> [computer01:53933] mca: base: component_find: searching NULL for ras
> components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for ras
> components
> [computer01:53933] pmix:mca: base: components_register: registering
> framework ras components
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component simulator
> [computer01:53933] pmix:mca: base: components_register: component
> simulator register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component pbs
> [computer01:53933] pmix:mca: base: components_register: component pbs
> register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component slurm
> [computer01:53933] pmix:mca: base: components_register: component slurm
> register function successful
> [computer01:53933] mca: base: components_open: opening ras components
> [computer01:53933] mca: base: components_open: found loaded component
> simulator
> [computer01:53933] mca: base: components_open: found loaded component pbs
> [computer01:53933] mca: base: components_open: component pbs open function
> successful
> [computer01:53933] mca: base: components_open: found loaded component slurm
> [computer01:53933] mca: base: components_open: component slurm open
> function successful
> [computer01:53933] mca:base:select: Auto-selecting ras components
> [computer01:53933] mca:base:select:(  ras) Querying component [simulator]
> [computer01:53933] mca:base:select:(  ras) Querying component [pbs]
> [computer01:53933] mca:base:select:(  ras) Querying component [slurm]
> [computer01:53933] mca:base:select:(  ras) No component selected!
> [computer01:53933] mca: base: component_find: searching NULL for rmaps
> components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for rmaps
> components
> [computer01:53933] pmix:mca: base: components_register: registering
> framework rmaps components
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component ppr
> [computer01:53933] pmix:mca: base: components_register: component ppr
> register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component rank_file
> [computer01:53933] pmix:mca: base: components_register: component
> rank_file has no register or open function
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component round_robin
> [computer01:53933] pmix:mca: base: components_register: component
> round_robin register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded
> component seq
> [computer01:53933] pmix:mca: base: components_register: component seq
> register function successful
> [computer01:53933] mca: base: components_open: opening rmaps components
> [computer01:53933] mca: base: components_open: found loaded component ppr
> [computer01:53933] mca: base: components_open: component ppr open function
> successful
> [computer01:53933] mca: base: components_open: found loaded component
> rank_file
> [computer01:53933] mca: base: components_open: found loaded component
> round_robin
> [computer01:53933] mca: base: components_open: component round_robin open
> function successful
> [computer01:53933] mca: base: components_open: found loaded component seq
> [computer01:53933] mca: base: components_open: component seq open function
> successful
> [computer01:53933] mca:rmaps:select: checking available component ppr
> [computer01:53933] mca:rmaps:select: Querying component [ppr]
> [computer01:53933] mca:rmaps:select: checking available component rank_file
> [computer01:53933] mca:rmaps:select: Querying component [rank_file]
> [computer01:53933] mca:rmaps:select: checking available component
> round_robin
> [computer01:53933] mca:rmaps:select: Querying component [round_robin]
> [computer01:53933] mca:rmaps:select: checking available component seq
> [computer01:53933] mca:rmaps:select: Querying component [seq]
> [computer01:53933] [prterun-computer01-53933@0,0]: Final mapper priorities
> [computer01:53933]      Mapper: ppr Priority: 90
> [computer01:53933]      Mapper: seq Priority: 60
> [computer01:53933]      Mapper: round_robin Priority: 10
> [computer01:53933]      Mapper: rank_file Priority: 0
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         Flags: SLOTS_GIVEN
>         aliases: NONE
> =================================================================
> [computer01:53933] [prterun-computer01-53933@0,0] plm:ssh: final template
> argv:
>         /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export
> PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/
> local/openmpi/lib:$LD_LIBRARY_
> PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/
> usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_
> LIBRARY_PATH;/usr/local/openmpi/b
> in/prted --prtemca ess "env" --prtemca ess_base_nspace
> "prterun-computer01-53933@0" --prtemca ess_base_vpid "<template>"
> --prtemca ess_base_num_procs "2" --
> prtemca prte_hnp_uri "prterun-computer01-53933@0.0;
> tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.
> 100.24,172.168.10.144,192.168.122.1:42567:24,16
> ,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca
> rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca
> pmix_session_server "1" --prtem
> ca plm "ssh" --tree-spawn --prtemca prte_parent_uri "
> prterun-computer01-53933@0.0;tcp://192.168.180.48,172.17.
> 180.205,172.168.10.24,172.168.100.24,172.168.1
> 0.144,192.168.122.1:42567:24,16,24,24,24,24"
> [computer01:53933] ALIASES FOR NODE computer02 (computer02)
> [computer01:53933]      ALIAS: 192.168.60.203
> [computer01:53933]      ALIAS: computer02
> [computer01:53933]      ALIAS: 172.17.180.203
> [computer01:53933]      ALIAS: 172.168.10.23
> [computer01:53933]      ALIAS: 172.168.10.143
>
> ======================   ALLOCATED NODES   ======================
>     computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.180.48
>     computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>         Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>         aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:53933] mca:rmaps: mapping job prterun-computer01-53933@1
> [computer01:53933] mca:rmaps: setting mapping policies for job
> prterun-computer01-53933@1 inherit TRUE hwtcpus FALSE
> [computer01:53933] mca:rmaps[358] mapping not given - using bycore
> [computer01:53933] setdefaultbinding[365] binding not given - using bycore
> [computer01:53933] mca:rmaps:ppr: job prterun-computer01-53933@1 not
> using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:53933] mca:rmaps:seq: job prterun-computer01-53933@1 not
> using seq mapper
> [computer01:53933] mca:rmaps:rr: mapping job prterun-computer01-53933@1
> [computer01:53933] AVAILABLE NODES FOR MAPPING:
> [computer01:53933]     node: computer01 daemon: 0 slots_available: 1
>
> [computer01:53933] mca:rmaps:rr: mapping by Core for job
> prterun-computer01-53933@1 slots 1 num_procs 2
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   uptime
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
> [computer01:53933] mca: base: close: component ssh closed
> [computer01:53933] mca: base: close: unloading component ssh
>
>
> On 2022/11/18 22:48, Jeff Squyres (jsquyres) wrote:
>
>
