Did you receive this email?
On Wednesday, November 23, 2022, timesir <mrlong...@gmail.com> wrote:

> *1. This command now runs correctly:*
>
> *(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
>
> *2. But this command gets stuck. It seems to be the MPI program itself that hangs. The verbose output:*
>
> [computer01:47982] mca: base: component_find: searching NULL for plm components
> [computer01:47982] mca: base: find_dyn_components: checking NULL for plm components
> [computer01:47982] pmix:mca: base: components_register: registering framework plm components
> [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
> [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded component ssh
> [computer01:47982] pmix:mca: base: components_register: component ssh register function successful
> [computer01:47982] mca: base: components_open: opening plm components
> [computer01:47982] mca: base: components_open: found loaded component slurm
> [computer01:47982] mca: base: components_open: component slurm open function successful
> [computer01:47982] mca: base: components_open: found loaded component ssh
> [computer01:47982] mca: base: components_open: component ssh open function successful
> [computer01:47982] mca:base:select: Auto-selecting plm components
> [computer01:47982] mca:base:select:( plm) Querying component [slurm]
> [computer01:47982] mca:base:select:( plm) Querying component [ssh]
> [computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
> [computer01:47982] mca:base:select:( plm) Query of component [ssh] set priority to 10
> [computer01:47982] mca:base:select:( plm) Selected component [ssh]
> [computer01:47982] mca: base: close: component slurm closed
> [computer01:47982] mca: base: close: unloading component slurm
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : rsh path NULL
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
> [computer01:47982] mca: base: component_find: searching NULL for ras components
> [computer01:47982] mca: base: find_dyn_components: checking NULL for ras components
> [computer01:47982] pmix:mca: base: components_register: registering framework ras components
> [computer01:47982] pmix:mca: base: components_register: found loaded component simulator
> [computer01:47982] pmix:mca: base: components_register: component simulator register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded component pbs
> [computer01:47982] pmix:mca: base: components_register: component pbs register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded component slurm
> [computer01:47982] pmix:mca: base: components_register: component slurm register function successful
> [computer01:47982] mca: base: components_open: opening ras components
> [computer01:47982] mca: base: components_open: found loaded component simulator
> [computer01:47982] mca: base: components_open: found loaded component pbs
> [computer01:47982] mca: base: components_open: component pbs open function successful
> [computer01:47982] mca: base: components_open: found loaded component slurm
> [computer01:47982] mca: base: components_open: component slurm open function successful
> [computer01:47982] mca:base:select: Auto-selecting ras components
> [computer01:47982] mca:base:select:( ras) Querying component [simulator]
> [computer01:47982] mca:base:select:( ras) Querying component [pbs]
> [computer01:47982] mca:base:select:( ras) Querying component [slurm]
> [computer01:47982] mca:base:select:( ras) No component selected!
> [computer01:47982] mca: base: component_find: searching NULL for rmaps components
> [computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps components
> [computer01:47982] pmix:mca: base: components_register: registering framework rmaps components
> [computer01:47982] pmix:mca: base: components_register: found loaded component ppr
> [computer01:47982] pmix:mca: base: components_register: component ppr register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded component rank_file
> [computer01:47982] pmix:mca: base: components_register: component rank_file has no register or open function
> [computer01:47982] pmix:mca: base: components_register: found loaded component round_robin
> [computer01:47982] pmix:mca: base: components_register: component round_robin register function successful
> [computer01:47982] pmix:mca: base: components_register: found loaded component seq
> [computer01:47982] pmix:mca: base: components_register: component seq register function successful
> [computer01:47982] mca: base: components_open: opening rmaps components
> [computer01:47982] mca: base: components_open: found loaded component ppr
> [computer01:47982] mca: base: components_open: component ppr open function successful
> [computer01:47982] mca: base: components_open: found loaded component rank_file
> [computer01:47982] mca: base: components_open: found loaded component round_robin
> [computer01:47982] mca: base: components_open: component round_robin open function successful
> [computer01:47982] mca: base: components_open: found loaded component seq
> [computer01:47982] mca: base: components_open: component seq open function successful
> [computer01:47982] mca:rmaps:select: checking available component ppr
> [computer01:47982] mca:rmaps:select: Querying component [ppr]
> [computer01:47982] mca:rmaps:select: checking available component rank_file
> [computer01:47982] mca:rmaps:select: Querying component [rank_file]
> [computer01:47982] mca:rmaps:select: checking available component round_robin
> [computer01:47982] mca:rmaps:select: Querying component [round_robin]
> [computer01:47982] mca:rmaps:select: checking available component seq
> [computer01:47982] mca:rmaps:select: Querying component [seq]
> [computer01:47982] [prterun-computer01-47982@0,0]: Final mapper priorities
> [computer01:47982]     Mapper: rank_file Priority: 100
> [computer01:47982]     Mapper: ppr Priority: 90
> [computer01:47982]     Mapper: seq Priority: 60
> [computer01:47982]     Mapper: round_robin Priority: 10
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate adding hostfile hosts
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert inserting 2 nodes
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:node_insert node 192.168.60.203 slots 1
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> 192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>     Flags: SLOTS_GIVEN
>     aliases: NONE
> =================================================================
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm creating map
> [computer01:47982] [prterun-computer01-47982@0,0] setup:vm: working unmanaged allocation
> [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.180.48
> [computer01:47982] [prterun-computer01-47982@0,0] ignoring myself
> [computer01:47982] [prterun-computer01-47982@0,0] checking node 192.168.60.203
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm add new daemon [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-47982@0,1] to node 192.168.60.203
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: launching vm
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: local shell: 0 (bash)
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: assuming same remote shell as local shell
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: remote shell: 0 (bash)
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: final template argv:
>     /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh:launch daemon 0 not a child of mine
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: adding node 192.168.60.203 to launch list
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: activating launch event
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: recording launch of daemon [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-47982@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24"]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-47982@0,1] on node computer02
> [computer01:47982] ALIASES FOR NODE computer02 (computer02)
> [computer01:47982]     ALIAS: 192.168.60.203
> [computer01:47982]     ALIAS: computer02
> [computer01:47982]     ALIAS: 172.17.180.203
> [computer01:47982]     ALIAS: 172.168.10.23
> [computer01:47982]     ALIAS: 172.168.10.143
> [computer01:47982] [prterun-computer01-47982@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
> [computer01:47982] [prterun-computer01-47982@0,0] NEW TOPOLOGY - ADDING SIGNATURE
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-47982@0,1] at contact prterun-computer01-47982@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:59788:24,16,24,24,24,24
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:orted_report_launch job prterun-computer01-47982@0 recvd 2 of 2 reported daemons
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive job launch command from [prterun-computer01-47982@0,0]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive adding hosts
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive calling spawn
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_job
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate
> [computer01:47982] [prterun-computer01-47982@0,0] ras:base:allocate allocation already read
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm
> [computer01:47982] [prterun-computer01-47982@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:setup_vm no new daemons required
> [computer01:47982] mca:rmaps: mapping job prterun-computer01-47982@1
> [computer01:47982] mca:rmaps: setting mapping policies for job prterun-computer01-47982@1 inherit TRUE hwtcpus FALSE
> [computer01:47982] mca:rmaps[355] mapping not given - using bycore
> [computer01:47982] setdefaultbinding[314] binding not given - using bycore
> [computer01:47982] mca:rmaps:rf: job prterun-computer01-47982@1 not using rankfile policy
> [computer01:47982] mca:rmaps:ppr: job prterun-computer01-47982@1 not using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:47982] [prterun-computer01-47982@0,0] rmaps:seq called on job prterun-computer01-47982@1
> [computer01:47982] mca:rmaps:seq: job prterun-computer01-47982@1 not using seq mapper
> [computer01:47982] mca:rmaps:rr: mapping job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] using hostfile hosts
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:47982] [prterun-computer01-47982@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:47982] NODE computer01 DOESNT MATCH NODE 192.168.60.203
> [computer01:47982] [prterun-computer01-47982@0,0] node computer01 has 1 slots available
> [computer01:47982] [prterun-computer01-47982@0,0] node computer02 has 1 slots available
> [computer01:47982] AVAILABLE NODES FOR MAPPING:
> [computer01:47982]     node: computer01 daemon: 0 slots_available: 1
> [computer01:47982]     node: computer02 daemon: 1 slots_available: 1
> [computer01:47982] mca:rmaps:rr: mapping by Core for job prterun-computer01-47982@1 slots 2 num_procs 2
> [computer01:47982] mca:rmaps:rr: found 56 Core objects on node computer01
> [computer01:47982] mca:rmaps:rr: assigning nprocs 1
> [computer01:47982] mca:rmaps:rr: assigning proc to object 0
> [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer01 has 0 procs on it
> [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
> [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
> [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer01] TO package[0][core:0]
> [computer01:47982] mca:rmaps:rr: found 64 Core objects on node computer02
> [computer01:47982] mca:rmaps:rr: assigning nprocs 1
> [computer01:47982] mca:rmaps:rr: assigning proc to object 0
> [computer01:47982] [prterun-computer01-47982@0,0] get_avail_ncpus: node computer02 has 0 procs on it
> [computer01:47982] mca:rmaps: compute bindings for job prterun-computer01-47982@1 with policy CORE:IF-SUPPORTED[1007]
> [computer01:47982] mca:rmaps: bind [prterun-computer01-47982@1,INVALID] with policy CORE:IF-SUPPORTED
> [computer01:47982] [prterun-computer01-47982@0,0] BOUND PROC [prterun-computer01-47982@1,INVALID][computer02] TO package[0][core:0]
> [computer01:47982] [prterun-computer01-47982@0,0] complete_setup on job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch_apps for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:send launch msg for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive local launch complete command from [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch wiring up iof for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive processing msg
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive registered command from [prterun-computer01-47982@0,1]
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for job prterun-computer01-47982@1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive got registered for vpid 1
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive done processing commands
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:launch prterun-computer01-47982@1 registered
> [computer01:47982] [prterun-computer01-47982@0,0] plm:base:prted_cmd sending prted_exit commands
>
> #### ctrl + c
> Abort is in progress...hit ctrl-c again to forcibly terminate
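Since plain uptime now launches cleanly while an MPI binary hangs right after "plm:base:launch ... registered", the hang most likely happens inside MPI_Init or the first inter-node message rather than in the launch machinery. A minimal test program can narrow that down; the following is only a sketch (the file name hello_mpi.c and the compile/run lines are assumptions, not taken from the thread):

    /* hello_mpi.c - minimal check: a hang in MPI_Init points at wire-up,
     * a hang in MPI_Barrier points at blocked inter-node traffic. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);
        printf("rank %d of %d on %s\n", rank, size, name);
        fflush(stdout);
        MPI_Barrier(MPI_COMM_WORLD);   /* forces inter-node communication */
        MPI_Finalize();
        return 0;
    }

Build it with /usr/local/openmpi/bin/mpicc hello_mpi.c -o hello_mpi and run it with the same mpirun line used for uptime above. Given the many interfaces visible in the URIs (192.168.x.x, 172.x.x.x), restricting TCP to the subnet both hosts actually share, for example with --mca btl_tcp_if_include 192.168.0.0/16, is a plausible thing to try; this is a guess, since multi-homed hosts picking an unroutable interface is a classic cause of MPI_Init hangs.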
On 2022/11/21 21:26, Jeff Squyres (jsquyres) wrote:

> Thanks for the output! It looks like this is an actual bug in the 5.0rc9 tarball. It stems from a mis-handling of topology mismatches between your two computers (which *should* work just fine).
>
> I have filed a PR with the fixes: https://github.com/open-mpi/ompi/pull/11096
>
> That being said, I know the Open MPI v5.0.0 release managers were sidetracked for the past 2 weeks, and this week is the Thanksgiving holiday in the US, which generally results in some delays because people are taking time off.
>
> Bottom line: I don't know when this PR will get merged, so I made a new unofficial tarball based on that PR. Can you give this a whirl and see if it fixes your problem?
>
> https://www-lb.open-mpi.org/~jsquyres/unofficial/openmpi-gitclone-pr11096.tar.bz2
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* timesir <mrlong...@gmail.com>
> *Sent:* Friday, November 18, 2022 10:55 PM
> *To:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
> *1. As additional information: switching to Intel MPI works fine on all four machines.*
> *2. Attached are the config.log and the output of ompi_info --all for the two machines.*
> *3. Here is the output of the commands you asked for:*
>
> *(py3.9) ➜ /share ompi_info --version*
> Open MPI v5.0.0rc9
>
> https://www.open-mpi.org/community/help/
>
> *(py3.9) ➜ /share cat hosts*
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
> *(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
> [computer01:50653] mca: base: component_find: searching NULL for plm components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for plm components
> [computer01:50653] pmix:mca: base: components_register: registering framework plm components
> [computer01:50653] pmix:mca: base: components_register: found loaded component slurm
> [computer01:50653] pmix:mca: base: components_register: component slurm register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded component ssh
> [computer01:50653] pmix:mca: base: components_register: component ssh register function successful
> [computer01:50653] mca: base: components_open: opening plm components
> [computer01:50653] mca: base: components_open: found loaded component slurm
> [computer01:50653] mca: base: components_open: component slurm open function successful
> [computer01:50653] mca: base: components_open: found loaded component ssh
> [computer01:50653] mca: base: components_open: component ssh open function successful
> [computer01:50653] mca:base:select: Auto-selecting plm components
> [computer01:50653] mca:base:select:( plm) Querying component [slurm]
> [computer01:50653] mca:base:select:( plm) Querying component [ssh]
> [computer01:50653] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
> [computer01:50653] mca:base:select:( plm) Query of component [ssh] set priority to 10
> [computer01:50653] mca:base:select:( plm) Selected component [ssh]
> [computer01:50653] mca: base: close: component slurm closed
> [computer01:50653] mca: base: close: unloading component slurm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh_setup on agent ssh : rsh path NULL
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive start comm
> [computer01:50653] mca: base: component_find: searching NULL for ras components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for ras components
> [computer01:50653] pmix:mca: base: components_register: registering framework ras components
> [computer01:50653] pmix:mca: base: components_register: found loaded component simulator
> [computer01:50653] pmix:mca: base: components_register: component simulator register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded component pbs
> [computer01:50653] pmix:mca: base: components_register: component pbs register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded component slurm
> [computer01:50653] pmix:mca: base: components_register: component slurm register function successful
> [computer01:50653] mca: base: components_open: opening ras components
> [computer01:50653] mca: base: components_open: found loaded component simulator
> [computer01:50653] mca: base: components_open: found loaded component pbs
> [computer01:50653] mca: base: components_open: component pbs open function successful
> [computer01:50653] mca: base: components_open: found loaded component slurm
> [computer01:50653] mca: base: components_open: component slurm open function successful
> [computer01:50653] mca:base:select: Auto-selecting ras components
> [computer01:50653] mca:base:select:( ras) Querying component [simulator]
> [computer01:50653] mca:base:select:( ras) Querying component [pbs]
> [computer01:50653] mca:base:select:( ras) Querying component [slurm]
> [computer01:50653] mca:base:select:( ras) No component selected!
> [computer01:50653] mca: base: component_find: searching NULL for rmaps components
> [computer01:50653] mca: base: find_dyn_components: checking NULL for rmaps components
> [computer01:50653] pmix:mca: base: components_register: registering framework rmaps components
> [computer01:50653] pmix:mca: base: components_register: found loaded component ppr
> [computer01:50653] pmix:mca: base: components_register: component ppr register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded component rank_file
> [computer01:50653] pmix:mca: base: components_register: component rank_file has no register or open function
> [computer01:50653] pmix:mca: base: components_register: found loaded component round_robin
> [computer01:50653] pmix:mca: base: components_register: component round_robin register function successful
> [computer01:50653] pmix:mca: base: components_register: found loaded component seq
> [computer01:50653] pmix:mca: base: components_register: component seq register function successful
> [computer01:50653] mca: base: components_open: opening rmaps components
> [computer01:50653] mca: base: components_open: found loaded component ppr
> [computer01:50653] mca: base: components_open: component ppr open function successful
> [computer01:50653] mca: base: components_open: found loaded component rank_file
> [computer01:50653] mca: base: components_open: found loaded component round_robin
> [computer01:50653] mca: base: components_open: component round_robin open function successful
> [computer01:50653] mca: base: components_open: found loaded component seq
> [computer01:50653] mca: base: components_open: component seq open function successful
> [computer01:50653] mca:rmaps:select: checking available component ppr
> [computer01:50653] mca:rmaps:select: Querying component [ppr]
> [computer01:50653] mca:rmaps:select: checking available component rank_file
> [computer01:50653] mca:rmaps:select: Querying component [rank_file]
> [computer01:50653] mca:rmaps:select: checking available component round_robin
> [computer01:50653] mca:rmaps:select: Querying component [round_robin]
> [computer01:50653] mca:rmaps:select: checking available component seq
> [computer01:50653] mca:rmaps:select: Querying component [seq]
> [computer01:50653] [prterun-computer01-50653@0,0]: Final mapper priorities
> [computer01:50653]     Mapper: ppr Priority: 90
> [computer01:50653]     Mapper: seq Priority: 60
> [computer01:50653]     Mapper: round_robin Priority: 10
> [computer01:50653]     Mapper: rank_file Priority: 0
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate adding hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert inserting 2 nodes
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert updating HNP [192.168.180.48] info to 1 slots
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:node_insert node 192.168.60.203 slots 1
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> 192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>     Flags: SLOTS_GIVEN
>     aliases: NONE
> =================================================================
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm creating map
> [computer01:50653] [prterun-computer01-50653@0,0] setup:vm: working unmanaged allocation
> [computer01:50653] [prterun-computer01-50653@0,0] using hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] checking node 192.168.180.48
> [computer01:50653] [prterun-computer01-50653@0,0] ignoring myself
> [computer01:50653] [prterun-computer01-50653@0,0] checking node 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm add new daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm assigning new daemon [prterun-computer01-50653@0,1] to node 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: launching vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: local shell: 0 (bash)
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: assuming same remote shell as local shell
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: remote shell: 0 (bash)
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: final template argv:
>     /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-50653@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24"
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh:launch daemon 0 not a child of mine
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: adding node 192.168.60.203 to launch list
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: activating launch event
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: recording launch of daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 192.168.60.203 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-50653@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24"]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-50653@0,1]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_report_launch from daemon [prterun-computer01-50653@0,1] on node computer02
> [computer01:50653] ALIASES FOR NODE computer02 (computer02)
> [computer01:50653]     ALIAS: 192.168.60.203
> [computer01:50653]     ALIAS: computer02
> [computer01:50653]     ALIAS: 172.17.180.203
> [computer01:50653]     ALIAS: 172.168.10.23
> [computer01:50653]     ALIAS: 172.168.10.143
> [computer01:50653] [prterun-computer01-50653@0,0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:64L2:64L1:64C:128H:0-127::x86_64:le FROM NODE computer02
> [computer01:50653] [prterun-computer01-50653@0,0] NEW TOPOLOGY - ADDING SIGNATURE
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_report_launch completed for daemon [prterun-computer01-50653@0,1] at contact prterun-computer01-50653@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:38155:24,16,24,24,24,24
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_report_launch job prterun-computer01-50653@0 recvd 2 of 2 reported daemons
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive processing msg
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive job launch command from [prterun-computer01-50653@0,0]
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive adding hosts
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive calling spawn
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive done processing commands
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_job
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate
> [computer01:50653] [prterun-computer01-50653@0,0] ras:base:allocate allocation already read
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm
> [computer01:50653] [prterun-computer01-50653@0,0] plm_base:setup_vm NODE computer02 WAS NOT ADDED
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:setup_vm no new daemons required
> [computer01:50653] mca:rmaps: mapping job prterun-computer01-50653@1
> [computer01:50653] mca:rmaps: setting mapping policies for job prterun-computer01-50653@1 inherit TRUE hwtcpus FALSE
> [computer01:50653] mca:rmaps[358] mapping not given - using bycore
> [computer01:50653] setdefaultbinding[365] binding not given - using bycore
> [computer01:50653] mca:rmaps:ppr: job prterun-computer01-50653@1 not using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:50653] [prterun-computer01-50653@0,0] rmaps:seq called on job prterun-computer01-50653@1
> [computer01:50653] mca:rmaps:seq: job prterun-computer01-50653@1 not using seq mapper
> [computer01:50653] mca:rmaps:rr: mapping job prterun-computer01-50653@1
> [computer01:50653] [prterun-computer01-50653@0,0] using hostfile hosts
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: checking hostfile hosts for nodes
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.180.48 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: node 192.168.60.203 is being included - keep all is FALSE
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.180.48 slots 1
> [computer01:50653] [prterun-computer01-50653@0,0] hostfile: adding node 192.168.60.203 slots 1
> [computer01:50653] NODE computer01 DOESNT MATCH NODE 192.168.60.203
> [computer01:50653] [prterun-computer01-50653@0,0] node computer01 has 1 slots available
> [computer01:50653] [prterun-computer01-50653@0,0] node computer02 lacks topology
> [computer01:50653] AVAILABLE NODES FOR MAPPING:
> [computer01:50653]     node: computer01 daemon: 0 slots_available: 1
> [computer01:50653] mca:rmaps:rr: mapping by Core for job prterun-computer01-50653@1 slots 1 num_procs 2
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:orted_cmd sending kill_local_procs cmds
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   uptime
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process. The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:prted_cmd sending prted_exit commands
> [computer01:50653] [prterun-computer01-50653@0,0] plm:base:receive stop comm
> [computer01:50653] mca: base: close: component ssh closed
> [computer01:50653] mca: base: close: unloading component ssh

On 2022/11/19 01:01, Jeff Squyres (jsquyres) wrote:

> Actually, I guess I see a reason we're not getting all the output I expect: can you rebuild Open MPI with the --enable-debug configure command line option, and then re-run all of those commands again? We should get more output this time.
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> *Sent:* Friday, November 18, 2022 11:52 AM
> *To:* timesir <mrlong...@gmail.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
> Ok, this is a good / consistent output. That being said, I don't grok what is happening here: it says it finds 2 slots, but then it tells you it doesn't have enough slots.
>
> Let me dig deeper and get back to you...
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> ------------------------------
> *From:* timesir <mrlong...@gmail.com>
> *Sent:* Friday, November 18, 2022 10:20 AM
> *To:* Jeff Squyres (jsquyres) <jsquy...@cisco.com>; users@lists.open-mpi.org; gilles.gouaillar...@gmail.com
> *Subject:* Re: users Digest, Vol 4818, Issue 1
>
> *(py3.9) ➜ /share ompi_info --version*
> Open MPI v5.0.0rc9
>
> https://www.open-mpi.org/community/help/
>
> *(py3.9) ➜ /share cat hosts*
> 192.168.180.48 slots=1
> 192.168.60.203 slots=1
>
> *(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
> [computer01:53933] mca: base: component_find: searching NULL for plm components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for plm components
> [computer01:53933] pmix:mca: base: components_register: registering framework plm components
> [computer01:53933] pmix:mca: base: components_register: found loaded component slurm
> [computer01:53933] pmix:mca: base: components_register: component slurm register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded component ssh
> [computer01:53933] pmix:mca: base: components_register: component ssh register function successful
> [computer01:53933] mca: base: components_open: opening plm components
> [computer01:53933] mca: base: components_open: found loaded component slurm
> [computer01:53933] mca: base: components_open: component slurm open function successful
> [computer01:53933] mca: base: components_open: found loaded component ssh
> [computer01:53933] mca: base: components_open: component ssh open function successful
> [computer01:53933] mca:base:select: Auto-selecting plm components
> [computer01:53933] mca:base:select:( plm) Querying component [slurm]
> [computer01:53933] mca:base:select:( plm) Querying component [ssh]
> [computer01:53933] mca:base:select:( plm) Query of component [ssh] set priority to 10
> [computer01:53933] mca:base:select:( plm) Selected component [ssh]
> [computer01:53933] mca: base: close: component slurm closed
> [computer01:53933] mca: base: close: unloading component slurm
> [computer01:53933] mca: base: component_find: searching NULL for ras components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for ras components
> [computer01:53933] pmix:mca: base: components_register: registering framework ras components
> [computer01:53933] pmix:mca: base: components_register: found loaded component simulator
> [computer01:53933] pmix:mca: base: components_register: component simulator register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded component pbs
> [computer01:53933] pmix:mca: base: components_register: component pbs register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded component slurm
> [computer01:53933] pmix:mca: base: components_register: component slurm register function successful
> [computer01:53933] mca: base: components_open: opening ras components
> [computer01:53933] mca: base: components_open: found loaded component simulator
> [computer01:53933] mca: base: components_open: found loaded component pbs
> [computer01:53933] mca: base: components_open: component pbs open function successful
> [computer01:53933] mca: base: components_open: found loaded component slurm
> [computer01:53933] mca: base: components_open: component slurm open function successful
> [computer01:53933] mca:base:select: Auto-selecting ras components
> [computer01:53933] mca:base:select:( ras) Querying component [simulator]
> [computer01:53933] mca:base:select:( ras) Querying component [pbs]
> [computer01:53933] mca:base:select:( ras) Querying component [slurm]
> [computer01:53933] mca:base:select:( ras) No component selected!
> [computer01:53933] mca: base: component_find: searching NULL for rmaps components
> [computer01:53933] mca: base: find_dyn_components: checking NULL for rmaps components
> [computer01:53933] pmix:mca: base: components_register: registering framework rmaps components
> [computer01:53933] pmix:mca: base: components_register: found loaded component ppr
> [computer01:53933] pmix:mca: base: components_register: component ppr register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded component rank_file
> [computer01:53933] pmix:mca: base: components_register: component rank_file has no register or open function
> [computer01:53933] pmix:mca: base: components_register: found loaded component round_robin
> [computer01:53933] pmix:mca: base: components_register: component round_robin register function successful
> [computer01:53933] pmix:mca: base: components_register: found loaded component seq
> [computer01:53933] pmix:mca: base: components_register: component seq register function successful
> [computer01:53933] mca: base: components_open: opening rmaps components
> [computer01:53933] mca: base: components_open: found loaded component ppr
> [computer01:53933] mca: base: components_open: component ppr open function successful
> [computer01:53933] mca: base: components_open: found loaded component rank_file
> [computer01:53933] mca: base: components_open: found loaded component round_robin
> [computer01:53933] mca: base: components_open: component round_robin open function successful
> [computer01:53933] mca: base: components_open: found loaded component seq
> [computer01:53933] mca: base: components_open: component seq open function successful
> [computer01:53933] mca:rmaps:select: checking available component ppr
> [computer01:53933] mca:rmaps:select: Querying component [ppr]
> [computer01:53933] mca:rmaps:select: checking available component rank_file
> [computer01:53933] mca:rmaps:select: Querying component [rank_file]
> [computer01:53933] mca:rmaps:select: checking available component round_robin
> [computer01:53933] mca:rmaps:select: Querying component [round_robin]
> [computer01:53933] mca:rmaps:select: checking available component seq
> [computer01:53933] mca:rmaps:select: Querying component [seq]
> [computer01:53933] [prterun-computer01-53933@0,0]: Final mapper priorities
> [computer01:53933]     Mapper: ppr Priority: 90
> [computer01:53933]     Mapper: seq Priority: 60
> [computer01:53933]     Mapper: round_robin Priority: 10
> [computer01:53933]     Mapper: rank_file Priority: 0
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> 192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>     Flags: SLOTS_GIVEN
>     aliases: NONE
> =================================================================
> [computer01:53933] [prterun-computer01-53933@0,0] plm:ssh: final template argv:
>     /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-computer01-53933@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-computer01-53933@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:42567:24,16,24,24,24,24" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca ras_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-computer01-53933@0.0;tcp://192.168.180.48,172.17.180.205,172.168.10.24,172.168.100.24,172.168.10.144,192.168.122.1:42567:24,16,24,24,24,24"
> [computer01:53933] ALIASES FOR NODE computer02 (computer02)
> [computer01:53933]     ALIAS: 192.168.60.203
> [computer01:53933]     ALIAS: computer02
> [computer01:53933]     ALIAS: 172.17.180.203
> [computer01:53933]     ALIAS: 172.168.10.23
> [computer01:53933]     ALIAS: 172.168.10.143
>
> ====================== ALLOCATED NODES ======================
> computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.180.48
> computer02: slots=1 max_slots=0 slots_inuse=0 state=UP
>     Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
>     aliases: 192.168.60.203,computer02,172.17.180.203,172.168.10.23,172.168.10.143
> =================================================================
> [computer01:53933] mca:rmaps: mapping job prterun-computer01-53933@1
> [computer01:53933] mca:rmaps: setting mapping policies for job prterun-computer01-53933@1 inherit TRUE hwtcpus FALSE
> [computer01:53933] mca:rmaps[358] mapping not given - using bycore
> [computer01:53933] setdefaultbinding[365] binding not given - using bycore
> [computer01:53933] mca:rmaps:ppr: job prterun-computer01-53933@1 not using ppr mapper PPR NULL policy PPR NOTSET
> [computer01:53933] mca:rmaps:seq: job prterun-computer01-53933@1 not using seq mapper
> [computer01:53933] mca:rmaps:rr: mapping job prterun-computer01-53933@1
> [computer01:53933] AVAILABLE NODES FOR MAPPING:
> [computer01:53933]     node: computer01 daemon: 0 slots_available: 1
> [computer01:53933] mca:rmaps:rr: mapping by Core for job prterun-computer01-53933@1 slots 1 num_procs 2
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots that were requested by the application:
>
>   uptime
>
> Either request fewer procs for your application, or make more slots
> available for use.
>
> A "slot" is the PRRTE term for an allocatable unit where we can
> launch a process. The number of slots available are defined by the
> environment in which PRRTE processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, PRRTE defaults to the number of processor cores
>
> In all the above cases, if you want PRRTE to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
> number of available slots when deciding the number of processes to
> launch.
> --------------------------------------------------------------------------
> [computer01:53933] mca: base: close: component ssh closed
> [computer01:53933] mca: base: close: unloading component ssh

On 2022/11/18 22:48, Jeff Squyres (jsquyres) wrote:
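For reference, the three remedies that this help message suggests look roughly like the following on the command line (sketches only, reusing the thread's own hosts; none of these were actually run in the thread):

    # 1. Grant more slots in the hostfile, e.g. edit hosts to read:
    #      192.168.180.48 slots=2
    #      192.168.60.203 slots=2
    mpirun -n 2 --machinefile hosts uptime

    # 2. Give the slot counts on the command line via the ":N" suffix of --host
    mpirun -n 2 --host 192.168.180.48:1,192.168.60.203:1 uptime

    # 3. Ignore the slot count entirely
    mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE uptime

Note that in this thread the slot count was not actually the problem: the mapper saw only one node because computer02 "lacks topology", which is the topology-mismatch bug that Jeff's PR addresses.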
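Jeff's earlier request to rebuild with --enable-debug corresponds to a rebuild along these lines (a sketch; the source-tree path is an assumption, and the prefix matches the /usr/local/openmpi seen in the logs):

    cd openmpi-5.0.0rc9                # unpacked source tree (assumed path)
    ./configure --prefix=/usr/local/openmpi --enable-debug
    make -j 8 all
    sudo make install                  # repeat the install on both machines

A debug build emits the extra verbose output that was missing from the earlier runs; it is slower, so it is meant for diagnosis only.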