Hi Christoph,

First, a big caveat and disclaimer: I'm not sure any Open MPI developers still
have access to Cray XC systems, so all I can do is make suggestions.

What's probably happening is that ORTE thinks it is going to fork off the
application processes on the head node itself. That isn't going to work on the
XC's Aries network.
I'm not sure what changed in ORTE between 4.0.x and 4.1.x to cause this
difference, but could you set the following ORTE MCA parameter and see if the
problem goes away?

export ORTE_MCA_ras_base_launch_orted_on_hn=1
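
(If it's easier to test, the same parameter can also be passed on the mpirun
command line; ./a.out below is just a placeholder for your application.)

mpirun --mca ras_base_launch_orted_on_hn 1 -np 1 ./a.out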

What batch scheduler is your system using?

Howard

On 7/1/24, 2:11 PM, "users on behalf of Borchert, Christopher B
ERDC-RDE-ITL-MS CIV via users" <users-boun...@lists.open-mpi.org
on behalf of users@lists.open-mpi.org> wrote:


On a Cray XC (which requires the aprun launcher to get from the batch node to
the compute nodes), 4.0.5 works, but 4.1.1 and 4.1.6 do not (even on a single
node). The newer versions throw this:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------


With all three versions, when I add -d to mpirun, the output shows aprun being
called. However, the two newer versions add an invalid flag, -L. It doesn't
matter whether the -L is followed by a batch node name or a compute node name.
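
(For reference, a minimal invocation along these lines shows the launcher
command; ./a.out is just a placeholder for the test binary, and adding
plm_base_verbose prints extra detail about how the aprun command line is
built.)

mpirun -d -np 1 ./a.out
mpirun -d --mca plm_base_verbose 100 -np 1 ./a.out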


4.0.5:
[batch7:78642] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1
orted -mca orte_debug 1 -mca ess_base_jobid 3787849728 -mca ess_base_vpid 1
-mca ess_base_num_procs 2 -mca orte_node_regex batch[1:7],[3:132]@0(2) -mca
orte_hnp_uri 3787849728.0;tcp://10.128.13.251:34149


4.1.1:
[batch7:75094] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L batch7
orted -mca orte_debug 1 -mca ess_base_jobid 4154589184 -mca ess_base_vpid 1
-mca ess_base_num_procs 2 -mca orte_node_regex mpirun,batch[1:7]@0(2) -mca
orte_hnp_uri 4154589184.0;tcp://10.128.13.251:56589
aprun: -L node_list contains an invalid entry


4.1.6:
[batch20:43065] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L
nid00140 orted -mca orte_debug 1 -mca ess_base_jobid 115474432 -mca
ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex
batch[2:20],nid[5:140]@0(2) -mca orte_hnp_uri
115474432.0;tcp://10.128.1.39:51455
aprun: -L node_list contains an invalid entry


How can I get this -L argument removed?


Thanks, Chris


