Thanks, Howard. The env var doesn't change the behavior for me. I'm using PBS Pro.

Chris
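
P.S. One quick sanity check is whether the 4.1.x build even registers that parameter. A sketch (ompi_info ships with every Open MPI install; the grep pattern is just the parameter name you gave):

ompi_info --all | grep ras_base_launch_orted_on_hn

If nothing comes back, the setting is being silently ignored.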

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov> 
Sent: Monday, July 1, 2024 3:43 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Borchert, Christopher B ERDC-RDE-ITL-MS CIV 
<christopher.b.borch...@erdc.dren.mil>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Hi Christopher,

First, a big caveat and disclaimer: I'm not sure any Open MPI developers still 
have access to Cray XC systems, so all I can do is make suggestions.

What's probably happening is that ORTE thinks it is going to fork off the 
application processes on the head node itself. That isn't going to work on the 
XC's Aries network.
I'm not sure what changed in ORTE between 4.0.x and 4.1.x to cause this 
difference, but could you set the following ORTE MCA parameter and see if the 
problem goes away?

export ORTE_MCA_ras_base_launch_orted_on_hn=1
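
If the environment variable has no effect, the same parameter can also be passed 
directly on the mpirun command line, which sidesteps any question of env var 
prefixes. A minimal sketch, with ./hello standing in for the real application:

mpirun --mca ras_base_launch_orted_on_hn 1 -np 1 ./hello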

What batch scheduler is your system using?

Howard

On 7/1/24, 2:11 PM, "users on behalf of Borchert, Christopher B 
ERDC-RDE-ITL-MS CIV via users" <users-boun...@lists.open-mpi.org on behalf of 
users@lists.open-mpi.org> wrote:


On a Cray XC (which requires the aprun launcher to get from the batch node to 
the compute nodes), 4.0.5 works but 4.1.1 and 4.1.6 do not (even on a single 
node). The newer versions throw this:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before communicating 
back to mpirun. This could be caused by a number of factors, including an 
inability to create a connection back to mpirun due to a lack of common network 
interfaces and/or no route found between them. Please check network 
connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------


With all three versions, when I add -d to mpirun, the output shows aprun being 
called. However, the two newer versions add an invalid flag: -L. It doesn't 
matter whether the -L is followed by a batch node name or a compute node name.


4.0.5:
[batch7:78642] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 orted -mca orte_debug 1 -mca ess_base_jobid 3787849728 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[1:7],[3:132]@0(2) -mca orte_hnp_uri 3787849728.0;tcp://10.128.13.251:34149


4.1.1:
[batch7:75094] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L batch7 orted -mca orte_debug 1 -mca ess_base_jobid 4154589184 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex mpirun,batch[1:7]@0(2) -mca orte_hnp_uri 4154589184.0;tcp://10.128.13.251:56589
aprun: -L node_list contains an invalid entry


4.1.6:
[batch20:43065] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L nid00140 orted -mca orte_debug 1 -mca ess_base_jobid 115474432 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[2:20],nid[5:140]@0(2) -mca orte_hnp_uri 115474432.0;tcp://10.128.1.39:51455
aprun: -L node_list contains an invalid entry
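
For reference, the aprun lines above come from mpirun's -d output. Raising the 
PLM verbosity shows even more of the launch path; a sketch, with ./hello as a 
stand-in for the real binary:

mpirun -d --mca plm_base_verbose 10 -np 1 ./hello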


How can I get this -L argument removed?


Thanks, Chris


