You can still use "map-by" to get what you want since you know there are four 
interfaces per node - just do "--map-by ppr:8:node". Note that you definitely 
do NOT want to list those multiple IP addresses in your hostfile - all you are 
doing is creating extra work for mpirun, which has to DNS-resolve those 
addresses back down to their common host. We then completely ignore the fact 
that you specified individual addresses, so listing them accomplishes nothing 
(other than creating extra work).
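A minimal sketch of what that looks like (hostnames here are placeholders; adjust the ppr count and -np to match your actual ranks-per-node target):

```shell
# hostfile: list each physical host ONCE, by hostname -- not per-interface IPs
cat > hostfile <<EOF
host1
host2
EOF

# ppr:8:node places 8 ranks on each node; with 2 nodes that gives 16 ranks,
# with ranks 0-7 on host1 and ranks 8-15 on host2 (sequential fill per node)
mpirun --hostfile hostfile --map-by ppr:8:node -np 16 ./osu_mbw_mr
```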

You'll need to talk to AWS about how to drive striping across the interfaces. 
It sounds like they are automatically doing it, but perhaps not according to 
the algorithm you are seeking (i.e., they may not make such a linear assignment 
as you describe).


> On Jun 8, 2021, at 1:23 PM, John Moore via users <users@lists.open-mpi.org> 
> wrote:
> 
> Hello,
> 
> I am trying to run OpenMPI on AWS's new p4d instances. These instances have 
> 4x 100 Gb/s network interfaces, each with its own ipv4 address.
> 
> I am primarily testing the bandwidth with the osu_micro_benchmarks test 
> suite. Specifically I am running the osu_bibw and osu_mbw_mr tests to 
> calculate the peak aggregate bandwidth I can achieve between two instances.
> 
> I have found that the osu_bibw test can only achieve the throughput of one 
> network interface (100 Gb/s). This is the command I am using:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
> FI_PROVIDER="efa" -np 2 -host host1,host2 --map-by node --mca 
> btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 
> ./osu_bw -m 40000000
> 
> As far as I understand it, openmpi should be detecting the four interfaces 
> and striping data across them, correct?
> 
> I have found that the osu_mbw_mr test can achieve 4x the bandwidth of a 
> single network interface, if the configuration is correct. For example, I am 
> using the following command:
> /opt/amazon/openmpi/bin/mpirun -v -x FI_EFA_USE_DEVICE_RDMA=1 -x 
> FI_PROVIDER="efa" -np 8 -hostfile hostfile5 --map-by node --mca 
> btl_base_verbose 30 --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 
> ./osu_mbw_mr
> This will run four pairs of send/recv calls across the different nodes. 
> hostfile5 contains all 8 local ipv4 addresses associated with the four nodes. 
> I believe this is why I am getting the expected performance.
> 
> So, now I want to run a real use case, but I can't use --map-by node. I want 
> to run two ranks per ipv4 address (interface), with the ranks ordered 
> sequentially according to the hostfile (the first 8 ranks will belong to the 
> first host, but the ranks will be divided among four ipv4 addresses to 
> utilize the full network bandwidth). But OpenMPI won't allow me to assign 
> slots=2 to each ipv4 address, because they all belong to the same host. 
> 
> Any recommendation would be greatly appreciated.
> 
> Thanks,
> John