Hi,

With Open MPI 4.0.3 we are seeing failures when the machines in the hostfile have differently named physical and virtual network interfaces. We came across this behaviour during validation, and it looks like a bug to us: we have no control over customer hosts or third-party distributions, so we can easily end up having to run MPI across hosts whose physical and virtual interfaces are named differently.
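To make "differently named" concrete, here is a minimal Python sketch (not part of our test setup) that compares the routing tables of two hosts and lists the destinations whose interface name differs. The function names `ifaces_by_destination` and `name_mismatches` are hypothetical, and the rows are copied from the masked routing tables quoted later in this report, so destinations are compared as masked strings only; a match does not prove the underlying subnets are really the same.

```python
# Rows copied from the (masked) routing tables quoted later in this report.
MACHINE1_ROUTES = """\
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
default XXXXXX-XXX 0.0.0.0 UG 0 0 0 eno1
137.XXX.XXX.0 0.0.0.0 255.255.255.0 U 0 0 0 eno1
192.XXX.XXX.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr2
"""

MACHINE2_ROUTES = """\
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
default xxxxxx-xxxx 0.0.0.0 UG 0 0 0 eth0
137.XXX.XXX.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
xxxx-xxxx 0.0.0.0 255.255.0.0 U 0 0 0 eth0
192.XXX.XXX.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr1
"""


def ifaces_by_destination(route_output):
    """Map each routing-table destination to the set of interface names."""
    table = {}
    for line in route_output.strip().splitlines():
        fields = line.split()
        # Data rows have 8 columns; skip the title line and the header row.
        if len(fields) != 8 or fields[0] == "Destination":
            continue
        table.setdefault(fields[0], set()).add(fields[-1])
    return table


def name_mismatches(a, b):
    """Destinations listed on both hosts whose interface names differ."""
    return {
        dest: (sorted(a[dest]), sorted(b[dest]))
        for dest in sorted(a.keys() & b.keys())
        if a[dest] != b[dest]
    }


if __name__ == "__main__":
    m1 = ifaces_by_destination(MACHINE1_ROUTES)
    m2 = ifaces_by_destination(MACHINE2_ROUTES)
    for dest, (on_m1, on_m2) in name_mismatches(m1, m2).items():
        print(f"{dest}: machine1={on_m1} machine2={on_m2}")
```

On our two machines every destination that appears on both hosts is reached through a differently named interface, for the physical NIC as well as for the virtual bridge.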
We see two kinds of failure on hosts that have both physical and virtual interfaces.

Sometimes the job hangs and prints the following warning:

WARNING: Open MPI accepted a TCP connection from what appears to be a another Open MPI process but cannot find a corresponding process entry for that peer.
This attempted connection will be ignored; your MPI job may or may not continue properly.
Local host: machine1
PID: 5108

Other times it does not hang and prints this error instead:

[machine1][[32337,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[32337,1],1] expected [[32337,1],2]

Open MPI 4.0.3 works fine across hosts that each have a single, consistently named physical interface. We expect MPI to run independently of how the physical or logical interfaces are named.

The executable is a simple C++ test program that sets up communication between multiple cores and prints each message when the receiving core gets it. The mpirun command, the hostfile, the two error outputs, and the interfaces on each host are listed below.

Mpi command:

mpirun --hostfile hostfile_new --merge-stderr-to-stdout --output-filename ./mpi_master_test/out:NOCOPY --bind-to none --report-bindings -N 2 -n 4 ./MPItest2 4

Hostfile:

XXXXX@XXXX[397] cat hostfile_new
machine1 slots=2
machine2 slots=6

error output1:

XXXXX@XXXX[XXX] cat out.*
[machine1][[32337,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier: got [[32337,1],1] expected [[32337,1],2]
[machine1:30855] pml_ob1_sendreq.c:189 FATAL
Hello world from processor machine2, rank 0 out of 4 processors
+++ Number of points: 4 Data size: 0.00(MB)
Hello world from processor machine2, rank 1 out of 4 processors
*** machine1 Receive time: 0.000 secs
=================== Process Rank 1 on Host machine1 Receiving ===================
=================================================================
Hello world from processor machine2, rank 2 out of 4 processors
Hello world from processor machine2, rank 3 out of 4 processors

error output2:

Hello world from processor machine1, rank 0 out of 4 processors
+++ Number of points: 4 Data size: 0.00(MB)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a another Open MPI process but cannot find a corresponding process entry for that peer.
This attempted connection will be ignored; your MPI job may or may not continue properly.
Local host: machine1
PID: 5108
--------------------------------------------------------------------------
Hello world from processor machine1, rank 1 out of 4 processors
*** machine1 Receive time: 0.000 secs
=================== Process Rank 1 on Host machine1 Receiving ===================
=================================================================
Hello world from processor machine2, rank 2 out of 4 processors
Hello world from processor machine2, rank 3 out of 4 processors

Interface on machine1:

Kernel IP routing table
Destination     Gateway      Genmask         Flags  MSS  Window  irtt  Iface
default         XXXXXX-XXX   0.0.0.0         UG     0    0       0     eno1
137.XXX.XXX.0   0.0.0.0      255.255.255.0   U      0    0       0     eno1
192.XXX.XXX.0   0.0.0.0      255.255.255.0   U      0    0       0     virbr2

Interface on machine2:

Kernel IP routing table
Destination     Gateway      Genmask         Flags  MSS  Window  irtt  Iface
default         xxxxxx-xxxx  0.0.0.0         UG     0    0       0     eth0
137.XXX.XXX.0   0.0.0.0      255.255.255.0   U      0    0       0     eth0
xxxx-xxxx       0.0.0.0      255.255.0.0     U      0    0       0     eth0
192.XXX.XXX.0   0.0.0.0      255.255.255.0   U      0    0       0     virbr1

As you can see, for this experiment we used two machines whose physical and virtual interfaces have different names. Please let us know if you have any questions.

Thanks & Regards
Ashutosh Singh