Hi,

With Open MPI 4.0.3 we are facing issues when the machines in the hostfile have 
differently named physical and virtual interfaces. We came across this 
behaviour during validation, and it looks like a bug to us. Since we have no 
control over customer hosts or third-party distributions, we can easily end up 
having to run MPI on hosts whose physical and virtual interfaces are named 
differently.

We are getting two types of error with such hosts. Sometimes the job hangs and 
prints the warning below:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: machine1
  PID:        5108

At other times it does not hang and prints this error instead:

[machine1][[32337,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_recv_connect_ack]
 received unexpected process identifier: got [[32337,1],1] expected 
[[32337,1],2]

Open MPI 4.0.3 works fine when all hosts have a single, consistently named 
physical interface.

We expect MPI to run correctly regardless of how the physical or virtual 
interfaces are named.

The executable is a simple C++ program that sets up point-to-point 
communication between the ranks and prints a message when each rank receives 
its data; a minimal sketch of what it does is included below.
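
This is only an illustrative sketch of the kind of test we run, not the actual 
MPItest2 source: rank 0 sends a small buffer to every other rank, and each 
rank prints a line when its message arrives. The command-line argument is 
taken as the number of points, matching the "./MPItest2 4" invocation.

// Minimal sketch (assumed behaviour, not the real MPItest2 source):
// rank 0 sends a small buffer to each other rank; receivers print a
// message and the receive time once the data arrives.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int name_len = 0;
    MPI_Get_processor_name(name, &name_len);
    std::printf("Hello world from processor %s, rank %d out of %d processors\n",
                name, rank, size);

    // Number of points to exchange; defaults to 4 if no argument is given.
    const int npoints = (argc > 1) ? std::atoi(argv[1]) : 4;
    std::vector<double> buf(npoints, static_cast<double>(rank));

    if (rank == 0) {
        // Rank 0 distributes the buffer to every other rank.
        for (int dest = 1; dest < size; ++dest)
            MPI_Send(buf.data(), npoints, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        // Each other rank reports when its message has been received.
        double t0 = MPI_Wtime();
        MPI_Recv(buf.data(), npoints, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("Process rank %d on host %s received %d points in %.3f secs\n",
                    rank, name, npoints, MPI_Wtime() - t0);
    }

    MPI_Finalize();
    return 0;
}

A sketch like this builds with mpicxx and can be launched with the mpirun 
command shown below.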

Please find below the command used to run MPI, along with the hostfile, the 
error outputs, and the interfaces on each host:

mpirun command:

mpirun --hostfile hostfile_new --merge-stderr-to-stdout --output-filename 
./mpi_master_test/out:NOCOPY --bind-to none --report-bindings -N 2 -n 4 
./MPItest2 4

Hostfile:

XXXXX@XXXX[397] cat hostfile_new
machine1 slots=2
machine2 slots=6

Error output 1:

XXXXX@XXXX[XXX] cat out.*
[machine1][[32337,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_recv_connect_ack]
 received unexpected process identifier: got [[32337,1],1] expected 
[[32337,1],2]
[machine1:30855] pml_ob1_sendreq.c:189 FATAL
Hello world from processor machine2, rank 0 out of 4 processors
+++ Number of points: 4 Data size: 0.00(MB)
Hello world from processor machine2, rank 1 out of 4 processors
*** machine1 Receive time: 0.000 secs
=================== Process Rank 1 on Host machine1 Receiving 
===================
=================================================================
Hello world from processor machine2, rank 2 out of 4 processors
Hello world from processor machine2, rank 3 out of 4 processors

Error output 2:

Hello world from processor machine1, rank 0 out of 4 processors
+++ Number of points: 4 Data size: 0.00(MB)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: machine1
  PID:        5108
--------------------------------------------------------------------------
Hello world from processor machine1, rank 1 out of 4 processors
*** machine1 Receive time: 0.000 secs
=================== Process Rank 1 on Host machine1 Receiving 
===================
=================================================================
Hello world from processor machine2, rank 2 out of 4 processors
Hello world from processor machine2, rank 3 out of 4 processors

Interfaces on machine1:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         XXXXXX-XXX 0.0.0.0         UG        0 0          0 eno1
137.XXX.XXX.0   0.0.0.0         255.255.255.0   U         0 0          0 eno1
192.XXX.XXX.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr2

Interfaces on machine2:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
default         xxxxxx-xxxx 0.0.0.0         UG        0 0          0 eth0
137.XXX.XXX.0   0.0.0.0         255.255.255.0   U         0 0          0 eth0
xxxx-xxxx      0.0.0.0         255.255.0.0     U         0 0          0 eth0
192.XXX.XXX.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr1

As you can see, for experimentation we used two machines whose physical and 
virtual interfaces are named differently (eno1/virbr2 on machine1 vs. 
eth0/virbr1 on machine2).

Please let us know if you have any questions.

Thanks & Regards
Ashutosh Singh
