What happens if you run your "3 procs on two nodes" case using just microway1
and microway3 (i.e., omit microway2)?


On May 4, 2020, at 9:05 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi George,

10.71.2.58 is microway2 (which has been used in all of the configurations I've 
tried, so maybe that's why it appears to be the common denominator):

lid:/amd/home/jdelsign>host -l totalviewtech.com |grep microway
microway1.totalviewtech.com has address 10.71.2.52
microway2.totalviewtech.com has address 10.71.2.58
microway3.totalviewtech.com has address 10.71.2.55
lid:/amd/home/jdelsign>

All three systems are on the same Ethernet, and in fact they are probably all 
in the same rack. AFAIK, there is no firewall and there is no restriction on 
port 1024. There are configurations that work OK on those same three nodes:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>cat myhostfile
microway1 slots=16
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>

I haven't been able to find a combo where prun works. On the other hand, mpirun 
works in the above case, but not in the case where I put three processes on two 
nodes.
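
To rule out the firewall/routing question outside of MPI entirely, the next
thing I can try is a bare TCP probe from microway1 (where the failing connect()
originates) to the address and port from the warning. A rough sketch in plain
bash (10.71.2.58 and 1024 are just copied from the warning text; I haven't run
this yet):

# On microway1: attempt a raw TCP connect to 10.71.2.58:1024.
# "Connection refused" would still mean routing/firewall are fine
# (just nothing listening); "No route to host" would reproduce the error.
timeout 5 bash -c 'echo > /dev/tcp/10.71.2.58/1024' \
    && echo "connect succeeded" \
    || echo "connect failed (see the error bash prints above)"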

Cheers, John D.

On 2020-05-04 11:42, George Bosilca wrote:
John,

The common denominator across all these errors is a failed connect() while
trying to reach 10.71.2.58 on port 1024. Who is 10.71.2.58? Is the firewall
open? Is port 1024 allowed through?


  George.


On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <devel@lists.open-mpi.org> wrote:
Inline below...

On 2020-05-04 11:09, Ralph Castain via devel wrote:
Staring at this some more, I do have the following questions:

* in your first case, it looks like "prte" was started from microway3 - correct?

Yes, "prte" was started from microway3. 


* in the second case, that worked, it looks like "mpirun" was executed from 
microway1 - correct?
No, "mpirun" was executed from microway3.

* in the third case, you state that "mpirun" was again executed from microway3, 
and the process output confirms that
Yes, "mpirun" was started from microway3.

I'm wondering if the issue here might actually be that PRRTE expects the 
ordering of hosts in the hostfile to start with the host it is sitting on - 
i.e., if the node index number between the various daemons is getting confused. 
Can you perhaps see what happens with the failing cases if you put microway3 at 
the top of the hostfile and execute prte/mpirun from microway3 as before?

OK, the first failing case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3
microway3 slots=16
microway1 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway1
  PID:        292266
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway1:292266] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>

And the second failing test case:

mic:/amd/home/jdelsign/PMIx>pterm
pterm failed to initialize, likely due to no DVM being available
mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
microway3 slots=16
microway2 slots=16
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway3.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway3
  PID:        271144
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway3.totalviewtech.com:271144] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
Hello from proc (2): microway2.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>

So, AFAICT, host name order didn't matter.
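
Something else I can check next (just an idea at this point): what route each
node's kernel would actually use to reach 10.71.2.58, and which IPv4 addresses
each node has configured, in case these machines have more than one interface
and the TCP BTL is publishing or picking the wrong one. For example, run on
both microway1 and microway3 (standard iproute2 commands):

ip route get 10.71.2.58   # which interface/gateway the kernel would use
ip -4 addr show           # every IPv4 address configured on this node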

Cheers, John D.



 


On May 4, 2020, at 7:34 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi folks,

I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt.
I'm running a very simple test code on three CentOS 7.[56] nodes named
microway[123] over TCP. I'm seeing a fatal error similar to the following:

[microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL

The case of prun launching an OMPI code does not work correctly. The MPI
processes seem to launch OK, but the following OMPI error appears at the point
where the processes communicate. In the following case, I have a DVM running on
three nodes, "microway[123]":

mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name 
--personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway1
  PID:        282716
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway1:282716] 
../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: microway3
  Local PID:  214271
  Peer host:  microway1
--------------------------------------------------------------------------
mic:/amd/home/jdelsign/PMIx>

If I use mpirun to launch the program it works whether or not a DVM is already 
running (first without a DVM, then with a DVM):

mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile --daemonize
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>

But if I use mpirun to launch 3 processes from microway3 and use a hostfile
that contains only microway[23], I get a failure similar to the prun case:

mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>cat myhostfile2
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name 
--personality ompi --hostfile myhostfile2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway2.totalviewtech.com
Hello from proc (1): microway2.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: microway3
  PID:        227713
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
[microway2][[32270,1],1][../../../../../../ompi/opal/mca/btl/tcp/btl_tcp.c:566:mca_btl_tcp_recv_blocking] recv(13) failed: Connection reset by peer (104)
mic:/amd/home/jdelsign/PMIx>

I asked my therapist (Ralph) about it, and he said,

"It looks to me like the btl/tcp component is having trouble correctly 
selecting a route to use when opening communications across hosts. I've seen 
this in my docker setup too, but thought perhaps it was just a docker-related 
issue.

What's weird in your last example is that both procs are on the same node, and 
therefore they should only be using shared memory to communicate - the btl/tcp 
component shouldn't be trying to create a connection at all."
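
If it would help, I can rerun the failing case with more TCP BTL verbosity and
with the BTL pinned to the shared Ethernet interface, so we can see which
addresses each peer publishes and which one connect() actually targets. A
sketch of what I have in mind (the interface name "eth0" is a guess; I'd
substitute whatever the microway nodes really use):

mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi \
    --hostfile myhostfile2 \
    --mca btl_base_verbose 100 \
    --mca btl_tcp_if_include eth0 \
    ./tx_basic_mpi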

Cheers, John D.



