This has been reported upstream to UCX here:

https://github.com/openucx/ucx/issues/5443

On 23/07/2020 06:33, Drew Parsons wrote:
Package: libopenmpi3
Version: 4.0.4-2
Followup-For: Bug #965352
Control: affects -1 src:scalapack

UCX seems to be affecting the scalapack build also:

87/96 Test #82: xdgsep ...........................   Passed  109.38 sec
       Start 95: xshseqr
88/96 Test #83: xcgsep ...........................   Passed  101.14 sec
       Start 96: xdhseqr
89/96 Test #96: xdhseqr ..........................***Failed   49.20 sec

  ScaLAPACK Test for PDHSEQR

  epsilon   =    1.1102230246251565E-016
  threshold =    30.000000000000000

  Residual and Orthogonality Residual computed by:

  Residual      =  || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )

  Orthogonality =  MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) /  (eps * N)

  Test passes if both residuals are less then threshold

     N  NB    P    Q  QR Time  CHECK
----- --- ---- ---- -------- ------
[1595480623.088652] [monte:1320201:0]           sock.c:344  UCX  ERROR recv(fd=28) failed: Bad address
[1595480623.199533] [monte:1320189:0]           sock.c:344  UCX  ERROR sendv(fd=30) failed: Connection reset by peer
[monte:1320189] *** An error occurred in MPI_Bcast
[monte:1320189] *** reported by process [1297350657,2]
[monte:1320189] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
[monte:1320189] *** MPI_ERR_OTHER: known error not in list
[monte:1320189] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[monte:1320189] ***    and potentially your MPI job)
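
For reference, the pass criterion printed in the test header above can be sketched in a few lines of plain Python. This is only an illustration of the two formulas as printed, not ScaLAPACK's own code: the helper names (matmul, frobenius, schur_residuals) and the 2x2 example matrices are made up for this sketch, and EPS and THRESHOLD are simply the values shown in the log. Note that the failing run never gets as far as this check, since it aborts in MPI_Bcast before any residual is printed.

```python
import math

EPS = 2.0 ** -53      # 1.1102230246251565E-016, the epsilon printed in the log
THRESHOLD = 30.0      # the threshold printed in the log


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def frobenius(a):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


def schur_residuals(a, q, t):
    """Return (Residual, Orthogonality) as defined in the test header."""
    n = len(a)
    qt = transpose(q)
    # Residual = || T - Q^T*A*Q ||_F / ( ||A||_F * eps * sqrt(N) )
    residual = (frobenius(sub(t, matmul(matmul(qt, a), q)))
                / (frobenius(a) * EPS * math.sqrt(n)))
    # Orthogonality = MAX( || I - Q^T*Q ||_F, || I - Q*Q^T ||_F ) / (eps * N)
    eye = identity(n)
    orthogonality = max(frobenius(sub(eye, matmul(qt, q))),
                        frobenius(sub(eye, matmul(q, qt)))) / (EPS * n)
    return residual, orthogonality


# Hypothetical 2x2 example: a symmetric matrix and a 45-degree rotation.
c = math.sqrt(0.5)
A = [[2.0, 1.0], [1.0, 2.0]]
Q = [[c, -c], [c, c]]
T = matmul(matmul(transpose(Q), A), Q)

res, orth = schur_residuals(A, Q, T)
passed = res < THRESHOLD and orth < THRESHOLD
print("residual:", res, "orthogonality:", orth, "passed:", passed)

# A deliberately non-orthogonal Q fails the orthogonality check.
Q_bad = [[1.0, 0.1], [0.0, 1.0]]
_, orth_bad = schur_residuals(A, Q_bad, matmul(matmul(transpose(Q_bad), A), Q_bad))
```

Both quantities are scaled by eps, so a value under 30 means the error is within a few tens of rounding units for the problem size.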

--
Alastair McKinstry, email: alast...@sceal.ie, matrix: @alastair:sceal.ie, 
phone: 087-6847928
Green Party Councillor, Galway County Council
