Hi all,
I'm seeing a few isolated test failures in the test-suite, along with a general
OpenFabrics initialization error, and I'd like to understand why they happen
and whether they can safely be ignored. With a serial build using gfortran
13.1.0, every test that is not skipped passes. The failures appear only when I
switch to a parallel build using openmpi/4.1.6. Can anyone point me toward how
to get a robust parallel build? Thanks in advance!
Some details on my configuration:
GCC/gfortran 13.1.0
QE 7.4.1
Open MPI 4.1.6
Red Hat Enterprise Linux 8
QE internal BLAS and LAPACK
Test command: make run-tests NPROCS=12
Many of the tests print messages like the following, even when they pass:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: pn5657
Local device: mlx5_0
--------------------------------------------------------------------------
Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
[pn5657:3197139] 11 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
[pn5657:3197139] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages
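The last line of that output suggests setting the MCA parameter orte_base_help_aggregate to 0 to see every help message. Below is a minimal Python sketch of one way to do that. It assumes Open MPI's usual convention of reading MCA parameters from environment variables named OMPI_MCA_<parameter>. The helper name env_with_full_mpi_help is made up for illustration and is not part of QE or Open MPI.

```python
import os


def env_with_full_mpi_help(base_env=None):
    """Return a copy of the environment with help-message aggregation disabled.

    Hypothetical helper: it only prepares the environment, it does not run anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: Open MPI picks up MCA parameters from OMPI_MCA_<name> variables.
    env["OMPI_MCA_orte_base_help_aggregate"] = "0"
    return env


# Small demonstration on a toy environment; the input mapping is left unmodified.
original = {"PATH": "/usr/bin"}
env = env_with_full_mpi_help(original)
print(sorted(env))
```

The returned mapping could then be passed as env= to subprocess.run(["make", "run-tests", "NPROCS=12"], ...) from the test-suite directory. This assumes the test-suite hands the environment through to mpirun unchanged, which I have not verified.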
Here are the tests that are failing:
1. pw_plugins - plugin-pw2casino_1.in (arg(s): 1): **FAILED**.
   Different sets of data extracted from benchmark and test.
   Data only in benchmark: p1.
   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   Error in routine pw2casino (1):
   pool/band/image parallelization not (yet) implemented
   %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   stopping ...

2. pw_vdw - xdm.in: **FAILED**.
   ef1
   ERROR: absolute error 5.62e-01 greater than 8.00e-02. (Test: 10.7872.
   Benchmark: 10.2253.)
   ERROR: relative error 5.50e-02 greater than 2.00e-02. (Test: 10.7872.
   Benchmark: 10.2253.)

3. cp_al_edft - Al.uspp.in: **FAILED**.
   t1
   ERROR: absolute error 1.75e-02 greater than 6.00e-03. (Test: 159.46581.
   Benchmark: 159.44833.)

4. ph_1d - ch4.scf.in (arg(s): 1): **FAILED**.
   n1
   ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 32.0.
   Benchmark: 26.0.)

5. /hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 2 Fe.scf.in
   test.out.070425-2.inp=Fe.scf.in.args=2 test.err.070425-2.inp=Fe.scf.in.args=2
   Running PW ...
   mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/pw.x < Fe.scf.in >
   test.out.070425-2.inp=Fe.scf.in.args=2 2> test.err.070425-2.inp=Fe.scf.in.args=2
   hp_metal_paw_magn - Fe.scf.in (arg(s): 2): **FAILED**.
   n1
   ERROR: absolute error 6.00e+00 greater than 5.00e+00. (Test: 31.0.
   Benchmark: 25.0.)

6. /hpc/data/idunn/qe/7.4.1/test-suite/..//test-suite/run-hp.sh 4 bn.hp.in
   test.out.070425-2.inp=bn.hp.in.args=4 test.err.070425-2.inp=bn.hp.in.args=4
   Running HP ...
   mpirun -np 12 /hpc/data/idunn/qe/7.4.1/test-suite/..//bin/hp.x < bn.hp.in >
   test.out.070425-2.inp=bn.hp.in.args=4 2> test.err.070425-2.inp=bn.hp.in.args=4
   hp_soc_UV_paw_magn - bn.hp.in (arg(s): 4): **FAILED**.
   v2
   ERROR: absolute error 1.37e-02 greater than 1.50e-03. (Test: -0.1254.
   Benchmark: -0.1117.)
   ERROR: relative error 1.23e-01 greater than 1.80e-04. (Test: -0.1254.
   Benchmark: -0.1117.)
   v1
   ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294.
   Benchmark: 4.7069.)
   ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294.
   Benchmark: 4.7069.)
   u
   ERROR: absolute error 1.72e+00 greater than 1.50e-03. (Test: 6.4294.
   Benchmark: 4.7069.)
   ERROR: relative error 3.66e-01 greater than 1.20e-04. (Test: 6.4294.
   Benchmark: 4.7069.)

7. It seems all the KCW tests that need the kcw executable are failing with
   error messages like:
   mpirun was unable to launch the specified application as it could not access
   or execute an executable:
   Executable: /hpc/data/sm-euv_rs/idunn/qe/7.4.1/test-suite/..//bin/kcw.x
   Node: pn5657
   while attempting to start process rank 0.
   I'm not sure why kcw.x isn't in the bin folder.
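To make the sizes of the numeric failures above easier to judge, here is a minimal Python sketch that recomputes the reported errors from the Test and Benchmark values. It is not part of the QE test-suite. The convention that the relative error is the absolute error divided by the magnitude of the benchmark is inferred from the numbers in the output above, not taken from the test-suite source, and the function name errors is only illustrative.

```python
def errors(test, benchmark):
    """Return (absolute, relative) error.

    Assumption: the relative error is taken against the benchmark value,
    which reproduces the figures printed in the failures above.
    """
    absolute = abs(test - benchmark)
    relative = absolute / abs(benchmark)
    return absolute, relative


# (label, test value, benchmark value, absolute tolerance), copied from the output above
cases = [
    ("pw_vdw xdm.in ef1", 10.7872, 10.2253, 8.00e-02),
    ("cp_al_edft Al.uspp.in t1", 159.46581, 159.44833, 6.00e-03),
    ("ph_1d ch4.scf.in n1", 32.0, 26.0, 5.00e+00),
    ("hp_soc_UV_paw_magn bn.hp.in v1", 6.4294, 4.7069, 1.50e-03),
]

for label, test, benchmark, abs_tol in cases:
    absolute, relative = errors(test, benchmark)
    status = "FAILED" if absolute > abs_tol else "ok"
    print(f"{label}: abs={absolute:.2e} rel={relative:.2e} tol={abs_tol:.2e} -> {status}")
```

Seen this way, the cp_al_edft and ph_1d failures exceed their tolerances by a modest factor, while the hp_soc_UV_paw_magn values differ from the benchmark by roughly 37%.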
Best regards,
Ian Dunn (he/him)
ASML Wilton MDEV Analysis Architect
--------------------------------------------------------------------------------
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users