Re: [Wien] lapw2_mpi crashes during TB-mBJ calculations for WIEN2k_23.2
Dear Prof. Blaha,

Thank you for your quick response and the fixed code. Calculations with the -tau switch now work fine.

Best regards,
Hitoshi

On 2023/06/19 2:26, Peter Blaha wrote:
> Thank you very much for the report.
>
> I can confirm the problem, not only for mpi-calculations and mbj, but also for
> meta-GGAs. The -tau switch causes the problem in lapw2_mpi.
>
> An endif statement was placed in the wrong place, causing the problem.
>
> Attached is a new l2main.F subroutine (gzipped as l2main.F.gz), which should be
> put into $WIENROOT/SRC_lapw2. Change into this directory and type
>
>   gunzip l2main.F.gz
>   make all
>   cp lapw2 lapw2c lapw2_mpi lapw2c_mpi ..
>
> Regards
> Peter Blaha
Re: [Wien] lapw2_mpi crashes during TB-mBJ calculations for WIEN2k_23.2
Thank you very much for the report.

I can confirm the problem, not only for mpi-calculations and mbj, but also for meta-GGAs. The -tau switch causes the problem in lapw2_mpi.

An endif statement was placed in the wrong place, causing the problem.

Attached is a new l2main.F subroutine (gzipped as l2main.F.gz), which should be put into $WIENROOT/SRC_lapw2. Change into this directory and type

  gunzip l2main.F.gz
  make all
  cp lapw2 lapw2c lapw2_mpi lapw2c_mpi ..

Regards
Peter Blaha
--
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at    WIEN2k: http://www.wien2k.at

Attachment: l2main.F.gz (GNU Zip compressed data)
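For readers who want to see how a misplaced endif can end in the "Message truncated" aborts reported below, here is a minimal, generic Fortran/MPI sketch. It is not taken from l2main.F; the program name, variable names, message sizes and the two-rank exchange are invented purely for illustration. The idea: if a small send on one rank accidentally ends up inside an IF block and gets skipped, a 1-integer MPI_Recv posted with MPI_ANY_TAG on the other rank can match a larger message that was meant for a later receive, and the MPI library then aborts with MPI_ERR_TRUNCATE ("Message truncated").

  ! demo_truncate.f90 -- hypothetical illustration, NOT WIEN2k source code.
  ! Rank 1 is supposed to send a 1-integer header followed by a larger
  ! array; a misplaced "endif" makes the header send conditional and it
  ! gets skipped.  Rank 0 still posts the small receive first, with
  ! MPI_ANY_TAG, so it matches the big message instead and the default
  ! error handler aborts with "Message truncated" (MPI_ERR_TRUNCATE).
  program demo_truncate
    use mpi
    implicit none
    integer, parameter :: nbig = 64
    integer :: ierr, myid, header, bigbuf(nbig)
    integer :: istat(MPI_STATUS_SIZE)
    logical :: cond

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
    cond = .false.                  ! stand-in for some run-time switch

    if (myid == 1) then
       if (cond) then
          header = nbig
          call MPI_Send(header, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
       endif                        ! <-- endif in the wrong place: header send skipped
       bigbuf = 0
       call MPI_Send(bigbuf, nbig, MPI_INTEGER, 0, 2, MPI_COMM_WORLD, ierr)
    else if (myid == 0) then
       ! Rank 0 unconditionally expects the 1-integer header first.
       call MPI_Recv(header, 1, MPI_INTEGER, 1, MPI_ANY_TAG, &
                     MPI_COMM_WORLD, istat, ierr)
       call MPI_Recv(bigbuf, nbig, MPI_INTEGER, 1, MPI_ANY_TAG, &
                     MPI_COMM_WORLD, istat, ierr)
    endif

    call MPI_Finalize(ierr)
  end program demo_truncate

Built with an MPI Fortran wrapper (e.g. mpiifort demo_truncate.f90) and run on 2 processes, this dies with the same "Fatal error in PMPI_Recv: ... Message truncated" signature as in the report, while moving the endif above the header send (so that sends and receives stay matched) lets it finish normally. This is only an illustration of the failure mode; the corrected l2main.F attached above is the actual fix.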
[Wien] lapw2_mpi crashes during TB-mBJ calculations for WIEN2k_23.2
Dear WIEN2k developers and users,

I would like to share the following situation I have encountered with WIEN2k_23.2. WIEN2k_23.2 works fine for me, except that lapw2_mpi crashes during TB-mBJ calculations run with MPI parallelization. First, I have performed TB-mBJ calculations for some oxides, such as MgO and TiO2, using WIEN2k_21.1 with MPI parallelization without any problems. The results, e.g., the corrected band gaps, are also excellent. Standard SCF calculations with WIEN2k_23.2, including MPI-parallel runs, are also fine.

Meanwhile, after init_mbj (the -tau switch is now on for lapw2), MPI-parallel calculations with WIEN2k_23.2 always crash during the first lapw2 step. The crash is reproducible for every case.struct I tested, including the TiO2 example from the WIEN2k website. It should also be noted that serial or k-point-parallel (without MPI) TB-mBJ calculations run fine in the same WIEN2k_23.2 environment. The error messages from the lapw2_mpi crashes are as follows:

lapw2.error:
**  testerror: Error in Parallel LAPW2
lapw2_i.error:
Error in LAPW2

So this crash appears to be a sudden death of the MPI processes; STDOUT actually shows the following MPI error messages for 4 MPI processes:

...
LAPW1 END
LAPW1 END
LAPW1 END
LAPW1 END
LAPW2 - FERMI; weights written
Abort(805421582) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x7ffdcd24c678, count=1, MPI_INTEGER, src=1, tag=MPI_ANY_TAG, comm=0x8405, status=0x2ae8b59a3fe0) failed
(unknown)(): Message truncated
Abort(67224078) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x7ffc824e7f78, count=1, MPI_INTEGER, src=1, tag=MPI_ANY_TAG, comm=0x8405, status=0x2af2d64d3fe0) failed
(unknown)(): Message truncated
Abort(939639310) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x7ffea8ea88f8, count=1, MPI_INTEGER, src=1, tag=MPI_ANY_TAG, comm=0x8405, status=0x2b513d417fe0) failed
(unknown)(): Message truncated
Abort(402768398) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x7ffde26becf8, count=1, MPI_INTEGER, src=1, tag=MPI_ANY_TAG, comm=0x8405, status=0x2b92d0adffe0) failed
(unknown)(): Message truncated
...

I thought this might only be the case for my cluster:
12 nodes x Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, Linux 3.10.0-1160.el7.x86_64
Intel compilers (2021.7.1 20221019)
Intel MPI libraries (Intel(R) MPI Library for Linux* OS, Version 2021.7 Build 20221022)

So I compiled WIEN2k_23.2 on a different cluster with different versions of the Intel compilers and MPI libraries (ifort (IFORT) 19.1.3.304 20200925 and Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923). The results are exactly the same, i.e., no problem for TB-mBJ calculations with MPI parallelization under WIEN2k_21.1, but lapw2 always crashes under WIEN2k_23.2, and only when TB-mBJ calculations are run with MPI parallelization (no problem for serial or k-point-parallel (no MPI) runs). Again, this crash is reproducible for every oxide case.struct I tested.

I would greatly appreciate any comments and suggestions to solve this problem.

Best regards,

Dr. Hitoshi Takamura
Tohoku Univ., Japan
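As a side note on reading the log: "Message truncated" is MPI's generic complaint that an incoming message is longer than the buffer handed to MPI_Recv, independent of WIEN2k. A minimal, self-contained Fortran sketch (hypothetical names, unrelated to the WIEN2k sources) that produces exactly this abort when run on 2 MPI processes:

  ! truncate_min.f90 -- hypothetical minimal reproducer of MPI_ERR_TRUNCATE.
  ! Rank 1 sends 4 integers; rank 0 provides room for only 1, so the
  ! default error handler aborts with "Fatal error in PMPI_Recv:
  ! Message truncated", the same text seen in the lapw2_mpi output.
  program truncate_min
    use mpi
    implicit none
    integer :: ierr, myid, small, big(4)
    integer :: istat(MPI_STATUS_SIZE)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
    if (myid == 1) then
       big = 0
       call MPI_Send(big, 4, MPI_INTEGER, 0, 7, MPI_COMM_WORLD, ierr)
    else if (myid == 0) then
       call MPI_Recv(small, 1, MPI_INTEGER, 1, MPI_ANY_TAG, &
                     MPI_COMM_WORLD, istat, ierr)   ! buffer too small -> abort
    endif
    call MPI_Finalize(ierr)
  end program truncate_min

It only illustrates what this error class means; it says nothing about where the mismatch originates inside lapw2_mpi.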