A suggestion: check your MKL version, as there is an MKL bug that was recently fixed; see
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Problem-with-LAPACK-subroutine-ZHEEVR-input-array-quot-isuppz/td-p/1150816

-----
Professor Laurence Marks
"Research is to see what everybody else has seen, and to think what nobody else has thought" - Albert Szent-Györgyi
www.numis.northwestern.edu
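[Editorial aside: one hedged way to check which MKL version is installed is to read the version header. The MKLROOT default path below is an assumption; oneAPI layouts differ (e.g. /opt/intel/oneapi/mkl/latest), so adjust to your installation.]

```shell
# Sketch only: the fallback path is an assumption; adjust MKLROOT
# to your installation.
hdr="${MKLROOT:-/opt/intel/mkl}/include/mkl_version.h"
if [ -r "$hdr" ]; then
    # The version macros (major/minor/update) live in mkl_version.h.
    grep -E '__INTEL_MKL__|__INTEL_MKL_MINOR__|__INTEL_MKL_UPDATE__' "$hdr"
else
    echo "mkl_version.h not found; is MKLROOT set?"
fi
```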
On Thu, Aug 19, 2021, 06:45 Peter Blaha <pbl...@theochem.tuwien.ac.at> wrote:
> I'm still on vacation, so I cannot test myself.
>
> However, I have experienced such problems before. It has to do with
> multithreading (1 thread always works fine) and the MKL routine zheevr.
>
> In my case I could fix the problem by enlarging the workspace beyond
> what the routine calculates itself (see the comment in hmsec on line 841).
> Right below, the workspace was enlarged by a factor of 10, which fixed my
> problem. But I can easily envision that it might not be enough in some
> other cases.
>
> An alternative is to switch back to zheevx (commented out in the code).
>
> Peter Blaha
>
> On 18.08.2021 at 20:01, Pavel Ondračka wrote:
> > Right, I think the reason deallocate is failing because the memory
> > has been corrupted at some earlier point is quite clear; the only other
> > option why it should crash would be that it was not allocated at all,
> > which seems not to be the case here... The question is what corrupted
> > the memory, and even stranger is why it works if we disable MKL
> > multithreading.
> >
> > It could indeed be that we are doing something wrong. I can imagine the
> > memory could be corrupted in some BLAS call: if the number of
> > columns/rows passed to a specific BLAS call is larger than the actual
> > size of the matrix, then this could easily happen (and the
> > multithreading somehow influences the final value of the corrupted
> > memory, and depending on that value the deallocate could fail or
> > pass). This should be possible to diagnose with valgrind as suggested.
> >
> > Luis, can you upload the testcase somewhere, or recompile with
> > debuginfo as suggested by Laurence earlier, run
> > "valgrind --track-origins=yes lapwso lapwso.def" and send the output?
> > Just be warned, there is a massive slowdown with valgrind (up to 100x)
> > and the logfile can get very large.
> >
> > Best regards
> > Pavel
> >
> > On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
> >> Correction, I was looking at an older modules.F. It looks like it
> >> should be
> >>
> >>     DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0) write(*,*) IV
> >>
> >> On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
> >> <laurence.ma...@gmail.com> wrote:
> >>> I do wonder about this. I suggest editing modules.F and changing
> >>> lines 118 and 119 to
> >>>     DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0) write(*,*) 'Err en ', Ien
> >>>     DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0) write(*,*) 'Err vnorm ', Ivn
> >>>
> >>> There is every chance that the bug is not in those lines, but
> >>> somewhere completely different. SIGSEGV often means that memory
> >>> has been overwritten, for instance by arrays going out of bounds.
> >>>
> >>> You can also recompile with -g added (don't change other options),
> >>> and/or -C. Sometimes this is better. Or use other tools such as
> >>> debuggers or valgrind.
> >>>
> >>> On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
> >>> <pavel.ondra...@email.cz> wrote:
> >>>> I'm CCing the list back, as the crash has now been diagnosed as a
> >>>> likely MKL problem; see below for more details.
> >>>>>
> >>>>>> So just to be clear, explicitly setting OMP_STACKSIZE=1g does
> >>>>>> not help to solve the issue?
> >>>>>
> >>>>> Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve
> >>>>> the problem!
> >>>>>
> >>>>>> The problem is that the OpenMP code in lapwso is very simple,
> >>>>>> so I'm having trouble seeing how it could be causing the problems.
> >>>>>>
> >>>>>> Could you also try to see what happens if you run with:
> >>>>>>     OMP_NUM_THREADS=1
> >>>>>>     MKL_NUM_THREADS=4
> >>>>>
> >>>>> It does not work with these values, but I checked and it works
> >>>>> with them reversed:
> >>>>>     OMP_NUM_THREADS=4
> >>>>>     MKL_NUM_THREADS=1
> >>>> This was very helpful and IMO points to a problem with MKL instead
> >>>> of WIEN2k.
> >>>>
> >>>> Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the
> >>>> OpenMP performance, mostly in lapw1 but also in other places. So if
> >>>> you want to keep the OpenMP parallelism at the BLAS/LAPACK level,
> >>>> you have to either find an MKL version that works (if you do,
> >>>> please report it here), link with OpenBLAS (using it just for
> >>>> lapwso is enough), or create a simple wrapper that sets
> >>>> MKL_NUM_THREADS=1 just for lapwso, i.e., rename the lapwso binary
> >>>> in WIENROOT to lapwso_bin and create a new lapwso file there with:
> >>>>
> >>>>     #!/bin/bash
> >>>>     MKL_NUM_THREADS=1 lapwso_bin $1
> >>>>
> >>>> and set it executable with "chmod +x lapwso".
> >>>>
> >>>> Or maybe MKL has a non-OpenMP (sequential) version which you could
> >>>> link with just lapwso while using the standard one in the other
> >>>> parts, but I don't know; I mostly use OpenBLAS. If you need some
> >>>> further help, let me know.
> >>>>
> >>>> Reporting the issue to Intel could also be nice; however, I never
> >>>> had any real luck there, and it is also a bit problematic as you
> >>>> can't provide a testcase due to WIEN2k being proprietary code...
> >>>>
> >>>> Best regards
> >>>> Pavel
> >>>>
> >>>>>
> >>>>>> This should disable the WIEN2k-specific OpenMP parallelism but
> >>>>>> still keep the rest of the parallelism at the BLAS/LAPACK level.
> >>>>>
> >>>>> So, perhaps, the problem is related to MKL!
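[Editorial aside: the wrapper idea above can be sketched end-to-end. The demo below is self-contained and uses a hypothetical stand-in for lapwso_bin so it runs anywhere; in a real installation, lapwso_bin is the renamed WIEN2k lapwso binary in $WIENROOT, and the wrapper forwards all arguments rather than just $1.]

```shell
# Self-contained sketch of the wrapper approach. The stand-in
# lapwso_bin below is hypothetical; in practice you would rename the
# real lapwso binary in $WIENROOT instead.
demo=$(mktemp -d)
cd "$demo"

# Stand-in for the renamed real binary, so the sketch can run anywhere;
# it just reports the MKL thread setting and its arguments:
printf '#!/bin/bash\necho "MKL_NUM_THREADS=$MKL_NUM_THREADS args=$*"\n' > lapwso_bin
chmod +x lapwso_bin

# The wrapper that takes the place of the original lapwso; it pins MKL
# to one thread for this step only and forwards all arguments:
printf '#!/bin/bash\nMKL_NUM_THREADS=1 exec "$(dirname "$0")"/lapwso_bin "$@"\n' > lapwso
chmod +x lapwso

./lapwso lapwso.def   # prints: MKL_NUM_THREADS=1 args=lapwso.def
```

Because the wrapper sets the variable only on the exec'd command's environment, the rest of the WIEN2k workflow keeps its normal MKL threading.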
> >>>>>
> >>>>>> Another option is that something is going wrong before lapwso,
> >>>>>> and the lapwso crash is just the symptom. What happens if you
> >>>>>> run everything up to lapwso without OpenMP (OMP_NUM_THREADS=1)
> >>>>>> and then enable it just for lapwso?
> >>>>>
> >>>>> If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change
> >>>>> it to 1 just before lapwso, it works.
> >>>>> If I do the opposite, starting with OMP_NUM_THREADS=1 and then
> >>>>> changing it to 4 just before lapwso, it does not work.
> >>>>> So I believe that the problem is really in lapwso.
> >>>>>
> >>>>> If you need more information, please let me know!
> >>>>> All the best,
> >>>>> Luis
> >>>>
> >>>> _______________________________________________
> >>>> Wien mailing list
> >>>> Wien@zeus.theochem.tuwien.ac.at
> >>>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >>>> SEARCH the MAILING-LIST at:
> >>>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> >>>
> >>> --
> >>> Professor Laurence Marks
> >>> Department of Materials Science and Engineering
> >>> Northwestern University
> >>> http://www.numis.northwestern.edu
> >>> "Research is to see what everybody else has seen, and to think what
> >>> nobody else has thought" - Albert Szent-Györgyi
>
> --
> -----------------------------------------------------------------------
> Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-158801165300
> Email: peter.bl...@tuwien.ac.at
> WWW: http://www.imc.tuwien.ac.at
> WIEN2k: http://www.wien2k.at
> -------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html