I'm still on vacation, so I cannot test this myself.

However, I have experienced such problems before. It has to do with multithreading (1 thread always works fine) and the MKL routine zheevr.

In my case I could fix the problem by enlarging the workspace beyond what the routine calculates itself (see the comment in hmsec on line 841).

Right below that, the workspace is enlarged by a factor of 10, which fixed my problem. But I can easily envision that this might not be enough in some other cases.

An alternative is to switch back to zheevx (commented out in the code).
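
For reference, the pattern looks roughly like the following. This is only a sketch of the workaround, with placeholder names and a toy matrix, not the actual hmsec code:

      program zheevr_workspace_sketch
      ! Sketch only: do the usual LWORK=-1 workspace query first, then
      ! allocate 10x what ZHEEVR reports as optimal, as a workaround
      ! for the threaded-MKL crashes discussed above.
      implicit none
      integer, parameter :: n = 4
      complex*16 :: a(n,n), z(n,n), wq(1)
      real*8     :: w(n), rwq(1), abstol
      integer    :: isuppz(2*n), iwq(1), m, info
      integer    :: lwork, lrwork, liwork, i, j
      complex*16, allocatable :: work(:)
      real*8,     allocatable :: rwork(:)
      integer,    allocatable :: iwork(:)

      ! a small Hermitian test matrix
      do j = 1, n
         do i = 1, n
            a(i,j) = dcmplx(dble(min(i,j)), 0.d0)
         end do
      end do
      abstol = 0.d0

      ! workspace query
      call zheevr('V', 'A', 'U', n, a, n, 0.d0, 0.d0, 1, n, abstol, &
                  m, w, z, n, isuppz, wq, -1, rwq, -1, iwq, -1, info)

      ! enlarge beyond what the routine asks for (factor-10 workaround)
      lwork  = 10*int(wq(1))
      lrwork = 10*int(rwq(1))
      liwork = 10*iwq(1)
      allocate(work(lwork), rwork(lrwork), iwork(liwork))

      call zheevr('V', 'A', 'U', n, a, n, 0.d0, 0.d0, 1, n, abstol, &
                  m, w, z, n, isuppz, work, lwork, rwork, lrwork,   &
                  iwork, liwork, info)
      write(*,*) 'info =', info, '  eigenvalues:', w(1:m)
      end program zheevr_workspace_sketch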

Peter Blaha

On 18.08.2021 at 20:01, Pavel Ondračka wrote:
Right, I think it is quite clear that the deallocate is failing because the memory was corrupted at some earlier point; the only other reason it could crash would be that the array was never allocated at all, which does not seem to be the case here... The question is what corrupted the memory, and even stranger: why does it work if we disable MKL multithreading?

It could indeed be that we are doing something wrong. I can imagine the memory getting corrupted in some BLAS call: if the number of columns/rows passed to a specific BLAS call is larger than the actual size of the matrix, then this could easily happen (and the multithreading somehow influences the final value of the corrupted memory, and depending on that final value the deallocate either fails or somehow passes). This should be possible to diagnose with valgrind as suggested.
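
Just to illustrate the kind of bug I mean, here is a made-up sketch (not code from lapwso): the matrices are allocated as ndim x ndim, but the BLAS call is told there are ndim+10 columns, so zgemm reads past the end of b and writes past the end of c, silently corrupting the heap; a later deallocate may or may not crash depending on what got overwritten:

      program blas_overrun_sketch
      ! Made-up example, not code from lapwso: a wrong column count in
      ! a BLAS call corrupts memory beyond the allocated arrays. This
      ! is exactly the kind of access valgrind --track-origins=yes
      ! will report.
      implicit none
      integer, parameter :: ndim = 100
      complex*16, allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(ndim,ndim), b(ndim,ndim), c(ndim,ndim))
      a = (1.d0,0.d0); b = (1.d0,0.d0); c = (0.d0,0.d0)
      ! wrong: ndim+10 columns requested although only ndim are allocated
      call zgemm('N', 'N', ndim, ndim+10, ndim, (1.d0,0.d0), a, ndim, &
                 b, ndim, (0.d0,0.d0), c, ndim)
      ! this deallocate may now crash or pass, depending on what the
      ! out-of-bounds writes have hit
      deallocate(a, b, c)
      end program blas_overrun_sketch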

Luis, can you upload the testcase somewhere, or recompile with debuginfo as suggested by Laurence earlier, run "valgrind --track-origins=yes lapwso lapwso.def" and send the output? Just be warned, there is a massive slowdown with valgrind (up to 100x) and the logfile can get very large.

Best regards
Pavel


On Wed, 2021-08-18 at 12:10 -0500, Laurence Marks wrote:
Correction, I was looking at an older modules.F. It looks like it
should be

DEALLOCATE(vect,stat=IV) ; if(IV .ne. 0)write(*,*)IV


On Wed, Aug 18, 2021 at 11:23 AM Laurence Marks
<laurence.ma...@gmail.com> wrote:
I do wonder about this. I suggest editing modules.F and changing lines 118 and 119 to

      DEALLOCATE(en,stat=Ien) ; if(Ien .ne. 0)write(*,*)'Err en ',Ien
      DEALLOCATE(vnorm,stat=Ivn) ; if(Ivn .ne. 0)write(*,*)'Err vnorm ',Ivn

There is every chance that the bug is not in those lines, but somewhere completely different. SIGSEGV often means that memory has been overwritten, for instance by arrays going out of bounds.

You can also recompile with -g added (don't change the other options), and/or with -C. Sometimes this is better. Or use other tools such as debuggers or valgrind.

On Wed, Aug 18, 2021 at 10:47 AM Pavel Ondračka
<pavel.ondra...@email.cz> wrote:
I'm CCing the list back as the crash has now been diagnosed as a likely MKL problem, see below for more details.

So just to be clear, explicitly setting OMP_STACKSIZE=1g does not help to solve the issue?


Right! OMP_STACKSIZE=1g with OMP_NUM_THREADS=4 does not solve the problem!

The problem is that the OpenMP code in lapwso is very simple, so I'm having trouble seeing how it could be causing the problems.

Could you also try to see what happens if you run with:
OMP_NUM_THREADS=1
MKL_NUM_THREADS=4


It does not work with these values, but I checked and it works with them reversed:
OMP_NUM_THREADS=4
MKL_NUM_THREADS=1

This was very helpful and IMO points to a problem with MKL instead of Wien2k.

Unfortunately, setting MKL_NUM_THREADS=1 globally will reduce the OpenMP performance, mostly in lapw1 but also at other places. So if you want to keep the OpenMP BLAS/lapack-level parallelism, you have to either find some MKL version that works (if you do, please report it here), link with OpenBLAS (using it just for lapwso is enough), or create a simple wrapper that sets MKL_NUM_THREADS=1 just for lapwso, i.e., rename the lapwso binary in WIENROOT to lapwso_bin and create a new lapwso file there with:

#!/bin/bash
MKL_NUM_THREADS=1 lapwso_bin $1

and set it to executable with chmod +x lapwso.

Or maybe MKL has a non-OpenMP version which you could link with just lapwso while using the standard one in the other parts, but I don't know, I mostly use OpenBLAS. If you need some further help, let me know.

Reporting the issue to Intel could also be nice, however I never had any real luck there, and it is also a bit problematic as you can't provide a testcase due to Wien2k being proprietary code...

Best regards
Pavel

This should disable the Wien2k-specific OpenMP parallelism but still keep the rest of the parallelism at the BLAS/lapack level.


So, perhaps, the problem is related to MKL!

Another option is that something is going wrong before lapwso and the lapwso crash is just the symptom. What happens if you run everything up to lapwso without OpenMP (OMP_NUM_THREADS=1) and then enable it just for lapwso?


If I run lapw0 and lapw1 with OMP_NUM_THREADS=4 and then change it to 1 just before lapwso, it works.
If I do the opposite, starting with OMP_NUM_THREADS=1 and then changing it to 4 just before lapwso, it does not work.
So I believe that the problem is really in lapwso.
If you need more information, please let me know!
All the best,
Luis



--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what
nobody else has thought" Albert Szent-Györgyi


--
-----------------------------------------------------------------------
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at
WWW:   http://www.imc.tuwien.ac.at      WIEN2k: http://www.wien2k.at
-------------------------------------------------------------------------
