[Wien] SIGSEGV fault error with mBJ
Dear support,

I tried to calculate TiC simply as a test. The SCF cycle completes without any error, but the mBJ calculation fails with the following error:

LAPW0 END
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC            Routine            Line     Source
lapw0              0040519B      c3fft_1_           119      fftpack_helpers.f
lapw0              00415128      fftpack_mp_c3fft_  397      fft_modules.F
lapw0              0048B865      vresp_             106      vresp.F
lapw0              004A239D      xcpot3_            147      xcpot3.F
lapw0              0046664E      MAIN__             1935     lapw0.F
lapw0              004039BC      Unknown            Unknown  Unknown
libc.so.6          003D1C01EC5D  Unknown            Unknown  Unknown
lapw0              004038B9      Unknown            Unknown  Unknown
> stop error

My computer is an HP desktop with an i3 processor. I used Intel Fortran Composer XE (l_fcompxe_2013.1.117.tgz) and WIEN2k 12, and my operating system is CentOS 6.

Help required. Thanks.

Yours sincerely,
Jameson Maibam
[Wien] IBM AIX error
N.B., unless Peter can do the essl conversions, I can only add them to the mixer that will be in the next release (which is better than the current one).

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody else has thought"
Albert Szent-Gyorgi

On Mar 19, 2013 8:32 PM, "Laurence Marks" wrote:
> [...]
> To handle this, I see two options:
> a) Someone with access to essl works (i can help) to add "#ifdef essl" to
> the mixer routines. Since I have no access to aix/essl I cannot do this.
> b) You, and perhaps others switch to standard lapack for the mixer.
> [...]
[Wien] IBM AIX error
Hmm, this is tricky. Based on the links below, it looks like essl uses non-standard lapack versions.

http://www.cpmd.org:81/pipermail/cpmd-list/2006-December/003584.html
http://cms.mpi.univie.ac.at/vasp-forum/forum_viewtopic.php?2.45

To handle this, I see two options:
a) Someone with access to essl works (I can help) to add "#ifdef essl" to the mixer routines. Since I have no access to aix/essl, I cannot do this.
b) You, and perhaps others, switch to standard lapack for the mixer.

I believe essl should conform to the published standard.

N.B., there may be a problem if essl decides to do its own error handling with, for instance, eigenvalues of singular matrices. These are supposed to fail, and if essl crashes out, the mixer will fail.

N.N.B. In an emergency you can try regressing to MSEC1, although this is not as good as MSR1 & MSR1a. This will let you know if the other codes are working.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody else has thought"
Albert Szent-Gyorgi

On Mar 19, 2013 7:15 PM, "Oliver Albertini" wrote:
> [...]
> NiO.outputm says the following:
>
> DGEEV : 2538-2099
> End of input argument error reporting. For more information, refer to
> Engineering and Scientific Subroutine Library Guide and Reference
> (SA22-7904).
>
> DGEEV : 2538-2604
> Execution terminating due to error count for error number 2099.
> [...]
[Wien] IBM AIX error
Dear WIEN2k users,

I recently compiled 12.1 on AIX (v 6.1) pwr6. Like Luis, I also had to make some changes to the SRC's in order to finish the compilation. These were mostly xlf issues, such as syntax. 9.2 was the most recent version before this.

To check the program, I ran a NiO 2x2x2 supercell. init_lapw went well, and upon running runsp_lapw, I got the following output:

# runsp_lapw
hup: Command not found.
STOP LAPW0 END
STOP LAPW1 END
STOP LAPW1 END
STOP LAPW2 END
syntax error on line 1 stdin
STOP LAPW2 END
syntax error on line 1 stdin
STOP CORE END
STOP CORE END
STOP MIXER END
Sending nohup output to nohup.out.
hup: Command not found.
STOP LAPW0 END
STOP LAPW1 END
STOP LAPW1 END
STOP LAPW2 END
syntax error on line 1 stdin
STOP LAPW2 END
syntax error on line 1 stdin
STOP CORE END
STOP CORE END
STOP MIXER END
Sending nohup output to nohup.out.
hup: Command not found.
STOP LAPW0 END
STOP LAPW1 END
STOP LAPW1 END
STOP LAPW2 END
syntax error on line 1 stdin
STOP LAPW2 END
syntax error on line 1 stdin
STOP CORE END
STOP CORE END
STOP 1

> stop error

I ran a few more times with '-NI' and got a few more cycles out. The energies are reasonable in comparison with other machines. In mixer.error, the following was printed:

Error in MIXER

Also, the NiO.output2up/dn files have the line 'no read error', and NiO.outputm says the following:

DGEEV : 2538-2099
End of input argument error reporting. For more information, refer to Engineering and Scientific Subroutine Library Guide and Reference (SA22-7904).

DGEEV : 2538-2604
Execution terminating due to error count for error number 2099.

Finally, the dayfile reveals the following error:

error: command /usr/bin/WIEN2k/12.1/mixer mixer.def failed

mixer was the last program that I compiled, and I had to install a 64-bit version of LAPACK to make this work, since the routines dggglm and dgelsy were coming back as undefined symbols.

I look forward to hearing suggestions.

Sincerely,

Oliver Albertini
[Wien] SIGSEGV fault error with mBJ
Please search the mailing list. It was mentioned before that you have to fix fftpack (a patch was included in the mailing list), or you switch to FFTW2/3.

On 03/19/2013 02:18 PM, Jameson Maibam wrote:
> Dear support
> I tried to calculate the TiC simply for test. The scf cycle completes
> without any error. While the mBJ encounters the following type of error
> LAPW0 END
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> lapw0 0040519B c3fft_1_ 119 fftpack_helpers.f
> lapw0 00415128 fftpack_mp_c3fft_ 397 fft_modules.F
> lapw0 0048B865 vresp_ 106 vresp.F
> lapw0 004A239D xcpot3_ 147 xcpot3.F
> lapw0 0046664E MAIN__ 1935 lapw0.F
> lapw0 004039BC Unknown Unknown Unknown
> libc.so.6 003D1C01EC5D Unknown Unknown Unknown
> lapw0 004038B9 Unknown Unknown Unknown
>> stop error
> My computer is i3 hp desktop. I used intel fortran composer xe
> (l_fcompxe_2013.1.117.tgz) and wien2k 12. And my operating system is
> centos6.
> Help required.
> Thanks
> Yours sincerely
> Jameson Maibam

--
P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at WWW: http://info.tuwien.ac.at/theochem/
--
[Wien] QTL-B message in scf2 after "x lapw2 -qtl"
Dear Prof. Blaha:

Thank you very much for your kind and detailed answer to my question.

For the energy range I am most interested in, 0-2 Ry from the Fermi level, I could get rid of the ghost bands by following your first advice. But I would like to get rid of them up to as high an energy as possible, so please let me continue.

I tried the suggestion in your last mail and met oscillating behavior like this:
"After getting rid of atom2 l=1 ghost band, I got atom1 l=1 ghost band. After getting rid of atom1 l=1 ghost band, I got atom2 l=0 ghost band. After getting rid of atom2 l=0 ghost band, I got atom2 l=1 ghost band...".

Is there a further strategy to get rid of ghost bands from 0-3 Ry? Does adding APW+lo and LOCAL ORBITAL for l=3,4 make no sense?

Details are below. (I had to set emax to 3.5, not 2.5. But even when I set emax=3.5, the situation doesn't change.)

At first I changed atom2 l=0's LO energy parameter to 2.5 Ry as below.

---
WFFIL  EF= 0.5   (WFFIL, WFPRI, ENFIL, SUPWF)
  7.00       10    4 (R-MT*K-MAX; MAX L IN WF, V-NMT
  0.30    4  0      (GLOBAL E-PARAMETER WITH n OTHER CHOICES, global APW/LAPW)
 0    0.30      0.000 CONT 1
 0   -5.57      0.001 STOP 1
 1    0.30      0.000 CONT 1
 1   -3.12      0.001 STOP 1
  0.30    4  0      (G...)
 0   -1.46      0.002 CONT 1
 0    0.30      0.000 CONT 1
 1    0.30      0.000 CONT 1
 1    2.50      0.000 CONT 1
K-VECTORS FROM UNIT:4   -9.0       2.5    18   emin/emax/nband #red
---

But after the DOS calculation, I got QTL-B value=7.55419 at 2.07915 Ry for atom1, l=1.

Next I changed atom2 l=0's LO energy parameter to 3.0 Ry or 4.0 Ry. But the result was almost the same.

So next, I changed the atom1 l=0 energy parameters from (0.3 and -3.12) to (2.0 & -3.12). After the DOS calculation, I got QTL-B value=3.26254 at 2.39181 Ry for atom2 l=0. I thought this QTL-B value was rather small, so I checked the help032 file and found the following:

L= 0  12.72219   9.975   3.263  46.972 -14.698  -9.046

3.263/12.72=26% is larger than a few percent, so it may not be good. (I saw )

So next, I changed the atom2 l=0 energy parameters from (-1.46 & 1.3) to (-1.46 & 2.3). The in1_st file at this point is as below.

---
WFFIL  EF= 0.5   (WFFIL, WFPRI, ENFIL, SUPWF)
  7.00       10    4 (R-MT*K-MAX; MAX L IN WF, V-NMT
  0.30    4  0      (GLOBAL E-PARAMETER WITH n OTHER CHOICES, global APW/LAPW)
 0    0.30      0.000 CONT 1
 0   -5.57      0.001 STOP 1
 1    2.00      0.000 CONT 1
 1   -3.12      0.001 STOP 1
  0.30    4  0      (GLOBAL...)
 0   -1.46      0.002 CONT 1
 0    2.30      0.000 CONT 1
 1    0.30      0.000 CONT 1
 1    2.50      0.000 CONT 1
K-VECTORS FROM UNIT:4   -9.0       2.5    18   emin/emax/nband #red
---

After the DOS calculation, I got QTL-B value=2.94021 at 1.22167 Ry for atom2 l=1 and found the following in the help032 file.

---
L= 1  47.80363  28.378   2.940   2.635   4.587   2.339
---

2.94/47.80=6.2% is still a few percent but below 10%. The User Guide says:
"The few percent message (e.g up to 10 %) does not indicate a ghost band, but can happen e.g. in narrow d-bands, where the linearization reaches its limits. In these cases one can add a local orbital to improve the flexibility of the basis set."
But I have already added a local orbital at atom2 l=0.

I changed the atom2 l=1 energy parameters from (0.3 & 2.5) to (0.3, 1.5). But I got QTL-B value=10.05510 at 2.48523 Ry for atom2 l=1. I think some oscillating behavior occurred.

Best regards,
S.Fujita
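The percentage check used in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it follows the poster's reading of the help032 line (first number = total l-like charge, third number = QTL-B value), and the function names qtlb_percent and within_guide_limit are hypothetical, not part of WIEN2k. The 10 % limit is the figure quoted from the User Guide.

```python
def qtlb_percent(qtl_b: float, total: float) -> float:
    """Return the QTL-B value as a percentage of the total l-like charge."""
    if total <= 0:
        raise ValueError("total charge must be positive")
    return 100.0 * qtl_b / total


def within_guide_limit(percent: float, limit: float = 10.0) -> bool:
    """True if the percentage does not exceed `limit` (10 % as quoted from the User Guide)."""
    return percent <= limit


# (atom, l, total, QTL-B) taken from the two help032 lines quoted in the message.
# Which column means what is an assumption based on how the poster used them.
cases = [
    ("atom2", 0, 12.72219, 3.263),
    ("atom2", 1, 47.80363, 2.940),
]

for atom, l, total, qtl_b in cases:
    pct = qtlb_percent(qtl_b, total)
    print(f"{atom} l={l}: {pct:.1f} %  within limit: {within_guide_limit(pct)}")
```

With the two quoted lines this gives roughly the 26 % and 6.2 % figures from the message; only the second falls under the 10 % guideline.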
[Wien] Systematic slowing down of calculations with time
I was very lucky; the issue is related to cached memory, and running

sync; echo 3 > /proc/sys/vm/drop_caches

solved the problem. (See http://www.hosting.com/support/linux/clear-memory-cache-on-linux-server & http://www.linuxinsight.com/proc_sys_vm_drop_caches.html )

No idea why this occurred, but obviously something (impi, mkl, ...) is leading to some combination of clean caches, dentries and inodes sitting in memory and degrading performance. I will put an appropriate cron task in; others might want to talk to their sys_admin if they ever see this.

On Tue, Mar 19, 2013 at 8:11 AM, Laurence Marks wrote:
> I have a reproducible slowing down of calculations which appears to be
> in lapw1 due to something (memory leak?) which is going to be hard to
> track down so I welcome suggestions.
>
> I first noticed it when one newish E5-2660 node was systematically
> running at ~1/2 the speed of others, reproducibly. After rebooting it
> went back to running at the same speed as others.
>
> I have now reproduced a systematic slowing down of lapw1 (I cannot see
> anything in lapw2) for a long calculation (-it -noHinv, but I don't
> think this matters). It is shown in the attached with the x axis
> iteration, the y axis time in minutes. (The image may get shuffled to
> a link by the listserver software.) Starting from ~ 7minutes the
> slowdown is approximately 8 seconds/iteration. This is a fairly big
> calculation with a matrix size of 45456 and 835m/core (virtual)
> running on 64 cores. There is no indication that this is
> communications related, the slowdown is in CPU and WALL remains very
> close to this.
>
> Obviously recompiling with debug on is not going to be a viable
> approach. Also a scatter debug strategy, for instance trying to add
> calls to release memory from mkl calls is going to be very painful as
> we are talking about ~1 day to test. Ideal is innovative ideas to
> trace down why it has gone slow.
>
> Ideas?
>
> For reference, I am using composer_xe_2013.2.146 and Intel impi. I
> don't see this on older E5410 nodes but I have not run enough
> iterations to notice.
>
> N.B., others might want to look in long recent runs to see if they
> also have evidence for this.

--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody else has thought"
Albert Szent-Gyorgi
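For anyone who wants to check a node before and after applying the fix described above, here is a minimal Python sketch. It assumes a Linux system with the usual /proc/meminfo layout ("Name:   value kB"); the helper names parse_meminfo, cached_fraction and drop_caches are hypothetical, and drop_caches simply mirrors the sync; echo 3 > /proc/sys/vm/drop_caches command from the message, so it needs root.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value (kB for most fields)}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def cached_fraction(info: dict) -> float:
    """Fraction of MemTotal currently reported as page cache ('Cached')."""
    return info["Cached"] / info["MemTotal"]


def drop_caches(path: str = "/proc/sys/vm/drop_caches") -> None:
    """Flush dirty pages, then ask the kernel to drop page cache, dentries and inodes.

    Same effect as `sync; echo 3 > /proc/sys/vm/drop_caches`; requires root.
    """
    os.sync()
    with open(path, "w") as handle:
        handle.write("3\n")
```

A possible use is to read /proc/meminfo, log cached_fraction(parse_meminfo(text)) between SCF iterations, and call drop_caches() from a root cron job; at what fraction that becomes worthwhile is not stated in the thread and would have to be found by experiment.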
[Wien] Systematic slowing down of calculations with time
Minor correction, x-axis is iteration*4

On Tue, Mar 19, 2013 at 8:11 AM, Laurence Marks wrote:
> [...]
> It is shown in the attached with the x axis
> iteration, the y axis time in minutes. (The image may get shuffled to
> a link by the listserver software.) Starting from ~ 7minutes the
> slowdown is approximately 8 seconds/iteration.
> [...]

--
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
"Research is to see what everybody else has seen, and to think what nobody else has thought"
Albert Szent-Gyorgi
[Wien] QTL-B message in scf2 after "x lapw2 -qtl"
Getting rid of those QTL-B warnings is an iterative process. However, I do not understand some of your "reactions":

> "After getting rid of atom2 l=1 ghost band, I got atom1 l=1 ghost band. After
> getting rid of atom1 l=1 ghost band, I got atom2 l=0 ghost band. After
> getting rid of atom2 l=0 ghost band, I got atom2 l=1 ghost band...".

> Adding APW+lo and LOCAL ORBITAL of l=3,4 is no sense?

No, this does not make sense.

> At first I changed atom2 l=0's LO energy parameter to 2.5Ry as below.
>
> But after DOS,I got QTL-B value=7.55419 in 2.07915Ry of atom1,l=1.
>
> Next I changed atom2 l=0's LO energy parameter to 3.0Ry or 4.0Ry.
>
> But the result was almost same.

You got a QTL-B on atom 1, l=1. So you need to modify the energy parameter of atom 1, l=1, not that of atom 2, l=0 !!

Always modify the energy parameters of the atom and l where the QTL-B occurs.

And small values like 2.x for states at high energy are probably OK.

> The in1_st file at this point is as below.

Why case.in1_st ??? It must be case.in1

--
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pblaha at theochem.tuwien.ac.at
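The rule in this reply (change the energy parameter of the same atom and l for which the QTL-B is reported) can be written down as a tiny bookkeeping sketch in Python. The record type QtlbWarning and both helper functions are hypothetical names introduced only for illustration; nothing here parses real WIEN2k output, and the numbers are the ones quoted in the thread.

```python
from typing import NamedTuple, Tuple


class QtlbWarning(NamedTuple):
    atom: int         # atom for which the QTL-B warning was reported
    l: int            # angular momentum reported with the warning
    value: float      # QTL-B value
    energy_ry: float  # band energy in Ry


def parameter_to_modify(warning: QtlbWarning) -> Tuple[int, int]:
    """Return the (atom, l) whose energy parameter should be changed."""
    return (warning.atom, warning.l)


def targets_right_parameter(warning: QtlbWarning, changed_atom: int, changed_l: int) -> bool:
    """Check whether a change was made on the atom and l named by the warning."""
    return (changed_atom, changed_l) == parameter_to_modify(warning)


# Values quoted in the thread: QTL-B on atom 1, l=1, while atom 2, l=0 was changed.
w = QtlbWarning(atom=1, l=1, value=7.55419, energy_ry=2.07915)
print(parameter_to_modify(w))            # (1, 1)
print(targets_right_parameter(w, 2, 0))  # False
```

Keeping such a record for each trial makes it easier to see whether successive changes actually follow the warnings or drift to other atoms and l values.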