Dear Iurii

Thank you. I'm less than a dummy with allocation/parallelization etc. issues. Otherwise I would be glad to help...

Best
Giuseppe

> Really? And there is no problem on Linux?

no, not at all...;-)

On Friday, February 05, 2016 01:07:22 PM Timrov Iurii wrote:
> Dear Giuseppe,
>
> I am going to check if there are some extra allocations and/or a memory leak, when I have some time.
>
> > Indeed!!! Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
>
> Really? And there is no problem on Linux?
>
> Best regards,
> Iurii
>
> --
> Dr. Iurii Timrov
> Postdoctoral Researcher
> École Polytechnique Fédérale de Lausanne,
> Theory and Simulation of Materials
> CH-1015 Lausanne, Switzerland
> +41 21 69 34 881
> http://people.epfl.ch/265334
>
> ________________________________________
> From: Giuseppe Mattioli <giuseppe.matti...@ism.cnr.it>
> Sent: Friday, February 5, 2016 1:55 PM
> To: Timrov Iurii
> Cc: pw_forum@pwscf.org
> Subject: Re: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
>
> Dear Iurii
>
> > It seems that this is a RAM issue.
>
> Maybe something connected with memory allocation, which changed between 5.1.1 and 5.2.1
>
> > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of PWscf so that only the occupied states are computed (actually, the empty states are not needed in the Lanczos calculation), which of course decreased RAM requirements, the code didn't crash.
>
> the code didn't crash even on my 8 cores 16 GB RAM cluster with 5.1.1. And it is not a large benchmark. I used to run larger ones on the same node and far larger ones on two nodes of the same machine with older versions. The problem cannot be due to the overall memory requirements, but to some problematic memory allocation (a leak somewhere?).
>
> > If this is indeed the reason for the problem, then it seems strange to me why QE-5.1.1 does not have this problem. Maybe some investigation would be desirable.
>
> Indeed!!!
> Only Microsoft Windows requires more memory to do the same thing in newer versions!!!!
>
> Best
> Giuseppe
>
> On Friday, February 05, 2016 10:07:14 AM Timrov Iurii wrote:
> > Dear Giuseppe,
> >
> > It seems that this is a RAM issue.
> >
> > I ran your test case with QE-5.2.1 on my local workstation with 8 cores and 64 GB RAM and the Lanczos code crashed. When I changed the input of PWscf so that only the occupied states are computed (actually, the empty states are not needed in the Lanczos calculation), which of course decreased RAM requirements, the code didn't crash.
> >
> > On the Bluegene machine you may try to optimize RAM too. As you may know, one can allocate all cores per node but use only a few of them, which would allow you to increase RAM per core. Please note that with Davidson the RAM requirements are even larger, so it is not easy to optimize the job submission script for large systems.
> >
> > In order to verify whether you also have a memory issue, you may try to decrease the value of celldm(1), cutoffs etc. and see whether the code still crashes.
> >
> > If this is indeed the reason for the problem, then it seems strange to me why QE-5.1.1 does not have this problem. Maybe some investigation would be desirable.
> >
> > HTH
> >
> > Best regards,
> > Iurii
> >
> > ________________________________________
> > From: Giuseppe Mattioli <giuseppe.matti...@ism.cnr.it>
> > Sent: Thursday, February 4, 2016 5:59 PM
> > To: Timrov Iurii
> > Cc: pw_forum@pwscf.org
> > Subject: Re: Re: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> >
> > Silent crash on bluegene with 5.2.1 (I have no time to compile 5.3.0 now.
> > I may try tomorrow if you think it is important).
> >
> > Program turboTDDFT v.5.2.1 starts on 4Feb2016 at 17:56:55
> >
> > This program is part of the open-source Quantum ESPRESSO suite
> > for quantum simulation of materials; please cite
> > "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> > URL http://www.quantum-espresso.org",
> > in publications or presentations arising from this work. More details at
> > http://www.quantum-espresso.org/quote
> >
> > Parallel version (MPI & OpenMP), running on 2048 processor cores
> > Number of MPI processes: 512
> > Threads/MPI process: 4
> > R & G space division: proc/nbgrp/npool/nimage = 512
> >
> > Reading data from directory:
> > /gpfs/scratch/userexternal/gmattiol/test/tddft/run/tmp/././l0-5.3.0.save
> >
> > Info: using nr1, nr2, nr3 values from input
> >
> > Info: using nr1, nr2, nr3 values from input
> >
> > IMPORTANT: XC functional enforced from input :
> > Exchange-correlation = SLA PW PBE PBE ( 1 4 3 4 0 0)
> > Any further DFT definition will be discarded
> > Please, verify this is what you really want
> >
> > Parallelization info
> > --------------------
> > sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
> > Min          78      38      8                12054     4220     492
> > Max          80      40     10                12104     4300     550
> > Sum       40733   20369   5097              6186431  2186841  273425
> > Tot       20367   10185   2549
> >
> > negative rho (up, down): 9.597E-02 0.000E+00
> >
> > Subspace diagonalization in iterative solution of the eigenvalue problem:
> > scalapack distributed-memory algorithm (size of sub-group: 16* 16 procs)
> >
> > Warning: There are virtual states in the input file, trying to disregard in response calculation
> >
> > Ultrasoft (Vanderbilt) Pseudopotentials
> >
> > Normal read
> >
> > Gamma point algorithm
> >
> > 2016-02-04 17:57:18.063 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job: terminated by signal 6
> > 2016-02-04 17:57:18.065 (WARN ) [0x40000ee8d50] :7014845:ibm.runjob.client.Job:
> > abnormal termination by signal 6 from rank 295
> >
> > On Thursday, February 04, 2016 03:46:10 PM Timrov Iurii wrote:
> > > Dear Giuseppe,
> > >
> > > As far as I understand, the code crashes when it tries to write the vectors "d0psi" to disk. The first thing to do, I think, is to check that you have enough disk space. If this is not the issue, then let's continue looking for a reason.
> > >
> > > You may want to look in the routine TDDFPT/src/lr_solve_e.f90 at lines 110-138, where the code writes vectors to disk in parallel. Please make sure that the "outdir" is the same in PWscf and in Lanczos/Davidson (and don't specify wfcdir). If this does not solve the problem, could you please also report the output of Lanczos/Davidson (preferably Lanczos)?
> > >
> > > HTH
> > >
> > > Best regards,
> > > Iurii Timrov
> > > Post-doctoral researcher
> > > THEOS - École Polytechnique Fédérale de Lausanne
> > > Lausanne, Switzerland
> > >
> > > ________________________________________
> > > From: pw_forum-boun...@pwscf.org <pw_forum-boun...@pwscf.org> on behalf of Giuseppe Mattioli <giuseppe.matti...@ism.cnr.it>
> > > Sent: Thursday, February 4, 2016 11:34 AM
> > > To: pw_forum@pwscf.org
> > > Subject: [Pw_forum] possible i/o bug in turbo_lanczos.x and turbo_davidson.x 5.3.0
> > >
> > > Dear All
> > >
> > > I'm having problems when performing nontrivial runs of turbo_davidson.x and turbo_lanczos.x with the 5.2.1 and 5.3.0 versions of QE.
> > >
> > > Let me say first that "trivial" runs (CH4 molecule with same pseudopotentials and cutoffs but a smaller 30 a.u.^3 cubic cell) work fine with all the tested versions.
> > >
> > > However, the input files for a nontrivial case that leads to a crash should run on a decent pc in about 1 hr, so they provide a significant but not huge test.
> > > *Note* that if I run the same input files with the 5.1.1 version (compiled in the very same environment) everything goes (more slowly but) fine! The 5.3.0 (and 5.2.1) crashes have been reproduced on two different machines (Intel 8 cores 16 GB RAM, AMD 32 cores 64 GB RAM), so they should not be considered as erratic.
> > >
> > > Here is the pw.x input. The PPs are quite old and can be found in the online library (or provided by me on demand).
> > >
> > > &control
> > >   calculation = 'scf'
> > >   restart_mode='from_scratch',
> > >   prefix='l0-5.3.0',
> > >   pseudo_dir = '/home/mattioligi/PP_UPF/',
> > >   outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
> > >   nstep=300,
> > >   max_seconds=80000,
> > >   disk_io='low',
> > >   tprnfor=.true.,
> > > /
> > > &system
> > >   ibrav=1, celldm(1)=40.000000,
> > >   nat=42, ntyp=4, nbnd=75,
> > >   ecutwfc = 40.0,
> > >   ecutrho = 320.0,
> > >   nspin=1,
> > > /
> > > &electrons
> > >   diagonalization='david',
> > >   mixing_mode='plain'
> > >   mixing_beta=0.1
> > >   conv_thr=1.0d-8
> > >   electron_maxstep=100
> > > /
> > > &ions
> > > /
> > > ATOMIC_SPECIES
> > > O 15.999 O_pbe.van.UPF
> > > N 14.007 N.pbe-van_bm.UPF
> > > C 12.011 C_pbe.van.UPF
> > > H 1.008 H_pbe.van.UPF
> > > ATOMIC_POSITIONS {angstrom}
> > > C 4.815369179 12.355337788 8.111406911
> > > C 5.639537337 12.072210478 7.018248617
> > > C 6.373883049 10.886794669 6.974735758
> > > H 5.707874252 12.778745273 6.179910928
> > > C 4.734413944 11.441350355 9.166316558
> > > H 4.235443595 13.287281698 8.140567718
> > > C 6.304598307 9.977077773 8.041477142
> > > H 7.012644682 10.659891408 6.111132336
> > > C 5.477180541 10.260422385 9.138835842
> > > H 4.092409998 11.653694694 10.031778418
> > > H 5.418528381 9.546881383 9.971310698
> > > N 7.058612774 8.759574945 8.006208499
> > > C 6.384981399 7.544139013 8.340645249
> > > C 6.997532612 6.588483316 9.168188787
> > > C 5.084708421 7.308024697 7.864810575
> > > C 6.325550737 5.410241765 9.493833204
> > > H 8.006262126 6.776794433 9.557919083
> > > C 4.414663626 6.134355690 8.210976959
> > > H 4.597637090 8.055249046 7.224770074
> > > C 5.030975670 5.176070562 9.020776666
> > > H 6.819890970 4.670618768 10.138154855
> > > H 3.397721512 5.964689741 7.832306200
> > > H 4.503298572 4.249946635 9.284425745
> > > C 8.412602212 8.773905175 7.652414992
> > > C 9.197305040 9.938168667 7.841458619
> > > C 9.043381168 7.634703664 7.098599788
> > > C 10.533008285 9.972397555 7.486007356
> > > H 8.740413757 10.828552107 8.290447985
> > > C 10.383506998 7.674400214 6.758021800
> > > H 8.466388332 6.717306584 6.931252215
> > > C 11.175184928 8.838234071 6.927523312
> > > H 11.098162573 10.894629696 7.663657304
> > > H 10.849606517 6.778483121 6.322529487
> > > C 12.554045113 8.768090174 6.529797787
> > > C 13.538745611 9.729179498 6.474718127
> > > H 12.882286114 7.769870632 6.203237321
> > > C 13.338246843 11.096686263 6.810664645
> > > N 13.160471613 12.223162736 7.083088078
> > > C 14.914360413 9.407055683 6.034105289
> > > O 15.832284936 10.221452163 5.956798921
> > > O 15.091537629 8.085358800 5.710801225
> > > H 16.043983143 8.016066678 5.436328923
> > > K_POINTS {gamma}
> > >
> > > And here are the turbo_lanczos.x and turbo_davidson.x input files
> > >
> > > lanczos
> > >
> > > &lr_input
> > >   prefix="l0-5.3.0",
> > >   outdir='/state/partition1/mattioligi/34339',
> > >   wfcdir='/state/partition1/mattioligi/34339',
> > >   restart_step=6,
> > >   restart=.false.
> > > /
> > > &lr_control
> > >   itermax=12,
> > >   ipol=4,
> > > /
> > >
> > > davidson
> > >
> > > &lr_input
> > >   prefix="l0-5.3.0",
> > >   outdir='/state/partition1/mattioligi/34340',
> > >   restart=.false.
> > > /
> > > &lr_dav
> > >   num_eign=2
> > >   num_init=4
> > >   num_basis_max=10
> > >   residue_conv_thr=1.0E-4
> > >   start=0.1
> > >   finish=1.5
> > >   step=0.0002
> > >   broadening=0.005
> > >   reference=0.2
> > >   p_nbnd_occ=5
> > >   p_nbnd_virt=5
> > >   poor_of_ram=.false.
> > >   poor_of_ram2=.false.
> > > /
> > >
> > > In both cases and on both machines the CRASH report is something like
> > >
> > > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > > task # 1
> > > from davcio : error # 20
> > > error while writing from file
> > > "/state/partition1/mattioligi/34340/l0-5.3.0.d0psi.32"
> > > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> > >
> > > I suppose that it is some kind of I/O error, but I would warmly welcome your opinion...:-)
> > >
> > > Thank you in advance
> > > Giuseppe
> > >
> > > ********************************************************
> > > - Article 1 - Men are born and remain
> > > free and equal in rights. Social distinctions
> > > may be founded only upon the common good
> > > - Article 2 - The aim of every political association
> > > is the preservation of the natural and
> > > imprescriptible rights of man. These rights are liberty,
> > > property, security, and resistance to oppression.
> > > ********************************************************
> > >
> > > Giuseppe Mattioli
> > > CNR - ISTITUTO DI STRUTTURA DELLA MATERIA
> > > v. Salaria Km 29,300 - C.P. 10
> > > I 00015 - Monterotondo Stazione (RM), Italy
> > > Tel + 39 06 90672836 - Fax +39 06 90672316
> > > E-mail: <giuseppe.matti...@ism.cnr.it>
> > > http://www.ism.cnr.it/english/staff/mattiolig
> > > ResearcherID: F-6308-2012
> > >
> > > _______________________________________________
> > > Pw_forum mailing list
> > > Pw_forum@pwscf.org
> > > http://pwscf.org/mailman/listinfo/pw_forum
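
For anyone hitting the same davcio error: the first checks suggested in the thread (same "outdir" in the pw.x and Lanczos/Davidson inputs, no wfcdir, enough free disk space) can be automated before submitting a job. The following is only a rough Python sketch, not part of Quantum ESPRESSO. The helper names `namelist_value`, `check_inputs` and `free_gib` are made up for illustration, and the regex only understands quoted string values (it does not skip commented-out lines).

```python
import re
import shutil


def namelist_value(text, key):
    """Return the quoted string assigned to `key` in namelist text, or None."""
    # Matches key='value' or key="value"; the closing quote must match the opening one.
    pattern = rf"\b{re.escape(key)}\s*=\s*(['\"])(.*?)\1"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(2) if match else None


def check_inputs(pw_text, turbo_text):
    """List deviations from the advice in the thread for a pw.x / turbo input pair."""
    problems = []
    pw_outdir = namelist_value(pw_text, "outdir")
    turbo_outdir = namelist_value(turbo_text, "outdir")
    if pw_outdir is None or turbo_outdir is None:
        problems.append("outdir is not set in one of the inputs")
    elif pw_outdir.rstrip("/") != turbo_outdir.rstrip("/"):
        problems.append(f"outdir differs: {pw_outdir} vs {turbo_outdir}")
    if namelist_value(turbo_text, "wfcdir") is not None:
        problems.append("wfcdir is set in the turbo input")
    return problems


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Excerpts of the inputs posted in this thread.
    pw_input = """
    &control
      prefix='l0-5.3.0',
      outdir='/home/mattioligi/cocat/test_tddft/5.2.1/l0/5.3.0/run/tmp/',
    /
    """
    lanczos_input = """
    &lr_input
      prefix="l0-5.3.0",
      outdir='/state/partition1/mattioligi/34339',
      wfcdir='/state/partition1/mattioligi/34339',
    /
    """
    for problem in check_inputs(pw_input, lanczos_input):
        print("WARNING:", problem)
    print(f"free space in current directory: {free_gib('.'):.1f} GiB")
```

With the excerpts above, the script flags both the differing outdir strings and the wfcdir setting in the Lanczos input. Whether the differing paths are intentional (for example, files staged to node-local scratch by the job script) cannot be told from the inputs alone, so treat the warnings as prompts to double-check rather than as a diagnosis.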