I ran some tests. For 1000 Si atoms I use 2010 bands because I need the band gap value; moreover, since the system is a cluster, surface states from the truncated bonds might close the gap, especially in the first steps of the geometry optimization, so it is better to include a few empty bands. I managed to run the calculation using 10 nodes with at most 40 cores per node. My question now is: can you suggest optimal command-line options and/or input settings to speed up the calculation and, if possible, to reduce the number of nodes? The relevant parameters in the input file are the following:

    input_dft= 'pz'
    ecutwfc= 25
    occupations= 'smearing'
    smearing= 'cold'
    degauss= 0.05 ! I know it's quite large, but necessary to stabilize the SCF at this preliminary stage (no geometry step done yet)
    nbnd= 2010

    diagonalization= 'ppcg'
    mixing_mode= 'plain'
    mixing_beta= 0.4

The actual time spent per SCF cycle is about 33 minutes. I use QE v. 7.3 compiled with OpenMPI and ScaLAPACK. I also have access to the Intel compilers, but in my tests the difference was only tens of seconds. I use only the Gamma point. Here is some information about the grid and the estimated RAM usage:

     Dense  grid: 24616397 G-vectors     FFT dimensions: ( 375, 375, 375)
     Dynamical RAM for                 wfc:     235.91 MB
     Dynamical RAM for     wfc (w. buffer):     235.91 MB
     Dynamical RAM for           str. fact:       0.94 MB
     Dynamical RAM for           local pot:       0.00 MB
     Dynamical RAM for          nlocal pot:    2112.67 MB
     Dynamical RAM for                qrad:       0.80 MB
     Dynamical RAM for          rho,v,vnew:       6.04 MB
     Dynamical RAM for               rhoin:       2.01 MB
     Dynamical RAM for            rho*nmix:      15.03 MB
     Dynamical RAM for           G-vectors:       3.99 MB
     Dynamical RAM for          h,s,v(r/c):       0.46 MB
     Dynamical RAM for          <psi|beta>:     552.06 MB
     Dynamical RAM for      wfcinit/wfcrot:    1305.21 MB
     Estimated static dynamical RAM per process >       2.31 GB
     Estimated max dynamical RAM per process >       3.60 GB
     Estimated total dynamical RAM >    1441.34 GB
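
As a rough cross-check of the estimate above, the per-process and total figures can be combined to recover the number of MPI processes and the memory needed on one node. This is only a sketch: the function names are hypothetical, the inputs are the rounded lower bounds printed by pw.x, and it assumes every node is filled with 40 processes.

```python
def estimated_processes(total_gb, per_process_gb):
    """Number of MPI processes implied by the total and per-process estimates."""
    return round(total_gb / per_process_gb)


def per_node_gb(per_process_gb, processes_per_node):
    """Lower bound on the memory needed on one node (assumes equal load)."""
    return per_process_gb * processes_per_node


# Values copied from the pw.x estimate; 40 processes per node is an assumption.
n_proc = estimated_processes(1441.34, 3.60)
node_gb = per_node_gb(3.60, 40)
print(f"processes: {n_proc}, memory per node: at least {node_gb:.0f} GB")
```

With these numbers the estimate is about 400 processes, which matches 10 nodes with 40 cores each, and about 144 GB per fully populated node. The per-process figure may change when the number of processes changes, because some arrays are distributed.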

Thanks a lot in advance for your kind help.

All the best


pw.x -nk 1 -nt 1 -nb 1 -nd 768 -inp qe.in > qe.out

Too many processors for linear-algebra parallelization. 1000 Si atoms = 2000 bands (assuming an insulator with no spin polarization). Use a few tens of processors at most.
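
The arithmetic behind this advice can be sketched as follows. The sketch assumes 4 valence electrons per Si atom with doubly occupied bands, and that the linear-algebra group is arranged on a square grid, so that the value given to -nd should be a perfect square. The cap of 40 processes is just one reading of "a few tens", and the helper names are hypothetical.

```python
import math


def occupied_bands(n_atoms, valence_electrons_per_atom=4):
    """Occupied bands for a non-spin-polarized insulator (2 electrons per band)."""
    return n_atoms * valence_electrons_per_atom // 2


def largest_square_at_most(n):
    """Largest perfect square that does not exceed n."""
    root = math.isqrt(n)
    return root * root


bands = occupied_bands(1000)
nd_cap = 40  # assumed upper limit standing in for "a few tens"
nd = largest_square_at_most(nd_cap)
print(f"occupied bands: {bands}, candidate -nd value: {nd}")
```

Under these assumptions a value such as -nd 36 would be a candidate, to be checked against actual timings. Note that 768 is not a perfect square, while 484 is 22 x 22.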

"some processors have no G-vectors for symmetrization".

which sounds strange to me: with only the Gamma point, symmetrization is not even needed.

      Dense  grid: 30754065 G-vectors FFT dimensions: ( 400, 400, 400)

This is what a 256-atom Si supercell with 30 Ry cutoff yields:

     Dense  grid:   825897 G-vectors     FFT dimensions: ( 162, 162, 162)

I guess you could reduce the size of your supercell.
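
To put the two grids quoted above side by side, here is a small sketch that compares the G-vector counts and the FFT grid points. The numbers are copied from the two outputs; the function name is hypothetical, and since the two runs may not use the same cutoff, the ratios are only a rough indication of how much larger the cluster cell is.

```python
def ratio(large, small):
    """How many times larger the first quantity is than the second."""
    return large / small


# Values copied from the two "Dense grid" lines.
cluster_gvectors = 30754065
cluster_fft_points = 400 ** 3
bulk_gvectors = 825897
bulk_fft_points = 162 ** 3

gvec_ratio = ratio(cluster_gvectors, bulk_gvectors)
fft_ratio = ratio(cluster_fft_points, bulk_fft_points)
print(f"G-vectors: {gvec_ratio:.1f}x, FFT grid points: {fft_ratio:.1f}x")
```

The cluster cell carries a few tens of times more G-vectors than the 256-atom supercell, which is why shrinking the cell should reduce the memory footprint noticeably.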


      Dynamical RAM for                 wfc:     153.50 MB
      Dynamical RAM for     wfc (w. buffer):     153.50 MB
      Dynamical RAM for           str. fact:       0.61 MB
      Dynamical RAM for           local pot:       0.00 MB
      Dynamical RAM for          nlocal pot:    1374.66 MB
      Dynamical RAM for                qrad:       0.87 MB
      Dynamical RAM for          rho,v,vnew:       5.50 MB
      Dynamical RAM for               rhoin:       1.83 MB
      Dynamical RAM for            rho*nmix:       9.78 MB
      Dynamical RAM for           G-vectors:       2.60 MB
      Dynamical RAM for          h,s,v(r/c):       0.25 MB
      Dynamical RAM for          <psi|beta>:     552.06 MB
      Dynamical RAM for      wfcinit/wfcrot:     977.20 MB
      Estimated static dynamical RAM per process >       1.51 GB
      Estimated max dynamical RAM per process >       2.47 GB
      Estimated total dynamical RAM >    1900.41 GB

I managed to run the simulation with 512 atoms, cg diagonalization and 3 nodes on the same machine, using the command line:

pw.x -nk 1 -nt 1 -nd 484 -inp qe.in > qe.out

Do you have any suggestions on how to set optimal parallelization parameters to avoid the memory issue and run the calculation? I am also planning to run simulations on nanoclusters with more than 1000 atoms.

Thanks a lot in advance for your kind help.


