Hi,
I am using GROMACS 2019.4. I have been running all-atom simulations of a box 
that contains a peptide embedded in a DOPC bilayer membrane. I have been 
running for weeks on a TACC machine that has 4 GPUs in a single node, so I 
usually run 4 trajectories per node using the -multidir option. My submission 
script is:

#!/bin/bash
#SBATCH -J sb16               # Job name
#SBATCH -o test.o%j           # Stdout file name
#SBATCH -e test.e%j           # Stderr file name
#SBATCH -N 1                  # Total number of nodes requested
#SBATCH -n 4                  # Total number of MPI tasks requested
#SBATCH -p rtx                # Queue (partition) name
#SBATCH -t 48:00:00           # Run time (hh:mm:ss)
module load cuda/10.1
module use -a /home1/01247/alfredo/Software/ForGPU/plumed-2.5.3/MyInstall/lib/plumed/ModuleFile
module load plumed_gpu
export OMP_NUM_THREADS=4
ibrun /home1/01247/alfredo/Software/gromacs-2019.4_gpu/build-gpu-mpi-plumed/My_install/bin/mdrun_mpi \
  -s topol.tpr -plumed plumed.dat -multidir 1 2 3 4


Because that system is going to be down for a week, I want to do continuation 
runs on a slower machine, also with GPUs. Since that machine is slower, I want 
to run on two nodes. A script that I have used successfully on that older 
machine is:

#!/bin/bash
#SBATCH -J SB9_pi1            # Job name
#SBATCH -o test.o%j           # Stdout file name
#SBATCH -N 2                  # Total number of nodes requested
#SBATCH -n 2                  # Total number of MPI tasks requested
#SBATCH -p gpu                # Queue (partition) name
#SBATCH -t 24:00:00           # Run time (hh:mm:ss)
module load gcc/5.2.0
module load cray_mpich/7.7.3
module load cuda/9.0
# Launch MPI-based executable
export OMP_NUM_THREADS=6
ibrun /home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi \
  -s topol2.tpr -pin on -cpi state.cpt -noappend

This works well if I set up a new simulation of the same molecular system 
(i.e., create a new .tpr file). But if I attempt a continuation run coming 
from the other machine (which used 4 threads), I get:

Not all bonded interactions have been properly assigned to the domain 
decomposition cells
A list of missing interactions:
                Bond of  10801 missing     -5
                 U-B of  53187 missing     22
         Proper Dih. of  89703 missing    119
               LJ-14 of  73729 missing      3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

And it stops. If I modify the old-machine script to use 1 node, 1 task, and 4 
threads, it runs fine, but it is a lot slower.
My question is whether there is any way to avoid this error, so that I can do 
a continuation run from state.cpt with a different domain decomposition. I 
have seen it suggested on this list to use -rdd. The value printed in the log 
file is 1.595 nm; I increased it to 2.0 and got a similar error.
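For concreteness, the launch line I am attempting on the old machine (same 
paths and flags as the script above, with the -rdd value I tried added) looks 
like this:

```shell
# Continuation run restarting from state.cpt, with the bonded-interaction
# assignment distance raised to 2.0 nm via -rdd (the log reported 1.595 nm).
ibrun /home1/01247/alfredo/gromacs-2019.4/build_MPI/My_install/bin/mdrun_mpi \
  -s topol2.tpr -pin on -cpi state.cpt -noappend -rdd 2.0
```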

Thanks,

Alfredo


