Re: [gmx-users] Replica Exchange MD on more than 64 processors
Hey, thanks a lot for the quick answers. Installation of MVAPICH 1.2 and compilation and linking of mdrun against its libraries seem to do the trick.

Kind regards
Sebastian

Mark Abraham wrote:
> The OP was using MVAPICH 1.1, which is not the most current version.
> MVAPICH 1.2 claims to scale with near-constant memory usage. I suggest
> an upgrade.
>
> Mark
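For anyone hitting the same wall, here is a minimal sketch of the rebuild Sebastian describes. The install prefix and directory names are assumptions, and --enable-double is dropped per Berk's advice below; adapt to your site.

    # Install MVAPICH 1.2 into a private prefix (path is an assumption).
    cd mvapich-1.2
    ./configure --prefix=$HOME/opt/mvapich-1.2 && make && make install

    # Rebuild GROMACS 4.0.7 against the new MPI compiler wrapper; the
    # flags mirror Sebastian's original build, minus double precision.
    export PATH=$HOME/opt/mvapich-1.2/bin:$PATH CC=mpicc
    cd gromacs-4.0.7
    ./configure --enable-mpi --with-fft=mkl --program-suffix=_mpi
    make mdrun && make install-mdrun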
Re: RE: [gmx-users] Replica Exchange MD on more than 64 processors
----- Original Message -----
From: Berk Hess
Date: Wednesday, February 3, 2010 5:13
Subject: RE: [gmx-users] Replica Exchange MD on more than 64 processors
To: Discussion list for GROMACS users

> Hi,
>
> One issue could be MPI memory usage. I have noticed that many MPI
> implementations use an amount of memory per process that is quadratic
> (!) in the number of processes involved. This can quickly get out of
> hand. But 28 GB is a lot of memory.

The OP was using MVAPICH 1.1, which is not the most current version.
MVAPICH 1.2 claims to scale with near-constant memory usage. I suggest
an upgrade.

Mark

> One thing that might help slightly is to not use double precision,
> which is almost never required. This will also make your simulations
> a factor 1.4 faster.
>
> Berk
>
> [...]
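To put a rough number on the scaling Berk and Mark describe, consider one common mechanism (a model only; the per-peer buffer size b is an assumption, not a measured MVAPICH figure): if each of the N ranks pre-allocates a fixed communication buffer for every peer, then

\[ m_{\mathrm{rank}} = b\,(N-1), \qquad m_{\mathrm{total}} = b\,N\,(N-1) \approx b\,N^{2}, \]

so going from N = 64 to N = 128 roughly quadruples the machine-wide MPI buffer footprint and doubles the cost on every 8-rank node, even though the simulation itself has not grown.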
RE: [gmx-users] Replica Exchange MD on more than 64 processors
Hi,

One issue could be MPI memory usage. I have noticed that many MPI
implementations use an amount of memory per process that is quadratic
(!) in the number of processes involved. This can quickly get out of
hand. But 28 GB is a lot of memory.

One thing that might help slightly is to not use double precision,
which is almost never required. This will also make your simulations a
factor 1.4 faster.

Berk

> Date: Tue, 2 Feb 2010 18:55:37 +0100
> From: breue...@uni-koeln.de
> To: gmx-users@gromacs.org
> Subject: [gmx-users] Replica Exchange MD on more than 64 processors
>
> Dear list,
>
> I recently ran into a problem with a replica exchange simulation. The
> simulation is run with gromacs-mpi version 4.0.7, compiled with the
> following flags: --enable-threads --enable-mpi --with-fft=mkl
> --enable-double, using
>
> Intel compiler version 11.0
> MVAPICH version 1.1.0
> MKL version 10.1
>
> The program works fine in this cluster environment, which consists of
> 32 nodes with 8 processors and 32 GB each. I have already run several
> simulations using the MPI feature.
>
> It seems that I am stuck in a problem similar to one already reported
> on this list by bharat v. adkar in December 2009, without an eventual
> solution:
>
> http://www.mail-archive.com/gmx-users@gromacs.org/msg27175.html
>
> I am doing a replica exchange simulation of a simulation box with 5000
> molecules (81 atoms each) at 4 different temperatures. The simulation
> runs nicely on 64 processors (8 nodes) but stops with an error message
> on 128 processors (16 nodes).
>
> Taking the following four points into account:
>
> 1. every cluster node has at least 28 GB of memory usably available
> 2. the system I am working with should only use 5000*81*900 B =
>    347.614 MB (according to the FAQ)
> 3. even if all 4 replicas ran on the same node, memory usage should
>    be less than 2 GB
> 4. the simulation works fine on 64 processors
>
> it seems to me that the following error
>
> ---
> Program mdrun, VERSION 4.0.7
> Source code file: smalloc.c, line: 179
>
> Fatal error:
> Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
> nlist->jjnr=0xae70b7b0
> (called from file ns.c, line 503)
> ---
>
> has to be caused by something other than missing memory.
>
> I wonder whether anyone else is still facing this problem or has
> already found a solution for this issue.
>
> Kind regards
>
> Sebastian
>
> --
> Sebastian Breuers            Tel: +49-221-470-4108
> EMail: breue...@uni-koeln.de
>
> Universität zu Köln          University of Cologne
> Department für Chemie        Department of Chemistry
> Organische Chemie            Organic Chemistry
>
> Greinstr. 4                  Greinstr. 4
> D-50939 Köln                 D-50939 Cologne, Federal Rep. of Germany
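Sebastian's point 2 is easy to check against the ~900 bytes/atom rule of thumb he cites from the FAQ:

\[ 5000 \times 81 \times 900\ \mathrm{B} = 364\,500\,000\ \mathrm{B} \approx 347.614\ \mathrm{MiB}, \]

i.e. his 347.614 "MB" is that byte count read in MiB. That is nearly two orders of magnitude below the 28 GB usable per node, which is what makes the "Not enough memory" text look like a symptom of something other than a genuinely exhausted machine.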
Re: [gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar wrote:
> On Mon, 28 Dec 2009, David van der Spoel wrote:
> [...]
>
> Yes, the cluster supports runs with more than 8 nodes. I generated a
> system with 10 nm water
Re: [gmx-users] Replica Exchange MD on more than 64 processors
On Mon, 28 Dec 2009, David van der Spoel wrote:
> bharat v. adkar wrote:
> [...]

> > Does non-REMD GROMACS run on more than 64 processors? Does your
> > cluster support using more than 8 nodes in a run? Can you run an MPI
> > "Hello world" application that prints the processor and node ID
> > across more than 64 processors?

Yes, the cluster supports runs with more than 8 nodes. I generated a
system with 10 nm water box and submitted on
Re: [gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar wrote:
> On Mon, 28 Dec 2009, Mark Abraham wrote:
> [...]
>
> > Does non-REMD GROMACS run on more than 64 processors? Does your
> > cluster support using more than 8 nodes in a run? Can you run an MPI
> > "Hello world" application that prints the processor and node ID
> > across more than 64 processors?
>
> Yes, the cluster supports runs with more than 8 nodes. I generated a
> system with 10 nm water box and submitted on 80 processors. It was
> running fine. It printed all 80 NODEIDs. Also showed me when the job
> will get over.
>
> bharat
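bharat's control experiment is straightforward to reproduce with the 4.0-era tools; a sketch, in which md.mdp and an SPC-water topol.top are assumed to exist:

    # Fill a 10 nm cube with water as a throwaway test system; -p updates
    # the molecule count in the topology.
    genbox -cs spc216.gro -box 10 10 10 -o waterbox.gro -p topol.top

    # Prepare the run input and launch a plain (non-REMD) run on >64 ranks.
    grompp -f md.mdp -c waterbox.gro -p topol.top -o waterbox.tpr
    mpiexec -np 80 mdrun_mpi -s waterbox.tpr -v

If this runs and prints all the NODEIDs, plain MPI and domain decomposition past 64 ranks are fine, isolating the failure to the REMD setup or the MPI layer's per-connection memory.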
Re: [gmx-users] Replica Exchange MD on more than 64 processors
On Mon, 28 Dec 2009, Mark Abraham wrote:
> bharat v. adkar wrote:
> [...]
>
> OK, that's a full description. Your symptoms are indicative of someone
> making an error somewhere. Since GROMACS works over more than 64
> processors elsewhere, the presumption is that you are doing something
> wrong or the machine is not set up in the way you think it is or
> should be. To get the most effective help, you need to be sure you're
> providing full information - else we can't tell which error you're
> making or (potentially) eliminate you as a source of error.

Sorry for not being clear in my statements.

> > As far as I can tell, the job distribution is okay: it is 1 job per
> > processor.
>
> Does non-REMD GROMACS run on more than 64 processors? Does your
> cluster support using more than 8 nodes in a run? Can you run an MPI
> "Hello world" application that prints the processor and node ID across
> more than 64 processors?

Yes, the cluster supports runs with more than 8 nodes. I generated a
system with 10 nm water box and submitted on 80 processors. It was
running fine. It printed all 80 NODEIDs. Also showed me when the job
will get over.

bharat
Re: [gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar wrote: On Sun, 27 Dec 2009, Mark Abraham wrote: bharat v. adkar wrote: On Sun, 27 Dec 2009, Mark Abraham wrote: > bharat v. adkar wrote: > > > > Dear all, > > I am trying to perform replica exchange MD (REMD) on a 'protein in > > water' system. I am following instructions given on wiki (How-Tos -> > > REMD). I have to perform the REMD simulation with 35 different > > temperatures. As per advise on wiki, I equilibrated the system at > > respective temperatures (total of 35 equilibration simulations). > > After > > this I generated chk_0.tpr, chk_1.tpr, ..., chk_34.tpr files from the > > equilibrated structures. > > > > Now when I submit final job for REMD with following command-line, it > > gives > > some error: > > > > command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr > > -v > > > > error msg: > > --- > > Program mdrun_mpi, VERSION 4.0.7 > > Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179 > > > > Fatal error: > > Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr, > > nlist->jjnr=0x9a400030 > > (called from file ../../../SRC/src/mdlib/ns.c, line 503) > > --- > > > > Thanx for Using GROMACS - Have a Nice Day > > : Cannot allocate memory > > Error on node 19, will try to stop all the nodes > > Halting parallel program mdrun_mpi on CPU 19 out of 70 > > *** > > > > > > The individual node on the cluster has 8GB of physical memory and 16GB > > of > > swap memory. Moreover, when logged onto the individual nodes, it > > shows > > more than 1GB of free memory, so there should be no problem with > > cluster > > memory. Also, the equilibration jobs for the same system are run on > > the > > same cluster without any problem. > > > > What I have observed by submitting different test jobs with varying > > number > > of processors (and no. of replicas, wherever necessary), that any job > > with > > total number of processors <= 64, runs faithfully without any problem. > > As > > soon as total number of processors are more than 64, it gives the > > above > > error. I have tested this with 65 processors/65 replicas also. > > This sounds like you might be running on fewer physical CPUs than you > have available. If so, running multiple MPI processes per physical CPU > can lead to memory shortage conditions. I don't understand what you mean. Do you mean, there might be more than 8 processes running per node (each node has 8 processors)? But that also does not seem to be the case, as SGE (sun grid engine) output shows only eight processes per node. 65 processes can't have 8 processes per node. why can't it have? as i said, there are 8 processors per node. what i have not mentioned is that how many nodes it is using. The jobs got distributed over 9 nodes. 8 of which corresponds to 64 processors + 1 processor from 9th node. OK, that's a full description. Your symptoms are indicative of someone making an error somewhere. Since GROMACS works over more than 64 processors elsewhere, the presumption is that you are doing something wrong or the machine is not set up in the way you think it is or should be. To get the most effective help, you need to be sure you're providing full information - else we can't tell which error you're making or (potentially) eliminate you as a source of error. As far I can tell you, job distribution seems okay to me. It is 1 job per processor. Does non-REMD GROMACS run on more than 64 processors? Does your cluster support using more than 8 nodes in a run? 
Can you run an MPI "Hello world" application that prints the processor and node ID across more than 64 processors? Mark bharat Mark > I don't know what you mean by "swap memory". Sorry, I meant cache memory.. bharat > > Mark > > > System: Protein + water + Na ions (total 46878 atoms) > > Gromacs version: tested with both v4.0.5 and v4.0.7 > > compiled with: --enable-float --with-fft=fftw3 --enable-mpi > > compiler: gcc_3.4.6 -O3 > > machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux > > > > > > I tried searching the mailing-list without any luck. I am not sure, if > > i > > am doing anything wrong in giving commands. Please correct me if it > > is > > wrong. > > > > Kindly let me know the solution. > > > > > > bharat > > > > > -- gmx-users mailing listgmx-users@gromacs.org http://lists.gromacs.org/mailman/listinfo/gmx-users Please search the archive at http://www.gromacs.org/search before posting! Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org. Can't post? Read http://www.gromacs.org/mailing_lists/users.php
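Mark's last test does not strictly need a custom MPI program: every rank can simply run hostname, and counting the output verifies both the node spread and the ranks-per-node packing. This assumes the interactive mpiexec uses the same launcher and host list as the SGE job does.

    # 80 ranks on 8-core nodes should report 10 distinct nodes, 8 ranks
    # apiece; any node appearing more than 8 times means oversubscription.
    mpiexec -np 80 hostname | sort | uniq -c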
Re: [gmx-users] Replica Exchange MD on more than 64 processors
On Sun, 27 Dec 2009, Mark Abraham wrote:
> bharat v. adkar wrote:
> [...]
>
> > I don't understand what you mean. Do you mean there might be more
> > than 8 processes running per node (each node has 8 processors)? But
> > that does not seem to be the case, as the SGE (Sun Grid Engine)
> > output shows only eight processes per node.
>
> 65 processes can't have 8 processes per node.

Why can't it? As I said, there are 8 processors per node; what I did
not mention is how many nodes the job used. The jobs got distributed
over 9 nodes: 8 of them account for 64 processors, plus 1 processor
from the 9th node.

As far as I can tell, the job distribution is okay: it is 1 job per
processor.

bharat
Re: [gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar wrote:
> On Sun, 27 Dec 2009, Mark Abraham wrote:
> [...]
>
> > This sounds like you might be running on fewer physical CPUs than
> > you have available. If so, running multiple MPI processes per
> > physical CPU can lead to memory shortage conditions.
>
> I don't understand what you mean. Do you mean there might be more than
> 8 processes running per node (each node has 8 processors)? But that
> does not seem to be the case, as the SGE (Sun Grid Engine) output
> shows only eight processes per node.

65 processes can't have 8 processes per node.

Mark

> > I don't know what you mean by "swap memory".
>
> Sorry, I meant cache memory..
>
> bharat
Re: [gmx-users] Replica Exchange MD on more than 64 processors
On Sun, 27 Dec 2009, Mark Abraham wrote:
> bharat v. adkar wrote:
> [...]
>
> This sounds like you might be running on fewer physical CPUs than you
> have available. If so, running multiple MPI processes per physical CPU
> can lead to memory shortage conditions.

I don't understand what you mean. Do you mean there might be more than
8 processes running per node (each node has 8 processors)? But that
does not seem to be the case, as the SGE (Sun Grid Engine) output shows
only eight processes per node.

> I don't know what you mean by "swap memory".

Sorry, I meant cache memory..

bharat
Re: [gmx-users] Replica Exchange MD on more than 64 processors
bharat v. adkar wrote:
> Dear all,
>
> I am trying to perform replica exchange MD (REMD) on a 'protein in
> water' system. I am following the instructions given on the wiki
> (How-Tos -> REMD). I have to perform the REMD simulation with 35
> different temperatures. As per the advice on the wiki, I equilibrated
> the system at the respective temperatures (a total of 35 equilibration
> simulations). After this I generated chk_0.tpr, chk_1.tpr, ...,
> chk_34.tpr files from the equilibrated structures.
>
> Now when I submit the final job for REMD with the following command
> line, it gives an error:
>
> command line: mpiexec -np 70 mdrun -multi 35 -replex 1000 -s chk_.tpr -v
>
> error msg:
> ---
> Program mdrun_mpi, VERSION 4.0.7
> Source code file: ../../../SRC/src/gmxlib/smalloc.c, line: 179
>
> Fatal error:
> Not enough memory. Failed to realloc 790760 bytes for nlist->jjnr,
> nlist->jjnr=0x9a400030
> (called from file ../../../SRC/src/mdlib/ns.c, line 503)
> ---
>
> Thanx for Using GROMACS - Have a Nice Day
> : Cannot allocate memory
> Error on node 19, will try to stop all the nodes
> Halting parallel program mdrun_mpi on CPU 19 out of 70
> ***
>
> The individual nodes on the cluster have 8 GB of physical memory and
> 16 GB of swap memory. Moreover, when logged onto the individual nodes,
> they show more than 1 GB of free memory, so there should be no problem
> with cluster memory. Also, the equilibration jobs for the same system
> ran on the same cluster without any problem.
>
> What I have observed by submitting different test jobs with varying
> numbers of processors (and numbers of replicas, where necessary) is
> that any job with a total number of processors <= 64 runs faithfully
> without any problem. As soon as the total number of processors is more
> than 64, it gives the above error. I have tested this with 65
> processors/65 replicas also.

This sounds like you might be running on fewer physical CPUs than you
have available. If so, running multiple MPI processes per physical CPU
can lead to memory shortage conditions.

I don't know what you mean by "swap memory".

Mark

> System: Protein + water + Na ions (total 46878 atoms)
> Gromacs version: tested with both v4.0.5 and v4.0.7
> compiled with: --enable-float --with-fft=fftw3 --enable-mpi
> compiler: gcc_3.4.6 -O3
> machine details: uname -mpio: x86_64 x86_64 x86_64 GNU/Linux
>
> I tried searching the mailing list without any luck. I am not sure if
> I am doing anything wrong in giving the commands. Please correct me if
> it is wrong.
>
> Kindly let me know the solution.
>
> bharat
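For reference, the wiki recipe bharat follows reduces to one .tpr per replica plus a single -multi launch; a sketch, in which the per-temperature .mdp and .gro file names are assumptions:

    # mdrun -multi appends the replica index to the name given with -s,
    # so the inputs must be named chk_0.tpr ... chk_34.tpr.
    for i in $(seq 0 34); do
        grompp -f equil_T${i}.mdp -c equil_T${i}.gro -p topol.top \
               -o chk_${i}.tpr
    done

    # 70 ranks over 35 replicas = 2 ranks per replica; attempt exchanges
    # every 1000 MD steps.
    mpiexec -np 70 mdrun_mpi -multi 35 -replex 1000 -s chk_.tpr -v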