Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)
Hi Mark and Szilard,

Thanks for both of your suggestions; they are very helpful.

> Neither run had a PP-PME work distribution suitable for the hardware it was running on (and fixing that for each run requires opposite changes). Adding a GPU and hoping to see scaling requires that there be proportionately more GPU work available to do, *and* enough absolute work to do. mdrun tries to do this, and reports early in the log file, which is one of the reasons Szilard asked to see whole log files - please use a file sharing service to do that.

This task involves GPU calculation; we had not looked at the PP-PME work distribution. This is a good hint from the angle of PP-PME work distribution. I guessed that the two GPUs' calculations finish too fast, or that there is not enough work for the GPUs, which is in line with your explanation. Please see the logs below again.

ONE GPU## http://pastebin.com/B6bRUVSa
TWO GPUs## http://pastebin.com/SLAYnejP

As you can see, this test was made on the same node, so networking is not a factor. Can the performance be improved by, say, 50% or more when 2 GPUs are used on a general task? If yes, how?

> Indeed, as Richard pointed out, I was asking for *full* logs; these summaries can't tell much. The table above the summary entitled "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as other information reported across the log file, is what I need to make an assessment of your simulations' performance.

Please see below.

> However, in your case I suspect that the bottleneck is multi-threaded scaling on the AMD CPUs, and you should probably decrease the number of threads per MPI rank and share GPUs between 2-4 ranks.

After testing all three clusters, I found it may NOT be an issue of AMD CPUs: Intel CPUs have the SAME scaling issue. However, I am curious how you justify the setup of 2-4 ranks sharing GPUs. Can you please explain it a bit more?

> NUMA effects on multi-socket AMD processors are particularly severe; the way GROMACS uses OpenMP is not well suited to them.
> Using a rank (or two) per socket will greatly reduce those effects, but introduces different algorithmic overhead from the need to do DD and explicitly communicate between ranks. (You can see the latter in your .log file snippets below.) Also, that means the parcel of PP work available from a rank to give to the GPU is smaller, which is the opposite of what you'd like for GPU performance and/or scaling. We are working on a general solution for this and lots of related issues in the post-5.0 space, but there is a very hard limitation imposed by the need to amortize the cost of CPU-GPU transfer by having lots of PP work available to do.

Is this the reason why scaling to two GPUs does not happen, because of the smaller PP workload per rank? From that implication, I wonder if we can increase the PP workload through parameters in an mdp file. Which parameters most affect the PP workload? Would you please give more specific suggestions?

> > NOTE: The GPU has 20% more load than the CPU. This imbalance causes performance loss, consider using a shorter cut-off and a finer PME grid.
>
> This note needs to be addressed before maximum throughput is achieved and the question of scaling is worth considering. Ideally, "Wait GPU local" should be nearly zero, achieved as suggested above. Note that launch+force+mesh+wait is the sum of the GPU total! But much of the information needed is higher up the log file, and the whole question is constrained by things like rvdw.

The note clearly suggests a shorter cut-off and a finer PME grid. I am not sure how to set up a finer PME grid, but I am able to set shorter cut-offs. However, it is risky to do so based on others' reports. Indeed, I see differences among tests for 1 GPU. Here "cut-offs" refers to rlist, rvdw and rcoulomb. I found that the smaller the cut-off values, the faster the computation. The question is how small they can go, because it is interesting to know whether these different cut-offs generate equally good simulations.
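For reference, "a finer PME grid" just means a smaller fourierspacing, and when the electrostatics cut-off is shortened the grid spacing has to shrink in proportion to keep the same accuracy. A hypothetical mdp fragment (values purely illustrative, scaled from the 1.2 nm / 0.16 nm setup quoted later in this digest; recent mdrun versions attempt this rebalancing automatically via PME tuning):

```
; Shift work from the GPU (short-range nonbonded) to the CPU (PME mesh):
; a shorter cut-off means less GPU work, a finer grid means more CPU work.
; Scale rcoulomb and fourierspacing by the same factor to keep accuracy.
rcoulomb       = 1.0     ; was 1.2
rvdw           = 1.0
fourierspacing = 0.133   ; was 0.16, scaled by 1.0/1.2
```

Note that with the Verlet scheme the pair-list radius is managed automatically, so rlist does not need hand-tuning here.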
As for two GPUs: when I set larger cut-offs, the two GPUs in the same node were kept very busy, yet the outcome in that configuration was worse in terms of ns/day and wall time. So what does a finer PME grid mean with respect to GPU workload?

You mention that the GPU total is the sum of launch + force + mesh + wait. I thought the PME mesh was computed by the CPU rather than the GPU; am I missing something here? My understanding is that the GPU is responsible for the short-range non-bonded forces, whereas the CPU handles the bonded and PME long-range forces. Can you clarify? Also, would rvdw play an important role in improving GPU performance?

> Unfortunately you didn't copy the GPU timing stuff here! Roughly, all the performance gain you are seeing here is eliminating most of the single-GPU wait gpu term by throwing
Re: mdrun on 8-core AMD + GTX TITAN (was: Re: [gmx-users] Re: Gromacs-4.6 on two Titans GPUs)
Hi Szilard,

Thank you very much for your suggestions.

> Actually, I was jumping to conclusions too early: as you mentioned an AMD cluster, I assumed you must have 12-16-core Opteron CPUs. If you have an 8-core (desktop?) AMD CPU, then you may not need to run more than one rank per GPU.

Yes, we have independent clusters of AMD, AMD Opteron and Intel Core i7. All nodes of the three clusters have at least 1 GPU card, and I have run the same test on all three. Let's focus on a basic scaling issue: one GPU vs. two GPUs within the same 8-core AMD node. Using 1 GPU, we get a performance of ~32 ns/day. Using two GPUs, we gain not much more (~38.5 ns/day), about 20% more performance. Even that is not reliable, because in some tests I saw only 2-5% more, which really surprised me.

As you can see, this test was made on the same node, so networking is not a factor. Can the performance be improved by, say, 50% or more when 2 GPUs are used on a general task? If yes, how?

> Indeed, as Richard pointed out, I was asking for *full* logs; these summaries can't tell much. The table above the summary entitled "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G", as well as other information reported across the log file, is what I need to make an assessment of your simulations' performance.

Please see below.

> However, in your case I suspect that the bottleneck is multi-threaded scaling on the AMD CPUs, and you should probably decrease the number of threads per MPI rank and share GPUs between 2-4 ranks.

After testing all three clusters, I found it may NOT be an issue of AMD CPUs: Intel CPUs have the SAME scaling issue. However, I am curious how you justify the setup of 2-4 ranks sharing GPUs. Can you please explain it a bit more?

> You could try running
>     mpirun -np 4 mdrun -ntomp 2 -gpu_id 0011
> but I suspect this won't help because of your scaling issue.

Your guess is correct, but why is that? It is worse.
> The more nodes are involved in a task, the worse the performance. In my experience even reaction-field runs don't scale across nodes with 10G ethernet if you have more than 4-6 ranks per node trying to communicate (let alone with PME).

What does "let alone with PME" mean? How would I address that with mdrun? I do know mdrun -npme to specify the number of PME processes.

Thank you,
Dwey

### One GPU

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes  Th.  Count   Wall t (s)    G-Cycles     %
---------------------------------------------------------------------
 Neighbor search         1    8     11      431.817    13863.390   1.6
 Launch GPU ops.         1    8    501      472.906    15182.556   1.7
 Force                   1    8    501     1328.611    42654.785   4.9
 PME mesh                1    8    501    11561.327   371174.090  42.8
 Wait GPU local          1    8    501     6888.008   221138.111  25.5
 NB X/F buffer ops.      1    8    991     1216.499    39055.455   4.5
 Write traj.             1    8   1030       12.741      409.039   0.0
 Update                  1    8    501     1696.358    54461.226   6.3
 Constraints             1    8    501     1969.726    63237.647   7.3
 Rest                    1                 1458.820    46835.133   5.4
---------------------------------------------------------------------
 Total                   1                27036.812   868011.431 100.0
---------------------------------------------------------------------
 PME spread/gather       1    8   1002     6975.086   223933.739  25.8
 PME 3D-FFT              1    8   1002     3928.259   126115.976  14.5
 PME solve               1    8    501      636.488    20434.327   2.4
---------------------------------------------------------------------

 GPU timings
---------------------------------------------------------------------
 Computing:                   Count   Wall t (s)   ms/step      %
---------------------------------------------------------------------
 Pair list H2D                   11       43.435     0.434    0.2
 X / q H2D                      501      567.168     0.113    2.8
 Nonbonded F kernel             400    14174.316     3.544   70.8
 Nonbonded F+ene k.              90     4314.438     4.794   21.5
 Nonbonded F+ene+prune k.        11      572.370     5.724    2.9
 F D2H                          501      358.120     0.072    1.8
---------------------------------------------------------------------
 Total                                 20029.846     4.006  100.0
---------------------------------------------------------------------

Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
For optimal performance this ratio should be close to 1!
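As a sanity check on that last summary line, the reported imbalance is just the ratio of the per-step GPU and CPU force times; a quick recomputation from the numbers in the log above:

```python
# Recompute the GPU/CPU force-evaluation balance reported by mdrun
# from the per-step timings in the log above: 4.006 ms of GPU nonbonded
# work vs. 2.578 ms of CPU force work. A ratio well above 1 means the
# CPU finishes early each step and idles in "Wait GPU local".
gpu_ms_per_step = 4.006
cpu_ms_per_step = 2.578

ratio = gpu_ms_per_step / cpu_ms_per_step
print(f"GPU/CPU = {ratio:.3f}")  # matches the 1.554 printed in the log
```

This is why "Wait GPU local" accounts for 25.5% of the wall time: roughly a third of each step the CPU has nothing to do.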
[gmx-users] Re: Hardware for best gromacs performance?
Hi Timo,

Can you provide a benchmark with 1 Xeon E5-2680 and 1 Nvidia K20X GPGPU on the same 29420-atom test? Are these two GPU cards (within the same node) connected by SLI (Scalable Link Interface)?

Thanks,
Dwey

--
View this message in context: http://gromacs.5086.x6.nabble.com/Hardware-for-best-gromacs-performance-tp5012124p5012276.html
Sent from the GROMACS Users Forum mailing list archive at Nabble.com.
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
[gmx-users] Re: Gromacs-4.6 on two Titans GPUs
Hi Szilard,

Thanks for your suggestions. I am indeed aware of that page. On an 8-core AMD with 1 GPU, I am very happy with the performance (see below); my intention is to obtain an even better one because we have multiple nodes.

### 8-core AMD with 1 GPU

Force evaluation time GPU/CPU: 4.006 ms/2.578 ms = 1.554
For optimal performance this ratio should be close to 1!

NOTE: The GPU has 20% more load than the CPU. This imbalance causes
performance loss, consider using a shorter cut-off and a finer PME grid.

               Core t (s)   Wall t (s)      (%)
       Time:   216205.510    27036.812    799.7
                             7h30:36
                 (ns/day)    (hour/ns)
Performance:       31.956        0.751

### 8-core AMD with 2 GPUs

               Core t (s)   Wall t (s)      (%)
       Time:   178961.450    22398.880    799.0
                             6h13:18
                 (ns/day)    (hour/ns)
Performance:       38.573        0.622

Finished mdrun on node 0 Sat Jul 13 09:24:39 2013

> However, in your case I suspect that the bottleneck is multi-threaded scaling on the AMD CPUs, and you should probably decrease the number of threads per MPI rank and share GPUs between 2-4 ranks.

OK, but can you give an example mdrun command for an 8-core AMD with 2 GPUs? I will try it again.

> Regarding scaling across nodes, you can't expect much from gigabit ethernet - especially not from the cheaper cards/switches. In my experience even reaction-field runs don't scale across nodes with 10G ethernet if you have more than 4-6 ranks per node trying to communicate (let alone with PME). However, on InfiniBand clusters we have seen scaling to 100 atoms/core (at peak).

From your comments, it sounds like a cluster of AMD CPUs is difficult to scale across nodes in our current setup. Let's assume we install InfiniBand (20 or 40 Gb/s) in the same system of 16 nodes, each an 8-core AMD with 1 GPU. Considering the same AMD system, what is a good way to obtain better performance when we run a task across nodes? In other words, what does the mdrun_mpi command look like?
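For what it's worth, a single-node command line for the 8-core box with two GPUs might look like the following sketch (thread-MPI mdrun from GROMACS 4.6; the rank/thread counts are the things to experiment with, not a recommendation):

```
# Two thread-MPI ranks, one per GPU, 4 OpenMP threads each (8-core CPU):
mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol

# Four ranks sharing the two GPUs (two ranks per GPU), 2 threads each.
# This can reduce NUMA/OpenMP scaling losses on AMD at the cost of
# domain-decomposition overhead:
mdrun -ntmpi 4 -ntomp 2 -gpu_id 0011 -deffnm topol
```

The -gpu_id string maps one character per PP rank to a GPU index, which is how several ranks end up sharing one device.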
Thanks,
Dwey
[gmx-users] Re: Hardware for best gromacs performance?
Hi Szilard,

Thanks. From Timo's benchmark:

  1 node           142 ns/day
  2 nodes FDR14    218 ns/day
  4 nodes FDR14    257 ns/day
  8 nodes FDR14    326 ns/day

It looks like an InfiniBand network is required in order to scale up when running a task across nodes. Is that correct?

Dwey
[gmx-users] RE: average pressure of a system
I carried out independent NPT runs with tau_p values of 1.5, 1.0 and 0.5:

## tau_p = 1.5
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure    2.62859        2.6   185.68     2.67572  (bar)

## tau_p = 1.0
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure   0.886769        1.7  187.737       0.739  (bar)

## tau_p = 0.5
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure    2.39911        2.2  185.708      6.8189  (bar)

It is clear that when tau_p = 1.0, the average pressure of the system (0.89 bar) is close to ref_p = 1.0 bar. However, it is unclear to me how to choose a tau_p value that brings the average close to ref_p. As shown above, the average pressures at tau_p = 1.5 and 0.5 are both much higher than at tau_p = 1.0. A smaller tau_p may or may not help.

> As has been mentioned a number of times, 0.9 +- 190 and 2.3 +- 190 are not statistically different. If you use that in a publication then any conclusions based on it will be rejected.

Statistically, I understand that the resulting average pressures are indistinguishable. Here, I altered tau_p to determine whether it helps stabilize the average pressure at the desired value.

> To demonstrate to yourself how variable the pressure is, take the tau_p = 1 run and repeat the pressure analysis with g_analyze, using only the first half and then only the last half of the trajectory. You will find that the average values for the two parts of the trajectory are not the same.

Thank you for the suggestion of applying g_analyze to the trajectory.

Another issue caused by system pressure concerns the PBC box size. Since I use pressure coupling, the box size is not fixed, and the protein moved away from the center of the membrane over a long simulation (~30 ns).

> That is not due to the pressure coupling.

The changed box size is problematic, because I see that molecules are split. During the NPT run, the box dimensions changed over time from (7.12158, 7.14945, 9.0) to (6.43804, 6.46323, 8.28666). This is because of pressure coupling.
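To make the statistical point concrete, here is a back-of-envelope check (a hypothetical helper, using the Err.Est. values from the g_energy output above) showing that the tau_p runs cannot be told apart:

```python
def indistinguishable(mean_a, err_a, mean_b, err_b, n_sigma=2.0):
    """Crude check: two averages are statistically indistinguishable
    when their difference lies within n_sigma combined error estimates."""
    combined = (err_a**2 + err_b**2) ** 0.5
    return abs(mean_a - mean_b) <= n_sigma * combined

# tau_p = 1.5 gave 2.62859 +/- 2.6 bar; tau_p = 1.0 gave 0.886769 +/- 1.7 bar
print(indistinguishable(2.62859, 2.6, 0.886769, 1.7))  # True
```

The difference of ~1.7 bar sits well inside the combined uncertainty of ~3.1 bar, so no conclusion about tau_p can be drawn from these averages.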
See also the note at http://www.gromacs.org/Documentation/Errors#The_cut-off_length_is_longer_than_half_the_shortest_box_vector_or_longer_than_the_smallest_box_diagonal_element._Increase_the_box_size_or_decrease_rlist

> Motion of the protein within the box is simply due to diffusion etc. Also remember that you have in effect an infinite repeating box in all directions, so the center of the box is arbitrary.

If so, how can a membrane protein be kept relatively fixed (embedded) in the bilayer, without escaping during the simulation? This protein was in fact embedded in the membrane with g_membed. Due to diffusion(?), the protein moved away from the bilayer and escaped toward the extracellular space. Is there a way to fix it, or to allow the protein to diffuse only in the xy plane rather than the z direction?

> If you want the protein to remain in the center for visualisation purposes, then you do post-processing on the box using trjconv.

Thanks, but this does not change the fact that the protein moved away from the bilayer during a long simulation.

> > Box size changes significantly during production MD. Is there a way to fix the box size at the very beginning? (Although turning off pressure coupling will fix the box size.)
>
> If you want fixed box dimensions / volume then you perform NVT. But that will not help with either issue above.

Right, the box dimensions remain unchanged if pressure coupling is removed in production MD. However, can that be justified for a membrane-protein system? The purpose of pressure coupling is to stabilize the pressure and density. For example, over a 10 ns simulation the average pressure of this system is -5.55 bar, which is less convincing:

Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure   -5.55572        2.6  155.552    0.846162  (bar)

Thanks,
Dwey

> The problem here is you are trying to make comparisons in the behaviour of simulations where there will not be a statistically significant difference in the property you are adjusting.
> Any differences you observe are more than likely going to be due to chance, rather than pressure.

Catch ya,

Dr. Dallas Warren
Drug Delivery, Disposition and Dynamics
Monash Institute of Pharmaceutical Sciences, Monash University
381 Royal Parade, Parkville VIC 3052
[hidden email]
+61 3 9903 9304
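On the trjconv suggestion above: the usual post-processing for visualisation is along these lines (a sketch; the file names are placeholders and the group choices depend on your system):

```
# Re-wrap molecules whole and keep the protein centered in the box.
# When prompted, pick "Protein" as the group to center on and
# "System" as the group to output:
trjconv -s topol.tpr -f traj.xtc -o traj_centered.xtc -pbc mol -center
```

This only changes how the trajectory is displayed; the physics, including diffusion of the protein relative to the bilayer, is unaffected.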
[gmx-users] RE: average pressure of a system
Justin Lemkul wrote on 9/11/13 12:12 AM, in reply to Dwey Kauffman:

> True, but thermostats allow temperatures to oscillate on the order of a few K, and that doesn't happen on the macroscopic level either. Hence the small disconnect between a system that has thousands of atoms and one that has millions or trillions. Pressure fluctuations decrease on the order of sqrt(N), so the system size itself is a determining factor for the pressure fluctuations. As previous discussions have rightly concluded, pressure is a somewhat ill-defined quantity in molecular systems like these.

Does this also imply that it is not a good idea to study the relationship between dimer (multimer) dissociation and macroscopic pressure in this case, due to the ill-defined pressure?

> I would simply think it would be very hard to draw any meaningful conclusions if they depend on a microscopic quantity that varies so strongly.

It is hard to justify assigning a set of ref_p values (0.7, 0.8, 0.9, 1.0, 1.1, 1.2), performing independent simulations, and then comparing the targeted quantities among the outcomes.

> As with the original issue, I would find it hard to believe that any of the differences observed in such a setup would be meaningful. Is 0.7 ± 100 actually different from 1.2 ± 100?

> > You could try altering tau_p, but I doubt there is any value in doing so. I would give it a try.

> This will really only change the relaxation time. Smaller values of tau_p may improve the average slightly, but may also (more likely) lead to instability, especially with Parrinello-Rahman.

I carried out independent NPT runs with tau_p values of 1.5, 1.0 and 0.5:

## tau_p = 1.5
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure    2.62859        2.6   185.68     2.67572  (bar)

## tau_p = 1.0
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure   0.886769        1.7  187.737       0.739  (bar)

## tau_p = 0.5
Energy      Average   Err.Est.
    RMSD   Tot-Drift
---
Pressure    2.39911        2.2  185.708      6.8189  (bar)

It is clear that when tau_p = 1.0, the average pressure of the system (0.89 bar) is close to ref_p = 1.0 bar. However, it is unclear to me how to choose a tau_p value that brings the average close to ref_p. As shown above, the average pressures at tau_p = 1.5 and 0.5 are both much higher than at tau_p = 1.0. A smaller tau_p may or may not help.

Another issue caused by system pressure concerns the PBC box size. Since I use pressure coupling, the box size is not fixed, and the protein moved away from the center of the membrane over a long simulation (~30 ns). The box size changes significantly during production MD. Is there a way to fix the box size at the very beginning? (Although turning off pressure coupling will fix the box size.)

Best regards,
Dwey
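On the sqrt(N) point above: the relative fluctuation of an extensive quantity shrinks as 1/sqrt(N), which is why a simulation box can never show macroscopic-looking pressure behaviour. A quick estimate (the particle counts are illustrative):

```python
import math

def fluctuation_ratio(n_small, n_large):
    # Relative fluctuations scale as 1/sqrt(N), so the smaller system
    # fluctuates sqrt(n_large / n_small) times more than the larger one.
    return math.sqrt(n_large / n_small)

# A ~30,000-atom simulation box vs. a barely macroscopic droplet of
# ~1e21 molecules: the simulation's pressure fluctuations are larger
# by a factor of hundreds of millions.
print(f"{fluctuation_ratio(3.0e4, 1.0e21):.3g}x larger fluctuations")
```

Seen this way, an RMSD of ~180 bar around a ~1 bar average is exactly what the system size dictates, not a sign of a broken barostat.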
[gmx-users] RE: average pressure of a system
Hi Dallas and Justin,

Thanks for the reply. Yes, I did plot the pressure over time with g_energy, and I am aware of the note at http://www.gromacs.org/Documentation/Terminology/Pressure

I am concerned about the average pressure because our experiments show that our target membrane protein is a hexamer, and our observation is that variation of the system pressure seems to cause hexamer or dimer dissociation; the system is quite sensitive to pressure fluctuations. Such fluctuations certainly draw my attention in this specific case, because life does not exist at large variations of system pressure. If not for the multimer dissociation likely caused by pressure fluctuations, I would agree with both of you.

I also ran longer simulations of 20 ns and 30 ns:

### 20 ns
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure   0.886396       0.84  162.655     1.38476  (bar)

## 30 ns
Energy      Average   Err.Est.     RMSD   Tot-Drift
---
Pressure    1.69086       0.58  162.879     3.35668  (bar)

Running longer simulations does not seem to improve the average pressure very much. If I need to modify the mdp file, what should it be?
Many thanks,
Dwey

My mdp file for NPT used in the simulation:

define                  = -DPOSRES
integrator              = md
nsteps                  = 50
dt                      = 0.002
nstxout                 = 100
nstvout                 = 100
nstenergy               = 100
nstlog                  = 100
continuation            = yes
constraint_algorithm    = lincs
constraints             = all-bonds
lincs_iter              = 1
lincs_order             = 4
ns_type                 = grid
nstlist                 = 5
rlist                   = 1.2
rcoulomb                = 1.2
rvdw                    = 1.2
coulombtype             = PME
pme_order               = 4
fourierspacing          = 0.16
tcoupl                  = Nose-Hoover
tc-grps                 = Protein DPPC SOL_CL
tau_t                   = 0.5 0.5 0.5
ref_t                   = 323 323 323
pcoupl                  = Parrinello-Rahman
pcoupltype              = semiisotropic
tau_p                   = 5.0
ref_p                   = 1.0 1.0
compressibility         = 4.5e-5 4.5e-5
pbc                     = xyz
DispCorr                = EnerPres
gen_vel                 = no
nstcomm                 = 1
comm-mode               = Linear
comm-grps               = Protein_DPPC SOL_CL
refcoord_scaling        = com
cutoff-scheme           = Verlet
[gmx-users] RE: average pressure of a system
> True, but thermostats allow temperatures to oscillate on the order of a few K, and that doesn't happen on the macroscopic level either. Hence the small disconnect between a system that has thousands of atoms and one that has millions or trillions. Pressure fluctuations decrease on the order of sqrt(N), so the system size itself is a determining factor for the pressure fluctuations. As previous discussions have rightly concluded, pressure is a somewhat ill-defined quantity in molecular systems like these.

Does this also imply that it is not a good idea to study the relationship between dimer (multimer) dissociation and macroscopic pressure in this case, due to the ill-defined pressure? It is hard to justify assigning a set of ref_p values (0.7, 0.8, 0.9, 1.0, 1.1, 1.2), performing independent simulations, and then comparing the targeted quantities among the outcomes.

> You could try altering tau_p, but I doubt there is any value in doing so.

I would give it a try. Thanks for the hint.

Dwey
[gmx-users] Re: GPU version of Gromacs
Hi Grita,

Yes, it is. You need to re-compile a GPU version of GROMACS from source. You also need to use the Verlet cut-off scheme; that is, add a line like

cutoff-scheme = Verlet

to your mdp file. Finally, run the GPU version of mdrun, adding the parameter -gpu_id 0 if you have one GPU in your box.

Hope this helps,
Dwey
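For completeness, the recompile step might look like the following sketch (paths and generator are illustrative; this assumes CMake and the CUDA toolkit are installed):

```
# From an out-of-source build directory inside the GROMACS 4.6 tree:
cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
make
make install
```

GMX_GPU=ON enables the CUDA non-bonded kernels; without it, mdrun is built CPU-only and will ignore any -gpu_id argument.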