Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
On Tue, Aug 26, 2014 at 7:51 AM, Theodore Si <sjyz...@gmail.com> wrote:
> Hi Szilárd,
>
> But CUDA 5.5 won't work with icc 14, right? It only works with icc 12.1 unless a header of CUDA 5.5 is modified.
>
> Theo

Sure, but I don't see how gcc 4.4 + CUDA 5.0 is superior to [a recent compiler that nvcc supports] + CUDA 5.5. Additionally, as I said before, gcc 4.8 will almost certainly outperform icc.

Cheers,
--
Szilárd

> [snip]
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si <sjyz...@gmail.com> wrote:
>> Hi,
>>
>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUki&ithint=file%2clog
>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
>>
>> These are 2 log files. The first one is using 64 CPU cores (64 / 16 = 4 nodes) and 4 nodes * 2 = 8 GPUs, and the second is using 512 CPU cores, no GPU. When we look at the 64-core log file, we find that in the R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G table, the total wall time is the sum of every line, that is 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that when the CPUs are doing PME, the GPUs are doing nothing. That's why we say they are working sequentially. Please note that sequential means one phase after another.
>
> Your log files don't show the timing breakdown for the GPUs, which is distinct from showing that the GPUs ran and then the CPUs ran (which I don't think the code even permits!). References to CUDA 8x8 kernels do show the GPU was active. There was an issue with mdrun not always being able to gather and publish the GPU timing results; I don't recall the conditions (Szilard might remember), but it might be fixed in a later release.

It is a limitation (well, I'd say borderline bug) in CUDA that if you have multiple work-queues (= streams), reliable timing using the CUDA built-in mechanisms is impossible. There may be a way to work around this, but that won't happen in the current versions. What's important is to observe the wait time on the CPU side, and of course, if the OP is profiling this is not an issue.

> In any case, you should probably be doing performance optimization on a GROMACS version that isn't a year old. I gather that you didn't actually observe the GPUs idle, e.g. with a performance monitoring tool? Otherwise, and in the absence of a description of your simulation system, I'd say that log file looks somewhere between normal and optimal. For the record, for better performance, you should probably be following the advice of the install guide and not compiling FFTW with AVX support, and using one of the five gcc minor versions released since 4.4 ;-)

And besides avoiding ancient gcc versions, I suggest using CUDA 5.5 (which you can use because you have a version 5.5 driver, as I see in your log file). Additionally, I suggest avoiding MKL and using FFTW instead. For the grid sizes of interest to us, all benchmarks I did in the past showed considerably higher FFTW performance. Same goes for icc, but feel free to benchmark, and please report back if you find the opposite.

>> As for the 512-core log file, the total wall time is approximately the sum of PME mesh and PME wait for PP. We think this is because the PME-dedicated nodes finished early, and the total wall time is the time spent on PP nodes, therefore the time spent on PME is covered.
>
> Yes, using an offload model makes it awkward to report CPU timings, because there are two kinds of CPU ranks. The total of the Wall t column adds up to twice the total time taken (which is noted explicitly in more recent mdrun versions). By design, the PME ranks do finish early, as you know from Figure 3.16 of the manual. As you can see in the table, the PP ranks spend 26% of their time waiting for the results from the PME ranks, and this is the origin of the note (above the table) that you might want to balance things better.
>
> Mark
>
> [snip]
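To make the build advice above concrete: a configure line along these lines would select gcc, CUDA 5.5 and a CMake-built FFTW instead of MKL. This is a minimal sketch only, not the OP's configuration; the toolkit path is a placeholder, and GMX_BUILD_OWN_FFTW lets CMake build FFTW with the compile flags the install guide recommends.

  # Sketch of a GROMACS 4.6/5.0-era configure with gcc + CUDA 5.5 + FFTW
  # (CUDA path is a placeholder; adjust to your installation).
  CC=gcc CXX=g++ cmake .. \
    -DGMX_GPU=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-5.5 \
    -DGMX_FFT_LIBRARY=fftw3 \
    -DGMX_BUILD_OWN_FFTW=ON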
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Theodore Si <sjyzhxw@...> writes:
> [snip]

Hi,

I have a naive question: in your log file there are only 2 GPUs being detected:

  2 GPUs detected on host gpu42:
    #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
    #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

but in the end you selected 8 GPUs:

  8 GPUs user-selected for this run: #0, #0, #0, #0, #1, #1, #1, #1

Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?

Thank you,
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
On Mon, Aug 25, 2014 at 7:12 PM, Xingcheng Lin <linxingcheng50...@gmail.com> wrote:
> Hi,
>
> I have a naive question: in your log file there are only 2 GPUs being detected:
>
>   2 GPUs detected on host gpu42:
>     #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>     #1: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>
> but in the end you selected 8 GPUs:
>
>   8 GPUs user-selected for this run: #0, #0, #0, #0, #1, #1, #1, #1
>
> Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?

That's an outdated message which is only correct if a single rank/GPU is used. If you use a more up-to-date 4.6.x or 5.0.x version, you'd get something like this instead:

  2 GPUs user-selected for this run.
  Mapping of GPUs to the 8 PP ranks in this node: #0, #0, #0, #0, #1, #1, #1, #1

Cheers,
--
Szilárd
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
I mapped 2 GPUs to multiple MPI ranks by using -gpu_id.

Theo

On 8/26/2014 1:12 AM, Xingcheng Lin wrote:
> Did you choose 8 GPUs or 2 GPUs? What is your mdrun command?
> [snip]
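The OP's actual command line was never posted, but a hypothetical example of that kind of mapping on one of the 16-core, 2-GPU nodes described later in the thread would repeat each device id once per PP rank that should share it:

  # Hypothetical sketch (not the OP's command): 8 PP ranks on one node,
  # 2 OpenMP threads each; ranks 0-3 share GPU #0, ranks 4-7 share GPU #1.
  mpirun -np 8 mdrun_mpi -s topol.tpr -npme 0 -ntomp 2 -gpu_id 00001111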
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi Szilárd,

But CUDA 5.5 won't work with icc 14, right? It only works with icc 12.1 unless a header of CUDA 5.5 is modified.

Theo

On 8/25/2014 9:44 PM, Szilárd Páll wrote:
> [snip]
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi,

https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUki&ithint=file%2clog
https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog

These are 2 log files. The first one is using 64 CPU cores (64 / 16 = 4 nodes) and 4 nodes * 2 = 8 GPUs, and the second is using 512 CPU cores, no GPU.

When we look at the 64-core log file, we find that in the R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G table, the total wall time is the sum of every line, that is 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that when the CPUs are doing PME, the GPUs are doing nothing. That's why we say they are working sequentially.

As for the 512-core log file, the total wall time is approximately the sum of PME mesh and PME wait for PP. We think this is because the PME-dedicated nodes finished early, and the total wall time is the time spent on PP nodes, therefore the time spent on PME is covered.

On 8/23/2014 9:30 PM, Mark Abraham wrote:
> [snip]
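A direct way to check the "GPUs are doing nothing" claim, rather than inferring it from the cycle accounting table, is the kind of performance monitoring tool Mark asks about above; for example, polling utilization on a compute node while mdrun runs:

  # Poll GPU utilization once per second during the run; sustained near-zero
  # utilization while the CPUs do PME would support the "sequential" claim.
  nvidia-smi -l 1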
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi,

When we used 2 GPU nodes (each has 2 CPUs and 2 GPUs) to do an mdrun (with no PME-dedicated node), we noticed that when the CPUs are doing PME, the GPUs are idle, that is, they are doing their work sequentially. Is it supposed to be so? Is it the same reason as GPUs on PME-dedicated nodes won't be used during a run, like you said before?

So if we want to exploit our hardware, we should map PP-PME ranks manually, right? Say, use one node as a PME-dedicated node and leave the GPUs on that node idle, and use two nodes to do the other stuff. What do you think about this arrangement?

Theo

On 8/22/2014 7:20 PM, Mark Abraham wrote:
> [snip]
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyz...@gmail.com> wrote:
> Hi,
>
> When we used 2 GPU nodes (each has 2 CPUs and 2 GPUs) to do an mdrun (with no PME-dedicated node), we noticed that when the CPUs are doing PME, the GPUs are idle,

That could happen if the GPU completes its work too fast, in which case the end of the log file will probably scream about imbalance.

> that is, they are doing their work sequentially.

Highly unlikely, not least because the code is written to overlap the short-range work on the GPU with everything else on the CPU. What's your evidence for *sequential* rather than *imbalanced*?

> Is it supposed to be so?

No, but without seeing your .log files, mdrun command lines and knowing about your hardware, there's nothing we can say.

> Is it the same reason as GPUs on PME-dedicated nodes won't be used during a run, like you said before?

Why would you suppose that? I said GPUs do work from the PP ranks on their node. That's true here.

> So if we want to exploit our hardware, we should map PP-PME ranks manually, right? Say, use one node as a PME-dedicated node and leave the GPUs on that node idle, and use two nodes to do the other stuff. What do you think about this arrangement?

Probably a terrible idea. You should identify the cause of the imbalance, and fix that.

Mark
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi Mark,

Could you tell me why, when we use GPU-CPU nodes as PME-dedicated nodes, the GPUs on such nodes will be idle?

Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:
> Hi,
>
> What Carsten said, if running on nodes that have GPUs. If running on a mixed setup (some nodes with GPU, some not), then arranging your MPI environment to place PME ranks on CPU-only nodes is probably worthwhile. For example, all your PP ranks first, mapped to GPU nodes, then all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder pp_pme.
>
> Mark
>
> [snip]
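As an illustration of the layout Mark describes in the quoted message, here is a hedged sketch using Open MPI hostfile syntax; the hostnames, slot counts and rank counts are invented for the example, and other MPI stacks order ranks differently:

  # hosts.txt -- GPU nodes listed first, CPU-only node last, so that with
  # -ddorder pp_pme (all PP ranks first, then all PME ranks) the 16
  # PME-only ranks land on the CPU-only node.
  gpu01 slots=16
  gpu02 slots=16
  cpu01 slots=16

  # 48 ranks: 32 PP on the two GPU nodes (8 PP ranks per GPU via -gpu_id),
  # 16 PME-only on cpu01.
  mpirun -np 48 --hostfile hosts.txt mdrun_mpi -s topol.tpr \
      -npme 16 -ddorder pp_pme -gpu_id 0000000011111111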
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi,

Because no work will be sent to them. The GPU implementation can accelerate domains from PP ranks on their node, but with an MPMD setup that uses dedicated PME nodes, there will be no PP ranks on nodes that have been set up with only PME ranks. The two offload models (PP work -> GPU; PME work -> CPU subset) do not work well together, as I said.

One can devise various schemes in 4.6/5.0 that could use those GPUs, but they either require

* that each node does both PME and PP work (thus limiting scaling because of the all-to-all for PME, and perhaps making poor use of locality on multi-socket nodes; see the sketch after this message), or

* that all nodes have PP ranks, but only some have PME ranks, and the nodes map their GPUs to PP ranks in a way that is different depending on whether PME ranks are present (which could work well, but relies on the DD load-balancer recognizing and taking advantage of the faster progress of the PP ranks that have better GPU support, and requires that you get very dirty hands laying out PP and PME ranks onto hardware that will later match the requirements of the DD load balancer, and probably that you balance the PP-PME load manually).

I do not recommend the last approach, because of its complexity. Clearly there are design decisions to improve. Work is underway.

Cheers,

Mark

On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyz...@gmail.com> wrote:
> Hi Mark,
>
> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated nodes, the GPUs on such nodes will be idle?
>
> Theo
> [snip]
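A hedged sketch of the first scheme above, with invented numbers (4 nodes of 16 cores and 2 GPUs each; the -gpu_id string must list one device id per PP rank on a node):

  # Every node hosts both PP and PME ranks: 64 ranks in total, 16 of them
  # PME-only; with the default interleaved ordering each node would get
  # 12 PP + 4 PME ranks, the 12 PP ranks sharing the node's 2 GPUs.
  mpirun -np 64 mdrun_mpi -s topol.tpr -npme 16 -ddorder interleave \
      -gpu_id 000000111111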
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi,

How can we designate which CPU-only nodes are to be PME-dedicated nodes? What mdrun options or what configuration should we use to make that happen?

BR,
Theo

On 8/11/2014 9:36 PM, Mark Abraham wrote:
> [snip]
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
On Tue, Aug 19, 2014 at 4:19 PM, Theodore Si <sjyz...@gmail.com> wrote:
> Hi,
>
> How can we designate which CPU-only nodes are to be PME-dedicated nodes?

mpirun -np N mdrun_mpi -npme M

starts N ranks, out of which M will be PME-only and (N-M) will be PP ranks.

> What mdrun options or what configuration should we use to make that happen?

You can change the rank ordering with -ddorder; there are three available patterns. Otherwise, you can do manual rank ordering by telling MPI how to reorder the ranks presented to mdrun itself; e.g. with MPICH you can use the MPICH_RANK_ORDER environment variable.

Cheers,
--
Szilárd

> [snip]
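For example, with invented numbers purely to illustrate the rank split (4 nodes of 16 cores, ignoring GPU mapping):

  # 64 ranks in total, 16 of which become PME-only, leaving 48 PP ranks;
  # where each kind of rank lands is then controlled by -ddorder or by
  # the MPI stack's rank-ordering mechanism.
  mpirun -np 64 mdrun_mpi -s topol.tpr -npme 16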
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?
Hi Mark,

This is the information on our cluster. Could you give us some advice regarding our cluster so that we can make GMX run faster on our system?

Each CPU node has 2 CPUs, and each GPU node has 2 CPUs and 2 Nvidia K20M.

* CPU Node (Intel H2216JFFKR) × 332: CPU: 2× Intel Xeon E5-2670 (8 cores, 2.6GHz, 20MB cache, 8.0GT/s); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung
* Fat Node (Intel H2216WPFKR) × 20: CPU: 2× Intel Xeon E5-2670 (8 cores, 2.6GHz, 20MB cache, 8.0GT/s); Mem: 256GB (16×16GB) ECC Registered DDR3 1600MHz Samsung
* GPU Node (Intel R2208GZ4GC) × 50: CPU: 2× Intel Xeon E5-2670 (8 cores, 2.6GHz, 20MB cache, 8.0GT/s); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung
* MIC Node (Intel R2208GZ4GC) × 5: CPU: 2× Intel Xeon E5-2670 (8 cores, 2.6GHz, 20MB cache, 8.0GT/s); Mem: 64GB (8×8GB) ECC Registered DDR3 1600MHz Samsung
* Computing network: Mellanox InfiniBand FDR core switch (648-port MSX6536-10R, Mellanox Unified Fabric Manager) × 1; Mellanox SX1036 40Gb Ethernet switch (36× QSFP ports) × 1
* Management network: Extreme Summit X440-48t-10G layer-2 switch (48× 1Gb ports, authorized by ExtremeXOS) × 9; Extreme Summit X650-24X layer-3 switch (24× 10Gb ports, authorized by ExtremeXOS) × 1
* Parallel storage: DDN SFA12K storage system × 1
* GPU accelerator: NVIDIA Tesla Kepler K20M × 70
* MIC: Intel Xeon Phi 5110P Knights Corner × 10
* 40Gb Ethernet card: Mellanox MCX314A-BCBT (ConnectX-3, 2× 40Gb ports, enough QSFP cables) × 16
* SSD: Intel SSD910 (400GB, PCIe) × 80

On 8/10/2014 5:50 AM, Mark Abraham wrote:
> That's not what I said: "You can set..." -npme behaves the same whether or not GPUs are in use. Using separate ranks for PME caters to trying to minimize the cost of the all-to-all communication of the 3D FFT. That's still relevant when using GPUs, but if separate PME ranks are used, any GPUs on nodes that only have PME ranks are left idle. The most effective approach depends critically on the hardware and simulation setup, and whether you pay money for your hardware.
>
> Mark
>
> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyz...@gmail.com> wrote:
>> Hi,
>>
>> You mean no matter whether we use GPU acceleration or not, -npme is just a reference? Why can't we set that to an exact value?
>>
>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>> You can set the number of PME-only ranks with -npme. Whether it's useful is another matter :-) The CPU-based PME offload and the GPU-based PP offload do not combine very well.
>>>
>>> Mark
>>>
>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyz...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Can we set the number manually with -npme when using GPU acceleration?
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU+CPU?
Hi,

You could start twice as many MPI processes per node as you have GPUs on a node and use half of all processes for PME, e.g. on 4 nodes:

  mpirun -np 8 mdrun -s in.tpr -npme 4

or start 4 processes per node:

  mpirun -np 16 mdrun -s in.tpr -npme 4 -gpu_id 0011

or with more OpenMP threads for the PME processes:

  mpirun -np 16 mdrun -ntomp 2 -ntomp_pme 6 -pin on -s in.tpr -npme 4 -gpu_id 0011

With -ntomp and -ntomp_pme you can fine-tune the compute-power ratio between PME and PP nodes. You need to try out different combinations to find the optimum; the comments in the md.log file give hints on what to change. This approach usually yields good performance if you use several nodes; on a single node there will for sure be better settings (most likely more MPI processes with fewer OpenMP threads each).

Carsten
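To make that trial-and-error concrete, here is a minimal sweep over thread splits (a sketch only, not from the original thread: it assumes 4 nodes with 2 GPUs and 16 cores each, an MPI build of mdrun named mdrun_mpi, and in.tpr as input; with 2 PP + 2 PME ranks per node, ntomp + ntomp_pme = 8 fills the cores exactly):

  # 16 ranks on 4 nodes, half of them PME-only, so each node runs
  # 2 PP ranks (one per GPU) and 2 PME-only ranks.
  for ntomp in 2 4 6; do
    ntomp_pme=$((8 - ntomp))   # keep 2*ntomp + 2*ntomp_pme = 16 cores per node
    mpirun -np 16 mdrun_mpi -s in.tpr -npme 8 \
           -ntomp $ntomp -ntomp_pme $ntomp_pme -pin on \
           -nsteps 5000 -resethway \
           -g sweep_pp${ntomp}_pme${ntomp_pme}.log
  done
  # Compare the ns/day figure mdrun prints at the end of each log:
  grep -H "Performance" sweep_*.log

-resethway resets the cycle counters halfway through the run, so the reported timings ignore start-up and load-balancing overhead; -nsteps keeps each trial short.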
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU+CPU?
Hi,

What Carsten said, if running on nodes that have GPUs. If running on a mixed setup (some nodes with GPUs, some without), then arranging your MPI environment to place PME ranks on CPU-only nodes is probably worthwhile: all your PP ranks first, mapped to GPU nodes, then all your PME ranks, mapped to CPU-only nodes, and then use mdrun -ddorder pp_pme.

Mark
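As a concrete sketch of that placement (hostnames and slot counts are illustrative, and it assumes Open MPI, which by default fills each host's slots in the order listed; other MPI implementations need their own mapping options):

  # hosts.txt: GPU nodes first (they will receive the PP ranks),
  # the CPU-only node last (it will receive the PME ranks).
  gpu01 slots=2
  gpu02 slots=2
  gpu03 slots=2
  gpu04 slots=2
  cpu01 slots=8

  # With -ddorder pp_pme, ranks 0-7 are PP ranks (one per GPU on the
  # GPU nodes) and ranks 8-15 are PME-only ranks (all on cpu01).
  mpirun -np 16 --hostfile hosts.txt \
         mdrun_mpi -s in.tpr -npme 8 -ddorder pp_pme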
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU+CPU?
Hi,

You mean that whether or not we use GPU acceleration, -npme is only a hint? Why can't we set it to an exact value?
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU+CPU?
That's not what I said. -npme behaves the same whether or not GPUs are in use. Using separate ranks for PME caters to minimizing the cost of the all-to-all communication of the 3D FFT. That's still relevant when using GPUs, but if separate PME ranks are used, any GPUs on nodes that have only PME ranks are left idle. The most effective approach depends critically on the hardware and simulation setup, and on whether you pay money for your hardware.

Mark
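For a rough sense of scale (ballpark figures, not measurements from this thread): the 3D-FFT transpose is an all-to-all among the R ranks doing PME, i.e. on the order of R*(R-1) messages per transpose. Spread over 512 ranks that is roughly 260,000 message pairs; confined to 64 dedicated PME ranks it is about 4,000, which is why shrinking the PME communicator can pay off even though the dedicated PME ranks' other resources sit idle.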
Re: [gmx-users] Can we set the number of pure PME nodes when using GPU+CPU?
You can set the number of PME-only ranks with -npme. Whether it's useful is another matter :-) The CPU-based PME offload and the GPU-based PP offload do not combine very well.

Mark

On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si sjyz...@gmail.com wrote:
Hi, Can we set the number manually with -npme when using GPU acceleration?
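For reference, the flag itself in its simplest forms (rank counts are illustrative, and mdrun_mpi is a placeholder name for your MPI-enabled binary):

  mpirun -np 16 mdrun_mpi -s topol.tpr -npme 4   # 12 PP ranks + 4 PME-only ranks
  mpirun -np 16 mdrun_mpi -s topol.tpr -npme 0   # force no separate PME ranks
  # -npme -1 (the default) lets mdrun estimate a PP/PME split itself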