Szilárd, I was wrong. When I run with the GPU and use -ntomp 4, I get 400% CPU utilization, which yields about 83 ns/day. When I run -ntomp 4 -nb cpu, I get 1600% CPU utilization and similar performance. However, when I run -nt 4 -nb cpu, I get 400% CPU utilization and it is slower. I am running a short test and will send the stats later.
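For reference, a minimal sketch of the three invocations being compared (-deffnm md is a placeholder, and gmx mdrun assumes a 5.x-style wrapper binary). With a thread-MPI build, -ntomp alone lets mdrun start several thread-MPI ranks to fill all hardware threads, which is presumably where the 1600% comes from, while -nt caps the total thread count:

  # GPU nonbondeds, 1 rank x 4 OpenMP threads: 400% CPU, ~83 ns/day
  gmx mdrun -ntomp 4 -deffnm md

  # CPU nonbondeds; extra thread-MPI ranks fill the machine: 1600% CPU
  gmx mdrun -ntomp 4 -nb cpu -deffnm md

  # CPU nonbondeds capped at 4 threads total: 400% CPU, slower
  gmx mdrun -nt 4 -nb cpu -deffnm md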
The stats from the GPU-accelerated run (-ntomp 4) are below. Pretty poor CPU-GPU sync here, actually. I will post the log for the CPU-only run once it finishes.

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time     Giga-Cycles
                     Ranks Threads  Count      (s)        total sum    %
-----------------------------------------------------------------------------
 Neighbor search        1    4    2500001    1881.311     18061.525   1.8
 Launch GPU ops.        1    4  100000001    4713.584     45252.759   4.5
 Force                  1    4  100000001   66892.607    642202.401  63.5
 PME mesh               1    4  100000001   25192.879    241864.204  23.9
 Wait GPU local         1    4  100000001     869.481      8347.456   0.8
 NB X/F buffer ops.     1    4  197500001    2014.227     19337.585   1.9
 COM pull force         1    4  100000001     704.950      6767.871   0.7
 Write traj.            1    4       6118      15.348       147.345   0.0
 Update                 1    4  100000001    1747.965     16781.332   1.7
 Rest                                        1364.705     13101.849   1.3
-----------------------------------------------------------------------------
 Total                                     105397.057   1011864.328 100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1    4  200000002   12874.626    123602.829  12.2
 PME 3D-FFT             1    4  200000002    9285.345     89143.948   8.8
 PME solve Elec         1    4  100000001    2746.973     26372.313   2.6
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                         Count  Wall t (s)   ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                    2500001     124.145     0.050     0.4
 X / q H2D                      100000001    2089.623     0.021     6.0
 Nonbonded F kernel              97000000   30164.146     0.311    86.2
 Nonbonded F+ene k.                500000     227.896     0.456     0.7
 Nonbonded F+prune k.             2000000     708.250     0.354     2.0
 Nonbonded F+ene+prune k.          500001     223.082     0.446     0.6
 F D2H                          100000001    1465.277     0.015     4.2
-----------------------------------------------------------------------------
 Total                                      35002.419     0.350   100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 0.350 ms/0.921 ms = 0.380
For optimal performance this ratio should be close to 1!

NOTE: The GPU has >25% less load than the CPU. This imbalance causes
      performance loss.

               Core t (s)   Wall t (s)        (%)
       Time:   421720.882   105397.057      400.1
                            1d05h16:37
                 (ns/day)    (hour/ns)
Performance:       81.976        0.293

Finished mdrun on rank 0 Thu Jul  2 02:29:57 2015

On Thu, Jul 2, 2015 at 7:57 AM, Szilárd Páll <pall.szil...@gmail.com> wrote:

> I'm curious what the conditions are under which you get such an
> exceptional speedup. Can you share your input files and/or log files?
>
> --
> Szilárd
>
> On Thu, Jul 2, 2015 at 2:18 AM, Alex <nedoma...@gmail.com> wrote:
>
>> Yup, about 7-8 times between with and without GPU acceleration, not
>> making this up: I had 11 ns/day and now get ~80-87 ns/day; the numbers
>> vary a bit. I've been getting a similar boost on our GPU-accelerated
>> cluster node (dual i7, 8 cores each) with two Tesla C2075 cards (I am
>> directing my simulations to one of them via -gpu_id).
>> All runs are -ntomp 4, with or without GPU. The physics in all cases is
>> perfectly acceptable. So far I have only tested my new box on vacuum
>> simulations; I am about to run the solvated version (~30K particles).
>>
>> Alex
>>
>> On Wed, Jul 1, 2015 at 6:09 PM, Szilárd Páll <pall.szil...@gmail.com>
>> wrote:
>>
>>> Hmmm, 8x sounds rather high. Are you sure you are comparing to CPU-only
>>> runs that use proper SIMD-optimized kernels?
>>>
>>> Because of the way offload-based acceleration works, the CPU and GPU
>>> will inherently be executing concurrently only part of the runtime, and
>>> as a consequence the GPU is idle part of the run-time (during
>>> integration+constraints). You can make use of this idle time by running
>>> multiple independent simulations concurrently. This can yield serious
>>> improvements in terms of _aggregate_ simulation performance, especially
>>> with small inputs and many cores (see slide 51: https://goo.gl/7DnSri).
>>>
>>> --
>>> Szilárd
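For concreteness, a minimal sketch of the setup Szilárd describes above: two independent runs sharing GPU 0, pinned to disjoint cores. The directory names, thread counts, and pin offsets are placeholders, and a thread-MPI build is assumed; adjust -pinoffset to your core layout:

  # Two independent simulations sharing one GPU, pinned to different cores
  (cd run1 && gmx mdrun -ntmpi 1 -ntomp 2 -pin on -pinoffset 0 -gpu_id 0 -deffnm md) &
  (cd run2 && gmx mdrun -ntmpi 1 -ntomp 2 -pin on -pinoffset 2 -gpu_id 0 -deffnm md) &
  wait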
>>>
>>> On Wed, Jul 1, 2015 at 4:16 AM, Alex <nedoma...@gmail.com> wrote:
>>>
>>>> I am happy to say that I am getting an 8-fold increase in simulation
>>>> speeds for $200.
>>>>
>>>> An additional question: normally, how many simulations (separate
>>>> mdruns on separate CPU cores) can be performed simultaneously on a
>>>> single GPU? Say, for simulations of 20-40K particles.
>>>>
>>>> The coolers are not even spinning during a single test (mdrun -ntomp
>>>> 4), and I get massive acceleration. They aren't broken, the card is
>>>> just cool (small system, ~3K particles).
>>>>
>>>> Thanks,
>>>>
>>>> Alex
>>>>
>>>> Ah, ok, so you can get a 6-pin from the PSU and another from a
>>>> converted molex connector. That should be just fine, especially as the
>>>> card should not pull more than ~155W (under heavy graphics load) based
>>>> on the Tom's Hardware review* and you are providing 225W max.
>>>>
>>>> *
>>>> http://www.tomshardware.com/reviews/evga-super-super-clocked-gtx-960,4063-3.html
>>>>
>>>> --
>>>> Szilárd
>>>>
>>>> On Tue, Jun 30, 2015 at 7:31 PM, Alex <nedoma...@gmail.com> wrote:
>>>>
>>>> Well, I don't have one like this. What I have instead is this:
>>>>
>>>> 1. A single 6-pin directly from the PSU.
>>>> 2. A single molex to 6-pin (my PSU does provide one molex).
>>>> 3. Two 6-pins to a single 8-pin converter going to the card.
>>>>
>>>> In other words, I can populate both 6-pins on the 6-to-8-pin converter;
>>>> I am just not sure about the pinouts in this case.
>>>>
>>>> Not good?
>>>>
>>>> Alex
>>>>
>>>> What I meant is this: http://goo.gl/8o1B5P
>>>>
>>>> That is 2x molex -> 8-pin PCI-E. A single molex may not be enough.
>>>>
>>>> --
>>>> Szilárd
>>>>
>>>> On Tue, Jun 30, 2015 at 7:10 PM, Alex <nedoma...@gmail.com> wrote:
>>>>
>>>> It is a 4-core, single-GPU box, so I doubt I will be running more than
>>>> one at a time. We will very likely get a different PSU, unless... I do
>>>> have a molex-to-6-pin converter sitting on this very desk. Do you think
>>>> it will satisfy the card? I just don't know how much a single molex
>>>> line delivers. If you feel this should work, off to installing
>>>> everything I go.
>>>>
>>>> Thanks a bunch,
>>>> Alex
>>>>
>>>> SP> First of all, unless you run multiple independent simulations on
>>>> SP> the same GPU, GROMACS runs alone will never get anywhere near the
>>>> SP> peak power consumption of the GPU.
>>>>
>>>> SP> The good news is that NVIDIA has gained some sanity and stopped
>>>> SP> blocking GeForce GPU info in nvidia-smi - although only for newer
>>>> SP> cards, but it does work with the 960 if you use a 352.xx driver:
>>>> SP>
>>>> SP> +------------------------------------------------------+
>>>> SP> | NVIDIA-SMI 352.21     Driver Version: 352.21         |
>>>> SP> |-------------------------------+----------------------+----------------------+
>>>> SP> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>> SP> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>> SP> |===============================+======================+======================|
>>>> SP> |   0  GeForce GTX 960     Off  | 0000:01:00.0      On |                  N/A |
>>>> SP> |  8%   45C    P5    15W / 130W |   1168MiB /  2044MiB |     31%      Default |
>>>> SP> +-------------------------------+----------------------+----------------------+
>>>> SP>
>>>> SP> A single 6-pin can deliver 75W, an 8-pin 150W, so in your case the
>>>> SP> hard limit on what your card can pull is 75W from the PCI-E slot +
>>>> SP> 150W from the cable = 225W. With a single 6-pin cable you'll only
>>>> SP> get ~150W max. That can be OK if your card does not pull more power
>>>> SP> (e.g. the above non-overclocked card would be just fine), but as
>>>> SP> your card is overclocked, I'm not sure it won't peak above 150W.
>>>> SP>
>>>> SP> You can try to get a molex -> PCI-E power cable converter.
>>>> SP>
>>>> SP> --
>>>> SP> Szilárd
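To see what the card actually pulls during a run (assuming a 352.xx or newer driver that exposes these fields on GeForce, per the above), something like:

  # Sample power draw, utilization, and temperature once per second
  nvidia-smi --query-gpu=power.draw,utilization.gpu,temperature.gpu --format=csv -l 1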
>>>>
>>>> SP> On Mon, Jun 29, 2015 at 9:56 PM, Alex <nedoma...@gmail.com> wrote:
>>>>
>>>> >> Hi all,
>>>> >>
>>>> >> I have a bit of a gromacs-unrelated question here, but I think this
>>>> >> is a better place to ask it than, say, a gaming forum. The Nvidia
>>>> >> GTX 960 card we got here came with an 8-pin AUX connector on the
>>>> >> card side, which interfaces _two_ 6-pin connectors to the PSU. It is
>>>> >> a factory-superclocked card. My 525W PSU can only populate _one_ of
>>>> >> those 6-pin connectors. The EVGA website states that I need at least
>>>> >> a 400W PSU, while I have 525W.
>>>> >>
>>>> >> At the same time, I have a dedicated high-power PCI-e slot, which on
>>>> >> the motherboard says "75W PCI-e". Do I need a different PSU to
>>>> >> populate the AUX power connector completely? Are these runs
>>>> >> equivalent to drawing max power during gaming?
>>>> >>
>>>> >> Thanks!
>>>> >>
>>>> >> Alex

--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.