Hi Szilard,

Here is an update for you and anyone else following this. It turns out that the 
primary problem was a single hanging process I had not noticed, which was taking 
up a core and throwing everything off. Once I made sure the server was 
completely clear of other processes, the default parameters gave me 30 ns/day, 
up from 2 ns/day. I’m actually glad I made this mistake, because otherwise I 
would probably have been content with that speed and never learned about the 
best way to use the various threading parameters or the application clock 
increase on the GPU, so thank you and know that your assistance was not in 
vain! I then tried different combinations of the threading parameters and 
-ddorder and settled on:

gmx mdrun -deffnm md_4-5_3 -v -ntmpi 16 -gpu_id 00000000 -npme 8 -ddorder pp_pme
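
(As an aside for anyone who hits the same symptom: before benchmarking I now 
double-check that nothing else is occupying a core. On a standard Linux box, 
something along these lines is enough to spot a stray CPU-bound process; adjust 
to taste:

ps -eo pid,pcpu,comm --sort=-pcpu | head   # top CPU consumers, highest first
top -b -n 1 | head -n 20                   # or a one-shot top snapshot

Anything sitting near 100% CPU that is not mdrun is a candidate for killing 
before the run.)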

This combination, together with the increased application clock on my GPU, is 
churning out almost 50 ns/day for my system. That’s really fantastic, and I’m 
very grateful to you for your help.
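
In case it helps anyone else with a K40: raising the application clocks is a 
one-liner with nvidia-smi. Something along these lines should work (it needs 
root, and 3004/875 MHz is the maximum pairing the K40 supports; check 
nvidia-smi -q -d SUPPORTED_CLOCKS for what your own card accepts):

nvidia-smi -pm 1          # persistence mode, so the clock setting is not reset
nvidia-smi -ac 3004,875   # application clocks: memory,graphics in MHz
nvidia-smi -q -d CLOCK    # verify that the new clocks took effect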

Best regards,
Jason

> Hi Jason,
> 
> Good point; separate PME ranks may very well be able to help in this
> case. I typically use half of the ranks for PME with AMD CPU-based
> machines (from 3-4 sockets and above).
> 
> However, based on your log file something is still not right, PME is
> barely faster than with 64 OpenMP threads (59 vs 87 ms/step) and it's
> most likely the lack of pinning that leads to bad performance.
> 
> Try the following:
> gmx mdrun -ntmpi 16 -ntomp 4 -npme 8
> gmx mdrun -ntmpi 32 -ntomp 2 -npme 16
> 
> And additionally, do try the -ddorder pp_pme option; this will bring
> your PME ranks closer to each other, possibly even keeping them within a
> socket.
> 
> Cheers,
> --
> Szilárd
> 
> 
> On Wed, Dec 17, 2014 at 2:20 PM, Jason Hill <jason.h...@zoologi.su.se> wrote:
>> Hi Szilard and list,
>> 
>> Thanks for the response. First, I experimented further with the MPI thread 
>> number. Optimal performance was reached when I used 24 MPI ranks and 
>> defined 12 of those to be used for PME only. This resulted in fewer threads 
>> than logical cores, with pinning turned off. Even though I got a warning to 
>> that effect, performance still increased 33% and now I am simulating 
>> ~3ns/day on a 90,000 atom system. Using 16 or 32 mpi ranks cuts performance 
>> in half, and I notice that the automatic PME mesh size stays much larger. 
>> Can someone please explain how these might be related? If I try to set the 
>> domain decomposition manually through mdrun -dd, I can't choose a value that 
>> it seems to land on automatically; for example, 72 72 72 is said to be too small 
>> when the PME mesh ends up there automatically anyway. Am I misunderstanding 
>> PME mesh size vs domain decomposition?
>> 
>> Second, I increased the GPU clock speed to its maximum of 875MHz, but saw no 
>> improvement. In fact, monitoring GPU usage showed that it never exceeded 
>> 20%! I'm somewhat at a loss for how I can further optimize my run, and more 
>> efficiently use my GPU. Any further pointers here would be much appreciated.
>> 
>> The log file for the latest run is here: 
>> https://drive.google.com/file/d/0BwAaTxAET7c5bXE2bUtfWVp4dE0/view?usp=sharing
>> 
>> Best regards,
>> Jason
>> 
>> Jason Hill, Ph.D.
>> Wheat Lab
>> Zoologiska Institutionen
>> Stockholms Universitet
>> D-419 Svante Arrhenius v 18B
>> S-10691 Stockholm Sweden
>> 
>>> Date: Tue, 16 Dec 2014 19:05:52 +0100
>>> From: Szilárd Páll <pall.szil...@gmail.com>
>>> To: Discussion list for GROMACS users <gmx-us...@gromacs.org>
>>> Subject: Re: [gmx-users] Trouble balancing GPU/CPU force calculation
>>>      load, ratio = 0.09
>>> Message-ID:
>>>      <CANnYEw5+=XzaZf8KadoLyHs=rfgzw+pp4w-myn9mgaao6fg...@mail.gmail.com>
>>> Content-Type: text/plain; charset=UTF-8
>>> 
>>> Even 4 ranks x 16 threads is too much for AMD CPUs! In my experience
>>> the optimal is typically 2-8 threads/rank (depending on DD / imbalance
>>> behavior), so I suggest that you try these lower thread/rank counts.
>>> 
>>> Also, make sure that the application clocks are set to max on that
>>> K40, otherwise you're missing 20% GPU performance!
>>> 
>>> --
>>> Szilárd
>>> 
>>> 
>>> On Mon, Dec 15, 2014 at 12:57 PM, Carsten Kutzner <ckut...@gwdg.de> wrote:
>>>> Hi,
>>>> 
>>>> from the log file it seems that you were actually using 64 OpenMP threads.
>>>> This is not very efficient; you could try to start mdrun with 4 thread-MPI
>>>> ranks (instead of 1), e.g.
>>>> 
>>>> mdrun -ntmpi 4 -gpu_id 0000 -s …
>>>> 
>>>> Could it be that another process was running on your node while you
>>>> ran the simulation?
>>>> 
>>>> Carsten
>>>> 
>>>> 
>>>> On 15 Dec 2014, at 12:45, Jason Hill <jason.h...@zoologi.su.se> wrote:
>>>> 
>>>>> Hello list,
>>>>> 
>>>>> I am simulating a protein in water and am concerned that I am not using 
>>>>> my hardware to the best of its abilities. Here 
>>>>> (https://drive.google.com/file/d/0BwAaTxAET7c5VkZERkFsa1cyRlk/view?usp=sharing)
>>>>>  is the log file from a 1 nanosecond simulation. The only piece of 
>>>>> information missing from it that may be of use is that I am using the 
>>>>> OPLS/AA force field. Additionally, GROMACS only seems to be using 8-12 
>>>>> cores of the 64 available despite its complaint that the GPU is being 
>>>>> underutilized. Please take a look and if you can, give me some advice 
>>>>> about improving my simulation efficiency.
>>>>> 
>>>>> Best regards,
>>>>> Jason
>>>>> 
