In addition to Mark's comments, let me ask/add a couple of things. What was your benchmarking procedure for core counts that represent less than a full socket? Besides the thread affinity issue Mark mentioned, clock frequency scaling (boost) can also distort performance plots: unless it is explicitly turned off in the BIOS/firmware, you will observe artificially high performance at small core counts, which makes the scaling look inherently worse. This effect can be further amplified by the lower cache traffic when a multicore CPU is only partially used. Both are artificial effects that you won't see in real-world runs - unless you leave a bunch of cores idle. While there is no single "right way" to avoid these issues, there certainly are ways to present data in a less than useful manner - especially when it comes to scaling plots. A simple way to avoid them, and to eliminate the potential for misleading strong-scaling plots, is to start from at least a full socket (or node). Otherwise, IMO the <8-thread data points on your plot only make sense if you show strong scaling to multiple sockets/nodes using the same number of threads per socket as you started with, leaving the rest of the cores free.
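To illustrate what I mean by using a socket as the baseline, here is a tiny sketch (the ns/day numbers are made up purely for illustration, substitute your own measurements) that tabulates speedup and parallel efficiency relative to one full 16-core socket; printing the raw ns/day alongside also addresses Mark's point about throughput.

# Minimal sketch: strong-scaling table normalized to one full 16-core socket
# instead of a single core. The (cores, ns/day) pairs below are made-up
# numbers, only here to show the arithmetic - replace with your own data.
measurements = [
    (16, 12.0),    # 1 socket (baseline)
    (32, 22.8),    # 1 node
    (64, 41.5),    # 2 nodes
    (128, 70.3),   # 4 nodes
]

base_cores, base_perf = measurements[0]

print("%6s %8s %8s %6s" % ("cores", "ns/day", "speedup", "eff."))
for cores, perf in measurements:
    speedup = perf / base_perf
    ideal = float(cores) / base_cores
    efficiency = speedup / ideal
    print("%6d %8.1f %8.2f %5.0f%%" % (cores, perf, speedup, 100.0 * efficiency))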
What run configuration did you use for Verlet on a single node? With the Verlet scheme, runs without domain decomposition, that is, multithreading-only (OpenMP) runs, are typically more efficient than runs using domain decomposition. This typically holds up to a full socket, and quite often even across two Intel sockets. Did you tune the PME performance, i.e. the number of separate PME ranks? Did you use nstlist=40 for all Verlet data points? That may not be optimal across all node counts, especially on fewer than two nodes, but of course that's hard to tell without trying! (See the rough sketch below the quoted mail for the kind of single-node comparison I mean.)

Finally, looking at the octanol Verlet plot, especially in comparison with the water plot, what is strange is that the scaling efficiency is much worse than for water and varies quite a lot between neighboring data points. This suggests that something was not entirely right with those runs.

Cheers,
--
Szilárd

On Fri, Sep 19, 2014 at 1:35 PM, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> On Fri, Sep 19, 2014 at 2:50 AM, Dallas Warren <dallas.war...@monash.edu> wrote:
>
>> Some scaling results that might be of interest to some people.
>>
>> Machine = Barcoo @ VLSCI
>> 2.7GHz Intel Sandybridge cores
>> 256GB RAM
>> 16 cores per node
>> Mellanox FDR14 InfiniBand switch
>>
>> Systems = water and octanol only with GROMOS53a6
>>
>> # Atoms = 10,000 to 1,000,000
>>
>> Comparison: Group versus Verlet neighbour searching
>>
>> Image/graphs see https://twitter.com/dr_dbw/status/512763354566254592
>>
>> Basically, group neighbour searching for this setup is faster and scales
>> better than Verlet. Was expecting that to be the case with water, since
>> it is mentioned somewhere that is the case. However, for the pure octanol
>> system I was expecting it to be the other way around?
>
> Thanks for sharing. Since the best way to write code that scales well is
> to write code that runs slowly, we generally prefer to look at raw ns/day.
> Choosing between perfect scaling of implementation A at 10 ns/day and
> imperfect scaling of implementation B starting at 50 ns/day is a
> no-brainer, but only if you know the throughput.
>
> I'd also be very suspicious of your single-core result, based on your
> super-linear scaling. When using a number of cores smaller than a node,
> you need to take care to pin that thread (mdrun -pin on) and to make sure
> no other processes are also running on that core/node. If that result is
> noisy because it ran into different other stuff over time, then every
> "scaling" data point is affected.
>
> Also, to observe the scaling benefits of the Verlet scheme, you have to
> get involved with using OpenMP as the core count gets higher, since the
> whole point is that it permits more than one core to share the work of a
> domain, and the (short-ranged part of the) group scheme hasn't been
> implemented to do that. Since you don't mention OpenMP, you're probably
> not using it ;-) Similarly, the group scheme is unbuffered by default, so
> it's an apples-and-oranges comparison unless you state what buffer you
> used there.
>
> Cheers,
>
> Mark
>
>> Catch ya,
>>
>> Dr. Dallas Warren
>> Drug Delivery, Disposition and Dynamics
>> Monash Institute of Pharmaceutical Sciences, Monash University
>> 381 Royal Parade, Parkville VIC 3052
>> dallas.war...@monash.edu
>> +61 3 9903 9304
>> ---------------------------------
>> When the only tool you own is a hammer, every problem begins to resemble
>> a nail.
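PS: here is the rough sketch I referred to above - a single-node sweep over a few run setups (OpenMP-only vs. domain decomposition, with and without separate PME ranks) that reads ns/day back from the log. It assumes a thread-MPI mdrun on one 16-core node; the binary name ("mdrun" here, "gmx mdrun" with 5.0), the topol.tpr input, the step count and the candidate settings are just placeholders to adjust for your setup, and g_tune_pme can do the PME-rank part more thoroughly.

# Rough sketch of a single-node comparison of run setups. Everything here
# (input file, step count, candidate settings) is a placeholder.
import subprocess

TPR = "topol.tpr"      # placeholder input
MDRUN = ["mdrun"]      # or ["gmx", "mdrun"] with 5.0

configs = [
    ("omp16",      ["-ntmpi", "1",  "-ntomp", "16"]),               # OpenMP only
    ("dd16_npme0", ["-ntmpi", "16", "-ntomp", "1", "-npme", "0"]),  # DD, no separate PME
    ("dd16_npme4", ["-ntmpi", "16", "-ntomp", "1", "-npme", "4"]),  # 12 PP + 4 PME ranks
]

def ns_per_day(logfile):
    # ns/day is the second-to-last column of the "Performance:" line in the log
    with open(logfile) as f:
        for line in f:
            if line.startswith("Performance:"):
                return float(line.split()[-2])
    return None

for label, opts in configs:
    log = label + ".log"
    cmd = MDRUN + ["-s", TPR, "-pin", "on", "-resethway",
                   "-nsteps", "20000", "-g", log] + opts
    subprocess.check_call(cmd)
    print(label, ns_per_day(log), "ns/day")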