Thanks a lot for such in-depth insight! I will look into all of this and get started ASAP!
On Wed, Sep 23, 2015 at 1:29 AM, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> Hi,
>
> On Mon, Sep 21, 2015 at 7:10 PM Sabyasachi Sahoo <ssahoo.i...@gmail.com> wrote:
>
> > Dear GROMACS users and developers,
> >
> > I am a parallel-programming researcher and would like to contribute to the GROMACS molecular dynamics software by helping to nail down any bottlenecks that occur when scaling the software across multiple CPUs (and, ideally, to improve performance in the exascale era). I am already collecting profiling results to identify the phases that can be improved (for GROMACS and a few other MD packages).
>
> Welcome! (And good luck... optimizing software that has had over 10 years of performance-focused effort is... challenging!)
>
> Profiling is itself problematic. Assuming one can get GROMACS to build with whatever constraints the tool adds, and can benchmark on a machine where you do not suffer contention for network resources, one needs to produce more useful information than can already be obtained at the end of the .log file. Such data is most interesting only near the strong-scaling limit (e.g. < 100 particles per CPU core) and thus at very short per-MD-step times (several wallclock milliseconds). Neither function instrumentation nor sampling tends to work well in this regime. But we'd love to be surprised :-)
>
> > Hence, I would request all of you to please suggest some possible areas of research we could work on for better scaling of GROMACS (and/or MD software in general). Going through the official website documentation makes me realise that implementing a truly parallel FFT (or making it scale better) in GROMACS would be truly helpful.
>
> Yes and no. The FFT is intrinsically global, and most of the time spent on the PME component of MD goes into organizing the data that feeds the FFT, rather than into the transform itself. The current implementation of spreading charges onto the FFT grid is known to scale poorly across increasing numbers of OpenMP cores within an MPI rank. That would be a high-impact problem to fix - but start by getting a thorough understanding of the algorithm before looking at the code (because the form of the code will not help you understand anything). Some profiling with well-targeted function instrumentation could be worthwhile here.
>
> Because of this, at scale it is often necessary to use a subset of the MPI ranks to handle the PME part, MPMD style (see the reference manual and/or the GROMACS 4 paper). However, the current implementation requires the user to choose that division in advance, and without running an external optimizer (like gmx tune_pme) the choice is difficult, because there also has to be a PME domain decomposition, plus an extra communication phase mapping from one DD to the other, and that can't be efficient unless various constraints are satisfied... Approaches that would take such variables out of user space could be quite useful here.
> For a trivial example, it might be best
> * to interleave PME ranks with PP ranks, to spread them out over the network to maximize communication bandwidth when doing the all-to-all for the 3D FFT, and hopefully minimize communication latency when doing the transformation from the PP DD to the PME DD by having the ranks very close, or
> * to pack PME ranks together on the network to minimize external network contention during the all-to-all, but to do so in a way that doesn't lose all the benefits by instead taking the latency hit at the PP<->PME stages...
> Currently the user has to choose one of these "up front". The latter works well only in the presence of knowledge about the network topology, which is unavailable until someone adds (e.g.) netloc support.
>
> Replacing our home-grown hardware detection support with hwloc support would perhaps be a higher reward for the effort, however.
>
> Avoiding PME entirely is another avenue (but there are already two fast-multipole projects running here).
>
> > The latest paper on GROMACS 5.0 concludes by saying that an algorithm that preempts fine-grained tasks based on priority can lead to improvements. I am also trying to look into it and would like to know your take on this.
>
> We don't think the current approach of
> 1. spread this task over n threads per rank, then
> 2. do something serial, then
> 3. spread the next task over the same n threads per rank, then
> 4. do something serial, then
> 5. spread the next task over the same n threads per rank,
> ... and so on
> is going to cut it in the long term. You need to be able, e.g., to fire off the necessary PME communication, then go back to doing bonded work, then 20 microseconds later when the communication arrives drop everything to handle the next phase of PME ASAP, then go back to the bondeds, but preferably without trashing all your cache in the meantime, etc. But there's a lot of boring code transformation that has to happen before we can do much about it. Current thinking is that we want to move in the direction of encapsulated tasks that we might be able to handle with a custom TBB thread scheduler in a way that's automatic and efficient.
>
> Mark
>
> > You could also direct me to the relevant link on the website, or to any person concerned with this. You could also point me to any link in the developer zone that I might have missed. Any further insight into the matter would be really appreciated.
> >
> > Thanks in advance.
> >
> > --
> > Yours sincerely,
> > Sabyasachi Sahoo
> > Supercomputer Education & Research Center
> > Indian Institute of Science - Bangalore
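To make a few of the points above concrete, some notes and toy sketches follow. These are my own illustrations, not GROMACS code, and anything not stated in the mail above should be read as my assumption.

On the up-front PP/PME division Mark describes: as far as I can tell, these are the choices mdrun currently exposes through its -npme option (number of separate PME ranks) and -ddorder option (interleave / pp_pme / cartesian rank ordering), with gmx tune_pme available as the external optimizer that searches over -npme values.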
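On the point that spreading charges onto the FFT grid scales poorly over the OpenMP cores of a rank, here is a toy C++/OpenMP sketch of the underlying pattern (not GROMACS code; the grid size, atom count and spreading stencil are made up). Threads spread onto overlapping grid cells, so one either pays for atomic updates or, as below, for per-thread grids plus a reduction whose cost grows with the thread count.

    // Toy illustration of PME-style charge spreading with OpenMP.
    // Compile with: g++ -fopenmp spread_toy.cpp   (hypothetical file name)
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const int nGrid  = 32 * 32 * 32;   // flattened 3D grid (toy size)
        const int nAtoms = 10000;
        std::vector<float> grid(nGrid, 0.0f);

        const int nThreads = omp_get_max_threads();

        // One private copy of the grid per thread avoids write races while spreading.
        std::vector<std::vector<float>> privateGrid(
                nThreads, std::vector<float>(nGrid, 0.0f));

    #pragma omp parallel for
        for (int i = 0; i < nAtoms; ++i) {
            std::vector<float>& g = privateGrid[omp_get_thread_num()];
            // Stand-in for spreading atom i's charge onto a small neighbourhood
            // of grid points; real PME uses B-spline weights over order^3 cells.
            int base = (i * 64) % (nGrid - 64);
            for (int k = 0; k < 64; ++k) {
                g[base + k] += 1.0f;
            }
        }

        // Reducing the per-thread grids costs work proportional to
        // nThreads * nGrid - this part eats into OpenMP scaling.
        for (int t = 0; t < nThreads; ++t) {
            for (int j = 0; j < nGrid; ++j) {
                grid[j] += privateGrid[t][j];
            }
        }

        std::printf("grid[0] = %g with %d threads\n", grid[0], nThreads);
        return 0;
    }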
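On replacing the home-grown hardware detection with hwloc: a minimal sketch of what an hwloc topology query looks like (standard hwloc C API, callable from C++; link with -lhwloc). This only shows the kind of information hwloc reports, not how GROMACS would consume it.

    // Minimal hwloc topology query.
    // Compile with: g++ detect_toy.cpp -lhwloc   (hypothetical file name)
    #include <cstdio>
    #include <hwloc.h>

    int main() {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        // Counts of packages, physical cores and hardware threads (PUs).
        int packages = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
        int cores    = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        int pus      = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);

        std::printf("packages=%d cores=%d hardware threads=%d\n",
                    packages, cores, pus);

        hwloc_topology_destroy(topo);
        return 0;
    }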
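And on the task-based direction Mark outlines: a toy tbb::task_group sketch (my own illustration, not the proposed GROMACS design) of breaking bonded work into many small tasks and launching the PME work as soon as its - here simulated - communication arrives. Plain TBB offers no task priorities or preemption for this, which is exactly why a custom scheduler is mentioned above; the busy-waiting watcher task is only for illustration.

    // Toy task-based overlap of "bonded" work and a "PME" task.
    // Compile with: g++ -std=c++11 tasks_toy.cpp -ltbb   (hypothetical file name)
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <tbb/task_group.h>

    int main() {
        std::atomic<bool> pmeCommDone{false};
        tbb::task_group tg;

        // Simulate the PME communication arriving ~20 microseconds later.
        std::thread comm([&] {
            std::this_thread::sleep_for(std::chrono::microseconds(20));
            pmeCommDone = true;
        });

        // Many small bonded tasks instead of one big fork-join region;
        // fine granularity is what lets the PME task start with low latency.
        for (int chunk = 0; chunk < 64; ++chunk) {
            tg.run([chunk] {
                volatile double x = 0.0;
                for (int i = 0; i < 10000; ++i) {
                    x += chunk * 1e-6 + i;   // stand-in for bonded interactions
                }
            });
        }

        // Watcher task: once the communication has arrived, spawn the PME work.
        tg.run([&] {
            while (!pmeCommDone) {
                std::this_thread::yield();
            }
            tg.run([] { std::printf("PME grid phase running\n"); });
        });

        tg.wait();
        comm.join();
        std::printf("step done\n");
        return 0;
    }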
--
Yours sincerely,
Sabyasachi Sahoo
M. Tech - Computational Science
Supercomputer Education & Research Center
Indian Institute of Science - Bangalore