On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> On Tuesday 19 May 2009, Roman Martonak wrote:
> ...
>> openmpi-1.3.2 time per one MD step is 3.66 s
>> ELAPSED TIME : 0 HOURS 1 MINUTES 25.90 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 7.802 MB/S 55.200 SEC =
> ...
>> mvapich-1.1.0 time per one MD step is 2.55 s
>> ELAPSED TIME : 0 HOURS 1 MINUTES 0.65 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 14.815 MB/S 29.070 SEC =
> ...
>> Intel MPI 3.2.1.009 time per one MD step is 1.58 s
>> ELAPSED TIME : 0 HOURS 0 MINUTES 38.16 SECONDS
>> = ALL TO ALL COMM 102033. BYTES 4221. =
>> = ALL TO ALL COMM 38.696 MB/S 11.130 SEC =
> ...
>> Clearly the whole difference is basically in the ALL TO ALL COMM time.
>> Running on 1 blade (8 cores) all three MPI implementations have a very
>> similar time per step of about 8.6 s.
>
> My guess is that what you see is the difference in MPI_Alltoall performance
> for the different MPI implementations (running in your env. on your hw.).
>
> You could write a trivial loop like this and try it on the three MPIs:
>
>   MPI_Init
>   for i in 1 to 4221
>     MPI_Alltoall(size=102033, ...)
>   MPI_Finalize
>
> And time it to confirm this.
>
>> For CPMD I found that using the keyword TASKGROUP,
>> which introduces a different way of parallelization, it is possible to
>> improve on the openmpi time substantially and lower the time from 3.66
>> s to 1.67 s, almost to the value found with Intel MPI.
>
> I guess this changed what kind of communication is done and you no longer have
> to do 4221x 100 kbyte alltoalls, which seems to hurt OpenMPI so much.

With TASKGROUP=2 the summary looks as follows:

 CPU TIME : 0 HOURS 0 MINUTES 42.09 SECONDS
 ELAPSED TIME : 0 HOURS 0 MINUTES 44.01 SECONDS
 *** CPMD| SIZE OF THE PROGRAM IS 73532/ 322740 kBYTES ***

 PROGRAM CPMD ENDED AT: Tue May 19 11:16:18 2009

 ================================================================
 = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
 = SEND/RECEIVE                  8585. BYTES             48447. =
 = BROADCAST                    19063. BYTES               396. =
 = GLOBAL SUMMATION            103463. BYTES               372. =
 = GLOBAL MULTIPLICATION            0. BYTES                 1. =
 = ALL TO ALL COMM             231821. BYTES              4221. =
 = PERFORMANCE                                       TOTAL TIME =
 = SEND/RECEIVE                193.459 MB/S           2.150 SEC =
 = BROADCAST                    10.785 MB/S           0.700 SEC =
 = GLOBAL SUMMATION            339.605 MB/S           0.680 SEC =
 = GLOBAL MULTIPLICATION         0.000 MB/S           0.001 SEC =
 = ALL TO ALL COMM              82.716 MB/S          11.830 SEC =
 = SYNCHRONISATION                                    2.360 SEC =
 ================================================================

Roman
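
P.S. As a quick sanity check on the numbers in this thread, here is a small Python sketch. It assumes (this is a guess, not taken from the CPMD documentation) that the MB/S column is simply average message length times number of calls, divided by total time, with 1 MB = 10**6 bytes. It also estimates how much of the elapsed-time gap between openmpi and Intel MPI is accounted for by the ALL TO ALL COMM time. The dictionary and function names are made up for illustration; all figures are copied from the summaries above.

```python
# Figures copied from the CPMD summaries quoted in this thread.
# Each entry: (avg message length in bytes, number of calls,
#              total ALL TO ALL time in s, reported MB/s)
ALLTOALL = {
    "openmpi-1.3.2": (102033, 4221, 55.200, 7.802),
    "mvapich-1.1.0": (102033, 4221, 29.070, 14.815),
    "Intel MPI 3.2.1.009": (102033, 4221, 11.130, 38.696),
    "openmpi, TASKGROUP=2": (231821, 4221, 11.830, 82.716),
}


def throughput_mb_s(avg_bytes, calls, seconds):
    """Assumed formula: total bytes / total time, with 1 MB = 10**6 bytes."""
    return avg_bytes * calls / seconds / 1e6


for name, (avg_bytes, calls, seconds, reported) in ALLTOALL.items():
    computed = throughput_mb_s(avg_bytes, calls, seconds)
    print(f"{name:28s} computed {computed:7.3f} MB/s, reported {reported:7.3f} MB/s")

# Elapsed times (s) from the same summaries: 1 min 25.90 s and 38.16 s.
elapsed_openmpi, elapsed_intel = 85.90, 38.16

# Fraction of the elapsed-time gap that the ALL TO ALL COMM time covers.
alltoall_share = (55.200 - 11.130) / (elapsed_openmpi - elapsed_intel)
print(f"ALL TO ALL share of the openmpi vs Intel MPI gap: {alltoall_share:.0%}")
```

Under that assumption the computed rates agree with the reported MB/S values to within about 0.01 MB/s for all four runs, and the ALL TO ALL COMM time covers roughly 92% of the elapsed-time gap between openmpi and Intel MPI, which fits the statement that the difference is basically in the all-to-all.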