Hi Ivan and Axel,

The computer on which I did see a slight speed-up was an XE6; the other
machine is not an XE6 (I'm not sure which vendor supplied it). Once I have
access to that computer again I will try your suggestions, but since, as
you said, there may not be much of a performance increase, I probably won't
spend too much time on it. The Xeon machine has a limit of about 64 cores,
so I probably won't encounter a communications bottleneck there (whereas on
the XE6 I can use many more).

Thank you for your time and help.

All the best,
Ben
On Sat, Nov 9, 2013 at 12:54 PM, Axel Kohlmeyer <akohlmey at gmail.com> wrote:
> On Sat, Nov 9, 2013 at 1:14 PM, Ivan Girotto <igirotto at ictp.it> wrote:
> > Dear Ben,
> >
> > I'm afraid you are packing all processes within a node onto the same
> > socket (-bind-to-socket). My recommendation is to use the following:
> > -cpus-per-proc 2 -bind-to-core.
> >
> > However, for the pw.x code there is not much expectation of better
> > performance on the Intel Xeon architecture from MPI+OpenMP until
> > communication becomes a serious bottleneck. Indeed, parallel work
> > distribution among MPI processes generally scales better.
>
> ... and with current hardware, even keeping cores idle can be beneficial
> to the overall performance.
>
> There is a very nice discussion of the various parallelization options
> in the QE user's guide. For almost all "normal" machines, OpenMP by
> itself is inferior to all the other available parallelization options,
> so you should exploit those first. Exceptions are "unusual" machines
> like IBM's BlueGene architecture or Cray's XT/XE machines, and cases
> where you need to parallelize to an extreme number of processors; at
> *that* point, leaving some cores idle and/or using OpenMP is indeed
> helpful to squeeze out a little extra performance.
>
> ciao,
> axel.
>
> P.S.: With OpenMP on x86 you should also experiment with the
> OMP_WAIT_POLICY environment variable. Most OpenMP implementations use
> the ACTIVE policy, which implies busy waiting and would theoretically
> lower latency, but the alternative PASSIVE policy can be more efficient,
> especially when you leave one core per block of threads idle. Remember
> that on regular machines, threads have to compete with other processes
> on the machine for access to time slices in the scheduler. With busy
> waiting, the time slices are always fully consumed, even if no work is
> being done.
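[Taking Ivan's binding recommendation and Axel's OMP_WAIT_POLICY tip
together, the launch might be set up as in the sketch below. The process
counts and executable name are copied from Ben's script and are assumptions
about his setup; the flag spellings are those of the Open MPI 1.6 series he
loads, and newer Open MPI releases express the same layout with --map-by
and --bind-to instead.]

```shell
#!/bin/sh
# Sketch only: combine the binding and wait-policy suggestions from the
# thread. Values below are assumptions taken from Ben's original script.

export OMP_NUM_THREADS=2        # 2 OpenMP threads per MPI rank
export OMP_WAIT_POLICY=PASSIVE  # idle threads yield their time slice
                                # instead of busy-waiting (Axel's tip)

# Ivan's suggestion: give each rank 2 cores and bind to cores, rather
# than packing all ranks of a node onto one socket with -bind-to-socket.
MPIRUN_FLAGS="-np 16 -npernode 8 -cpus-per-proc 2 -bind-to-core"

# Print the command rather than running it, so the sketch works without
# an MPI installation; on the cluster one would execute it directly.
echo "mpiexec $MPIRUN_FLAGS -x OMP_NUM_THREADS -x OMP_WAIT_POLICY pw_openmp_5.0.2.x -in benchmark2.in"
```

With PASSIVE waiting, a thread that reaches a barrier gives its core back
to the scheduler, which matters most when ranks and threads outnumber the
cores left free for them.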
> Calling sched_yield(), as is implied by the PASSIVE mode, quickly
> releases the time slice and lets other processes do work, which in turn
> increases the probability that your thread will be scheduled again more
> quickly, which in turn can significantly reduce latencies at implicit or
> explicit synchronization points. If all this sounds like Greek to you,
> then you should definitely follow the advice in the QE user's guide and
> avoid OpenMP... ;-)
>
> > Regards,
> >
> > Ivan
> >
> > On 08/11/2013 13:45, Ben Palmer wrote:
> >
> > Hi Everyone,
> >
> > (apologies if this has been sent twice)
> >
> > I have compiled QE 5.0.2 on a computer with AMD Interlagos processors,
> > using the ACML, compiling with OpenMP enabled, and submitting jobs
> > with PBS. I've had a speed-up using 2 OpenMP threads per MPI process.
> >
> > I've been trying to do the same on another computer, which has MOAB as
> > the scheduler, E5-series Xeon processors (E5-2660), and uses the Intel
> > MKL. I'm pretty sure hyperthreading has been turned off, as each node
> > has two sockets and 16 cores in total.
> >
> > I've seen a slowdown in performance using OpenMP and MPI, but have
> > read that this might be the case in the documentation. I'm waiting in
> > the computer's queue to run the following:
> >
> > #!/bin/bash
> > #MOAB -l "nodes=2:ppn=16"
> > #MOAB -l "walltime=0:01:00"
> > #MOAB -j oe
> > #MOAB -N pwscf_calc
> > #MOAB -A readmsd02
> > #MOAB -q bbtest
> > cd "$PBS_O_WORKDIR"
> > module load apps/openmpi/v1.6.3/intel-tm-ib/v2013.0.079
> > export PATH=$HOME/bin:$PATH
> > export OMP_NUM_THREADS=2
> > mpiexec -np 16 -x OMP_NUM_THREADS=2 -npernode 8 -bind-to-socket \
> >     -display-map -report-bindings pw_openmp_5.0.2.x \
> >     -in benchmark2.in > benchmark2c.out
> >
> > I just wondered if anyone had any tips on the settings or flags for
> > hybrid MPI/OpenMP with the E5 Xeon processors?
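[Applying Ivan's -cpus-per-proc 2 -bind-to-core recommendation to Ben's
job script would give something like the fragment below. The module name,
account, queue, and file names are copied from Ben's script; this is a
sketch of a scheduler job, not a tested submission, and it assumes the
Open MPI 1.6.x flag names available from the loaded module.]

```shell
#!/bin/bash
#MOAB -l "nodes=2:ppn=16"
#MOAB -l "walltime=0:01:00"
#MOAB -j oe
#MOAB -N pwscf_calc
#MOAB -A readmsd02
#MOAB -q bbtest
cd "$PBS_O_WORKDIR"
module load apps/openmpi/v1.6.3/intel-tm-ib/v2013.0.079
export PATH=$HOME/bin:$PATH
export OMP_NUM_THREADS=2
# PASSIVE waiting (Axel's P.S.) so idle threads yield rather than spin.
export OMP_WAIT_POLICY=PASSIVE
# Bind each rank to 2 cores instead of packing a node's ranks onto one
# socket; -x forwards the environment variables to the remote ranks.
mpiexec -np 16 -npernode 8 -cpus-per-proc 2 -bind-to-core \
    -x OMP_NUM_THREADS -x OMP_WAIT_POLICY \
    -display-map -report-bindings \
    pw_openmp_5.0.2.x -in benchmark2.in > benchmark2c.out
```

With 8 ranks per 16-core node and 2 threads per rank, every core gets
exactly one thread, which is the layout Ivan's flags describe.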
> >
> > All the best,
> >
> > Ben Palmer
> > Student @ University of Birmingham, UK
>
> --
> Dr. Axel Kohlmeyer  akohlmey at gmail.com  http://goo.gl/1wk0
> International Centre for Theoretical Physics, Trieste. Italy.

_______________________________________________
Pw_forum mailing list
Pw_forum at pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum
