I took the best result from each version; that's why different algorithm numbers were chosen.
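(A quick way to see which barrier algorithm indices a given build offers is ompi_info; a minimal sketch, assuming ompi_info sits next to the mpirun used below and that the forced-algorithm parameters only show up once dynamic rules are enabled:)

# List the tuned-collective MCA parameters and pick out the barrier ones;
# the barrier_algorithm entry shows the valid index range for this build.
# Enabling dynamic rules first is assumed to be needed for these parameters
# to be registered on this version.
OMPI_MCA_coll_tuned_use_dynamic_rules=1 \
    /opt/openmpi-1.5.4/intel12/bin/ompi_info --param coll tuned | grep barrier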
I've studied the matter a bit further and here's what I got. With openmpi 1.5.4 these are the average times, in usec, per barrier algorithm number:

/opt/openmpi-1.5.4/intel12/bin/mpirun -x OMP_NUM_THREADS=1 -hostfile hosts_all2all_4 -npernode 32 --mca btl openib,sm,self -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_barrier_algorithm $i -np 128 openmpi-1.5.4/intel12/IMB-MPI1 -off_cache 16,64 -msglog 1:16 -npmin 128 barrier

0 - 71.78
3 - 69.39
6 - 69.05

If I pin the processes with the following script:

#!/bin/bash
s=$(($OMPI_COMM_WORLD_NODE_RANK))
numactl --physcpubind=$((s)) --localalloc openmpi-1.5.4/intel12/IMB-MPI1 -off_cache 16,64 -msglog 1:16 -npmin 128 barrier

then the results improve:

0 - 51.96
3 - 52.39
6 - 28.64

On openmpi-1.5.5rc3 without any binding the results are awful (14964.15 usec is the best). If I use the '--bind-to-core' flag, then the results are almost the same as in 1.5.4 with the binding script:

0 - 52.85
3 - 52.69
6 - 23.34

So almost everything seems to work fine now. The only problem left is that algorithm number 5 hangs.
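(For completeness, the thread doesn't show how the pinning script is launched; a sketch, assuming it is saved as an executable file named bind.sh — the name is only illustrative — and passed to the same mpirun line in place of IMB-MPI1:)

# Same mpirun invocation as above, but launching the wrapper instead of the
# benchmark binary; Open MPI sets OMPI_COMM_WORLD_NODE_RANK in each process's
# environment, and the wrapper uses it to bind that process to a core with
# numactl. $i is the barrier algorithm under test, as in the command above.
/opt/openmpi-1.5.4/intel12/bin/mpirun -x OMP_NUM_THREADS=1 \
    -hostfile hosts_all2all_4 -npernode 32 --mca btl openib,sm,self \
    -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_barrier_algorithm $i \
    -np 128 ./bind.sh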
2012/3/28 Jeffrey Squyres <jsquy...@cisco.com>

> FWIW:
>
> 1. There were definitely some issues with binding to cores and process layouts on Opterons that should be fixed in the 1.5.5 that was finally released today.
>
> 2. It is strange that the performance of barrier is so much different between 1.5.4 and 1.5.5. Is there a reason you were choosing different algorithm numbers between the two? (One of your command lines had "coll_tuned_barrier_algorithm 1", the other had "coll_tuned_barrier_algorithm 3".)
>
> On Mar 23, 2012, at 10:11 AM, Shamis, Pavel wrote:
>
> > Pavel,
> >
> > Mvapich implements multicore-optimized collectives, which perform substantially better than the default algorithms.
> > FYI, the ORNL team is working on a new high-performance collectives framework for OMPI. The framework provides a significant boost in collectives performance.
> >
> > Regards,
> >
> > Pavel (Pasha) Shamis
> > ---
> > Application Performance Tools Group
> > Computer Science and Math Division
> > Oak Ridge National Laboratory
> >
> > On Mar 23, 2012, at 9:17 AM, Pavel Mezentsev wrote:
> >
> > I've been comparing 1.5.4 and 1.5.5rc3 with the same parameters; that's why I didn't use --bind-to-core. I checked, and using --bind-to-core improved the result compared to 1.5.4:
> >
> > #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> >         1000        84.96        85.08        85.02
> >
> > So I guess with 1.5.5 the processes move from core to core within a node even though I use all cores, right? Then why does 1.5.4 behave differently?
> >
> > I need --bind-to-core in some cases, and that's why I need 1.5.5rc3 instead of the more stable 1.5.4. I know that I can use numactl explicitly, but --bind-to-core is more convenient :)
> >
> > 2012/3/23 Ralph Castain <r...@open-mpi.org>
> >
> > I don't see where you told OMPI to --bind-to-core. We don't automatically bind, so you have to explicitly tell us to do so.
> >
> > On Mar 23, 2012, at 6:20 AM, Pavel Mezentsev wrote:
> >
> >> Hello,
> >>
> >> I'm doing some testing with IMB and discovered a strange thing.
> >>
> >> Since I have a system with new AMD Opteron 6276 processors, I'm using 1.5.5rc3, since it supports binding to cores.
> >>
> >> But when I run the barrier test from the Intel MPI Benchmarks, the best I get is:
> >>
> >> #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> >>          598     15159.56     15211.05     15184.70
> >>
> >> (/opt/openmpi-1.5.5rc3/intel12/bin/mpirun -x OMP_NUM_THREADS=1 -hostfile hosts_all2all_2 -npernode 32 --mca btl openib,sm,self -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_barrier_algorithm 1 -np 256 openmpi-1.5.5rc3/intel12/IMB-MPI1 -off_cache 16,64 -msglog 1:16 -npmin 256 barrier)
> >>
> >> And with openmpi 1.5.4 the result is much better:
> >>
> >> #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> >>         1000       113.23       113.33       113.28
> >>
> >> (/opt/openmpi-1.5.4/intel12/bin/mpirun -x OMP_NUM_THREADS=1 -hostfile hosts_all2all_2 -npernode 32 --mca btl openib,sm,self -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_barrier_algorithm 3 -np 256 openmpi-1.5.4/intel12/IMB-MPI1 -off_cache 16,64 -msglog 1:16 -npmin 256 barrier)
> >>
> >> And still I couldn't come close to the result I got with mvapich:
> >>
> >> #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> >>         1000        17.51        17.53        17.53
> >>
> >> (/opt/mvapich2-1.8/intel12/bin/mpiexec.hydra -env OMP_NUM_THREADS 1 -hostfile hosts_all2all_2 -np 256 mvapich2-1.8/intel12/IMB-MPI1 -mem 2 -off_cache 16,64 -msglog 1:16 -npmin 256 barrier)
> >>
> >> I don't know if this is a bug or me doing something not the way I should. Is there a way to improve my results?
> >>
> >> Best regards,
> >> Pavel Mezentsev
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
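(Pulling the pieces of this thread together, a sweep over the tuned barrier algorithms on 1.5.5 with binding enabled might look like the sketch below. Paths, hostfile and flags are the ones used above; the 0-6 index range is an assumption, and algorithm 5 is skipped because it was reported to hang:)

#!/bin/bash
# Sketch: run the IMB barrier test once per tuned barrier algorithm with
# core binding enabled. Flags and paths are taken from this thread; the
# 0-6 index range is an assumption, and algorithm 5 (reported to hang) is
# left out of the list.
for i in 0 1 2 3 4 6; do
    echo "=== coll_tuned_barrier_algorithm=$i ==="
    /opt/openmpi-1.5.5rc3/intel12/bin/mpirun -x OMP_NUM_THREADS=1 \
        -hostfile hosts_all2all_4 -npernode 32 --bind-to-core \
        --mca btl openib,sm,self \
        -mca coll_tuned_use_dynamic_rules 1 \
        -mca coll_tuned_barrier_algorithm "$i" \
        -np 128 openmpi-1.5.5rc3/intel12/IMB-MPI1 \
        -off_cache 16,64 -msglog 1:16 -npmin 128 barrier
done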