Unless you are using mxm, you can disable the tcp BTL with mpirun --mca pml ob1 --mca btl ^tcp ...
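For example, a full command line might look like this ("hosts" and "./allreduce_bench" are placeholder names, not from this thread):

    # Force the ob1 PML and exclude the tcp BTL, so inter-node traffic uses
    # the InfiniBand fabric rather than Ethernet
    mpirun --mca pml ob1 --mca btl ^tcp -np 96 --hostfile hosts ./allreduce_bench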
coll/tuned selects an algorithm based on communicator size and message size. The spike could occur because an algorithm that is suboptimal on your cluster, with your job topology, is selected. Note that you can force a particular algorithm, or redefine the algorithm selection rules (see the sketches after the quoted thread below).

Cheers,

Gilles

Cooper Burns <cooper.bu...@convergecfd.com> wrote:
> Ok, I tried that (sorry for the delay... network issues killed our cluster).
>
> Setting the env variable you suggested changed the results, but all it did was move the run-time spike from between 4 MB and 8 MB to between 32 KB and 64 KB.
>
> The nodes I'm running on have InfiniBand, but I think I am running on Ethernet for these tests.
>
> Any other ideas?
>
> Thanks!
>
> Cooper Burns
> Senior Research Engineer
> (608) 230-1551
> convergecfd.com
>
> On Tue, Sep 19, 2017 at 3:44 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>> Hello Cooper,
>>
>> Could you rerun your test with the following env. variable set
>>
>> export OMPI_MCA_coll=self,basic,libnbc
>>
>> and see if that helps?
>>
>> Also, what type of interconnect are you using - Ethernet, IB, ...?
>>
>> Howard
>>
>> 2017-09-19 8:56 GMT-06:00 Cooper Burns <cooper.bu...@convergecfd.com>:
>>> Hello,
>>>
>>> I have been running some simple benchmarks and saw some strange behaviour. All tests are done on 4 nodes with 24 cores each (a total of 96 MPI processes).
>>>
>>> When I run MPI_Allreduce() I see the run time spike up (about 10x) when I go from reducing a total of 4096 KB to 8192 KB. For example, when count is 2^21 (8192 KB of 4-byte ints):
>>>
>>> MPI_Allreduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
>>>
>>> is slower than:
>>>
>>> MPI_Allreduce(send_buf, recv_buf, count/2, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
>>> MPI_Allreduce(send_buf + count/2, recv_buf + count/2, count/2, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
>>>
>>> Just wondering if anyone knows what the cause of this behaviour is.
>>>
>>> Thanks!
>>> Cooper
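As a hedged sketch of what Gilles describes (the parameter names come from Open MPI's coll/tuned component, but the algorithm index, rules file name, and benchmark binary below are illustrative, so check ompi_info on your build before relying on them):

    # List the allreduce algorithms your build knows about, with their indices
    ompi_info --param coll tuned --level 9 | grep allreduce_algorithm

    # Force a single algorithm for all message sizes; dynamic rules must be
    # enabled for the override to take effect
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 4 \
           -np 96 ./allreduce_bench

    # Or redefine the selection rules themselves via a rules file
    # ("allreduce_rules.txt" is a hypothetical file in the coll/tuned
    # dynamic-rules format)
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_dynamic_rules_filename ./allreduce_rules.txt \
           -np 96 ./allreduce_bench

And a minimal, self-contained sketch of the timing comparison from the original post (buffer contents and timing kept deliberately simple; this is not the poster's actual benchmark):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 21;  /* 2^21 ints = 8192 KB of 4-byte ints */
        MPI_Init(&argc, &argv);

        int *send_buf = malloc(count * sizeof(int));
        int *recv_buf = malloc(count * sizeof(int));
        for (int i = 0; i < count; i++)
            send_buf[i] = 1;

        /* One allreduce over the full buffer */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM,
                      MPI_COMM_WORLD);
        double t_full = MPI_Wtime() - t0;

        /* Two allreduces over half the buffer each */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Allreduce(send_buf, recv_buf, count / 2, MPI_INT, MPI_SUM,
                      MPI_COMM_WORLD);
        MPI_Allreduce(send_buf + count / 2, recv_buf + count / 2, count / 2,
                      MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        double t_split = MPI_Wtime() - t0;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("full: %.6f s   split: %.6f s\n", t_full, t_split);

        free(send_buf);
        free(recv_buf);
        MPI_Finalize();
        return 0;
    }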
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users