First, a small comment: use the MPI standard timer MPI_Wtime() instead of gettimeofday(); it's portable across MPI implementations and platforms.
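For instance, a minimal sketch of the pattern (the timed region is a placeholder):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();   /* start of timed region */
        /* ... communication code to be timed ... */
        double t1 = MPI_Wtime();   /* end of timed region */

        /* MPI_Wtime() returns seconds; MPI_Wtick() gives the timer resolution. */
        printf("elapsed: %.1f usec (timer resolution %g sec)\n",
               (t1 - t0) * 1.0e6, MPI_Wtick());

        MPI_Finalize();
        return 0;
    }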
I think the problem is that MPI_Sendrecv_replace needs a temporary buffer. Unless the message is very small, the function uses MPI_Alloc_mem and MPI_Free_mem to allocate and free this temporary buffer. When openib is available, I guess it uses mca_mpool_rdma_alloc and mca_mpool_rdma_register, which are relatively expensive.

Here is an experiment you can try. Write a PMPI wrapper for MPI_Sendrecv_replace(). That is, provide your own function called MPI_Sendrecv_replace() and inside it call PMPI_Sendrecv_replace(). Then, play with alternative implementations of MPI_Sendrecv_replace: e.g., allocate your own memory for the temporary buffer and then call PMPI_Sendrecv(). Try allocating the temporary buffer with MPI_Alloc_mem/MPI_Free_mem and also with malloc/free. I expect the MPI_Alloc_mem version will show the timings you're seeing now. (A rough sketch of such a wrapper follows the quoted message below.) Maybe the experts on this list can comment on what *should* be happening inside OMPI.

Meanwhile, you should probably avoid MPI_Sendrecv_replace if you care about performance. It is mostly a convenience function; if performance matters and you're going to run with different MPIs, you'd be safest using MPI_Sendrecv instead. That means you need a separate send buffer and receive buffer. A little more hassle, perhaps, but it gives you better control over the performance characteristics: e.g., you won't have all those extra allocs and frees, which is almost surely what you get with most MPI implementations.

Götz Waschk wrote:
> Hi everyone,
>
> I'm seeing a very strange effect with the openib btl. It seems to slow
> down my application even if not used at all. For my test, I am running
> a simple application with 8 processes on a single node, so openib
> should not be used at all. Still, the result with the btl enabled is
> much worse:
>
> % /usr/lib64/openmpi/1.3.2-gcc/bin/mpirun -np 8 --mca btl self,sm,openib ./a.out
> 11 tests with 2 x 2 x 2 processes: L0 = 32, L1xL2 = 256 DP spinors
> Overlap over 8 processes: 271769.0 usec
> Overlap over 8 processes: 298237.0 usec
> Overlap over 8 processes: 261648.0 usec
> Overlap over 8 processes: 369170.0 usec
> Overlap over 8 processes: 383065.0 usec
> Overlap over 8 processes: 280675.0 usec
> Overlap over 8 processes: 270912.0 usec
> Overlap over 8 processes: 198789.0 usec
> Overlap over 8 processes: 339857.0 usec
> Overlap over 8 processes: 192087.0 usec
> Overlap over 8 processes: 209025.0 usec
> Average of 10 measurements (skipped first) on 8 processes: 280.3 msec
>
> % /usr/lib64/openmpi/1.3.2-gcc/bin/mpirun -np 8 --mca btl self,sm ./a.out
> 11 tests with 2 x 2 x 2 processes: L0 = 32, L1xL2 = 256 DP spinors
> Overlap over 8 processes: 7445.0 usec
> Overlap over 8 processes: 7355.0 usec
> Overlap over 8 processes: 7311.0 usec
> Overlap over 8 processes: 7473.0 usec
> Overlap over 8 processes: 7409.0 usec
> Overlap over 8 processes: 7449.0 usec
> Overlap over 8 processes: 7261.0 usec
> Overlap over 8 processes: 7451.0 usec
> Overlap over 8 processes: 7430.0 usec
> Overlap over 8 processes: 7320.0 usec
> Overlap over 8 processes: 7384.0 usec
> Average of 10 measurements (skipped first) on 8 processes: 7384.3 usec
>
> This is the default openmpi as shipped with SL5.4 (based on RHEL5.4).
> I have also tested openmpi 1.4, same result. The other MPI shipped by
> Red Hat (mvapich2) does not show this problem. Any idea?
>
> Regards, Götz Waschk
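To make that experiment concrete, here is a rough sketch of such a wrapper, assuming a contiguous datatype (a general wrapper would have to handle arbitrary datatypes, e.g. via MPI_Pack/MPI_Unpack). Swap the malloc/free pair for MPI_Alloc_mem/MPI_Free_mem to compare the two allocation strategies:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Intercept MPI_Sendrecv_replace via the PMPI profiling interface
     * and implement it with our own temporary buffer plus PMPI_Sendrecv.
     * NOTE: this sketch assumes 'datatype' describes a contiguous buffer. */
    int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                             int dest, int sendtag, int source, int recvtag,
                             MPI_Comm comm, MPI_Status *status)
    {
        MPI_Aint lb, extent;
        MPI_Type_get_extent(datatype, &lb, &extent);
        size_t nbytes = (size_t)count * (size_t)extent;

        /* Variant 1: plain malloc/free for the staging buffer. */
        void *tmp = malloc(nbytes);
        if (tmp == NULL)
            return MPI_ERR_NO_MEM;
        memcpy(tmp, buf, nbytes);       /* stage the outgoing data */

        int rc = PMPI_Sendrecv(tmp, count, datatype, dest, sendtag,
                               buf, count, datatype, source, recvtag,
                               comm, status);
        free(tmp);
        return rc;

        /* Variant 2, for comparison: allocate the staging buffer with
         *   MPI_Alloc_mem(nbytes, MPI_INFO_NULL, &tmp);
         * and release it with
         *   MPI_Free_mem(tmp);
         * With openib active, this path may pay for memory registration
         * on every call, which is one plausible source of the slowdown. */
    }

Timing the two variants inside the wrapper should show how much of the per-call cost comes from the allocation strategy rather than from the transfer itself.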