I have used gprof to profile a program that uses Open MPI. The results show that the code spends a long time in poll (37% on 8 cores, 50% on 16, and 85% on 32). I was wondering if there is anything I can do to reduce the time spent in poll; gprof does not tell me how many calls are made to poll or exactly where they occur. The bulk of my code uses MPI_Ssend exclusively for sends, and MPI_Irecv plus MPI_Wait for receives. Since MPI_Ssend is synchronous and cannot complete until the matching receive has started, I suspect the senders may be busy-waiting. For instance, would there be any expected gain from switching from MPI_Ssend to MPI_Send? Alternatively, would there be any gain in switching to MPI_Isend/MPI_Recv instead of MPI_Ssend/MPI_Irecv?
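To make the question concrete, here is a minimal sketch of the two patterns I am comparing. The function names, `peer`, the tag, and the buffer sizes are placeholders, not my actual code; the second variant uses MPI_Isend with MPI_Irecv/MPI_Waitall, which I understand is the usual deadlock-free way to make both sides nonblocking:

```c
#include <mpi.h>

/* Current pattern: nonblocking receive, synchronous send.
   MPI_Ssend does not complete until the receiver has started
   receiving, so every send waits for its matching receive. */
void exchange_current(double *sendbuf, double *recvbuf, int n, int peer)
{
    MPI_Request req;
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Ssend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* Proposed alternative: both operations nonblocking. A standard-mode
   send (MPI_Isend) may complete eagerly for small messages without
   waiting for the receiver, unlike MPI_Ssend. */
void exchange_alternative(double *sendbuf, double *recvbuf, int n, int peer)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```

The question is whether the second pattern would measurably reduce the time the ranks spend polling, given the setup below.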
Some details: Open MPI 1.4.3, gcc 4.1.2, Red Hat EL5, x86_64, Intel Xeon 2.7 GHz. I am using the sm and tcp BTLs on nodes with 8 cores each (2 quad-core sockets), so 4 nodes for the 32-core runs.