Hi again,
I managed to reproduce the "bug" with a simple case (see the cpp file attached).

I am running this on 2 nodes with 8 cores each. If I run with
mpiexec ./test-mpi-latency.out

then the MPI_Ssend operations take ~1e-5 seconds for intra-node ranks, and ~11 seconds for inter-node ranks. Note that 11 seconds is roughly the time required to execute the loop that follows the MPI_Recv. The average time for an MPI_Ssend to return is 5.1 seconds.

If I run with:
mpiexec --mca btl ^openib ./test-mpi-latency.out

then intra-node communications take ~0.5e-5 to 1e-5 seconds, while inter-node communications take ~1e-6 seconds, for an average of ~5e-5 seconds.

I compiled this with gcc 4.7.2 + openmpi 1.6.3, as well as with gcc 4.6.1 + openmpi 1.4.5. Both show the same behavior.

However, on the same machine, with gcc 4.6.1 + mvapich2/1.8, the latency is always quite low.

The fact that mvapich2 does not show this behavior points to a problem with the openib btl within openmpi, rather than with our setup.

Can anyone try to reproduce this on a different machine?

Thanks,

Maxime Boissonneault

On 2013-02-15 14:29, Maxime Boissonneault wrote:
Hi again,
I found out that if I add an MPI_Barrier after the MPI_Recv part, then there is no minute-long latency.

Is it possible that even if MPI_Recv returns, the openib btl does not guarantee that the acknowledgement is sent promptly? In other words, is it possible that the computation following the MPI_Recv delays the acknowledgement? If so, is it supposed to work this way, and why isn't the same behavior observed with the tcp btl?
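For reference, the receiving side of the workaround looks like this (a sketch based on the attached test case; to_send, size, size2, to_bcast and my_mpi_tag_send are the variables from that file, and rank 0 calls a matching MPI_Barrier after its send loop):

```cpp
// Receiving ranks (sketch; variables are those of the attached test case).
MPI_Recv(to_send, size, MPI_CHAR, 0, my_mpi_tag_send,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
// Adding this synchronization point before starting the computation makes
// the minute-long MPI_Ssend latency on rank 0 disappear.
MPI_Barrier(MPI_COMM_WORLD);
for (int j = 0; j < 10; j++)
    for (int i = 0; i < size2; i++)
        to_bcast[i] = rand() % CHAR_MAX;
```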

Maxime Boissonneault


On 2013-02-14 11:50, Maxime Boissonneault wrote:
Hi,
I have a strange case here. The application is "plink" (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The computation/communication pattern of the application is the following:

1- MPI_Init
2- Some single rank computation
3- MPI_Bcast
4- Some single rank computation
5- MPI_Barrier
6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time; the other ranks receive it with MPI_Recv
7- Some single rank computation
8- the other ranks send their results to rank 0 with MPI_Ssend; rank 0 receives them with MPI_Recv
9- rank 0 analyses result
10- MPI_Finalize

The amount of data being sent is of the order of the kilobytes, and we have IB.
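The steps above boil down to the following skeleton (a sketch for illustration, not plink's actual code; buffer sizes and tags are placeholders):

```cpp
#include <mpi.h>

// Hypothetical skeleton of plink's communication pattern (steps 1-10 above);
// kB-sized buffers, actual computation omitted. Run under mpiexec.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                                  // step 1
    int rank, nrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nrank);

    char data[4096], result[4096];
    /* step 2: single rank computation */
    MPI_Bcast(data, sizeof(data), MPI_CHAR, 0, MPI_COMM_WORLD);  // step 3
    /* step 4: single rank computation */
    MPI_Barrier(MPI_COMM_WORLD);                             // step 5

    if (rank == 0) {
        for (int r = 1; r < nrank; r++)                      // step 6, one rank at a time
            MPI_Ssend(data, sizeof(data), MPI_CHAR, r, 0, MPI_COMM_WORLD);
        for (int r = 1; r < nrank; r++)                      // step 8, collect results
            MPI_Recv(result, sizeof(result), MPI_CHAR, r, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* step 9: rank 0 analyses results */
    } else {
        MPI_Recv(data, sizeof(data), MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);         // step 6
        /* step 7: single rank computation -- where the delay shows up */
        MPI_Ssend(result, sizeof(result), MPI_CHAR, 0, 1, MPI_COMM_WORLD); // step 8
    }
    MPI_Finalize();                                          // step 10
    return 0;
}
```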

The problem we observe is in step 6. I've output timestamps before and after each MPI operation. With the openib btl, the behavior I observe is that:
- rank 0 starts sending
- rank n receives almost instantly, and MPI_Recv returns.
- rank 0's MPI_Ssend often returns only _minutes_ later.

It looks like the acknowledgement from rank n takes minutes to reach rank 0.

Now, the tricky part is that if I disable the openib btl to use instead tcp over IB, there is no such latency and the acknowledgement comes back within a fraction of a second. Also, if rank 0 and rank n are on the same node, the acknowledgement is also quasi-instantaneous (I guess it goes through the SM btl instead of openib).

I tried to reproduce this in a simple case, but I observed no such latency. The duration that I got for the whole communication is of the order of milliseconds.

Does anyone have an idea of what could cause such very high latencies when using the OpenIB BTL ?

Also, I tried replacing step 6 by explicitly sending a confirmation:
- rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
- rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0

In this case also, rank n's MPI_Isend executes quasi-instantaneously, and rank 0's MPI_Recv only returns a few minutes later.
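In code, the explicit-confirmation variant looks roughly like this (a sketch, not plink's actual code; ack is a hypothetical 1-byte buffer and the tags are placeholders):

```cpp
// Explicit-confirmation variant of step 6 (sketch).
// Rank 0, for each peer r:
MPI_Request req;
MPI_Isend(to_send, size, MPI_CHAR, r, 0, MPI_COMM_WORLD, &req);
MPI_Recv(ack, 1, MPI_CHAR, r, 1, MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);                       // only returns minutes later
MPI_Wait(&req, MPI_STATUS_IGNORE);

// Rank r:
MPI_Recv(to_send, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Isend(ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req);  // returns almost instantly
MPI_Wait(&req, MPI_STATUS_IGNORE);
```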

Thanks,

Maxime Boissonneault




--
---------------------------------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

#include <iostream>
#include <fstream>
#include <unistd.h>
#include <sstream>
#include <cstdlib>
#include <climits>
#include <cstring>
#include <omp.h>
#include <time.h>
#include "mpi.h"

using namespace std;
static struct timespec start, end, duration;
static int my_rank, nrank;
static int my_mpi_tag_send=0;
void clock_start() {
        clock_gettime(CLOCK_MONOTONIC,&start);
}
double clock_end(const string & op, int rank_print=0) {
        double duration_in_sec;
        clock_gettime(CLOCK_MONOTONIC,&end);
        // Compute end - start and normalize so that 0 <= tv_nsec < 1e9.
        duration.tv_sec = end.tv_sec - start.tv_sec;
        duration.tv_nsec = end.tv_nsec - start.tv_nsec;
        while (duration.tv_nsec >= 1000000000) { duration.tv_sec++; duration.tv_nsec -= 1000000000; }
        while (duration.tv_nsec < 0) { duration.tv_sec--; duration.tv_nsec += 1000000000; }
        duration_in_sec = duration.tv_sec + double(duration.tv_nsec)/1000000000.;
        if (my_rank == rank_print)
                cout << "Operation \"" << op << "\" done. Took: " << duration_in_sec << " seconds." << endl;
        return duration_in_sec;
}
int main(int argc, char ** argv) {
        clock_start();
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
        MPI_Comm_size(MPI_COMM_WORLD,&nrank);
        clock_end("MPI_Init");
        double total_duration=0;
        int size=363;
        int size2=1024*1024*50;
        if (my_rank == 0) {
                cout << "We have " << nrank << " MPI ranks." << endl;
                // Check the largest threshold first, otherwise the
                // "millions" branch is unreachable.
                if (size > 1000000)
                        cout << "Size of the send buffer :" << size / 1000000. << " millions." << endl;
                else if (size > 1000)
                        cout << "Size of the send buffer :" << size / 1000. << " thousands." << endl;
                else
                        cout << "Size of the send buffer :" << size << "." << endl;
        }
        char * to_bcast = new char[size2];
        char * to_send = new char[size];

        clock_start();
        for (int j=0; j<10; j++)
                for (int i=0; i<size2; i++) to_bcast[i] = rand() % CHAR_MAX;
        clock_end("bcast buffer init"); 

        clock_start();
        MPI_Bcast(to_bcast,size2,MPI_CHAR,0,MPI_COMM_WORLD);
        clock_end("MPI_Bcast");

        MPI_Barrier(MPI_COMM_WORLD);
        if (my_rank != 0) {
                MPI_Recv(to_send,size,MPI_CHAR,0,my_mpi_tag_send,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
                // Post-receive computation; with the openib btl this appears
                // to delay the acknowledgement back to rank 0's MPI_Ssend.
                for (int j=0; j<10; j++)
                        for (int i=0; i<size2; i++) to_bcast[i] = rand() % CHAR_MAX;
        }
        else {
                for (int r=1; r<nrank; r++) {
                        cout << "rank " << r << endl;
                        clock_start();
                        for (int i=0; i<size; i++) to_send[i] = rand() % CHAR_MAX;
                        clock_end("send buffer init");

                        clock_start();
                        MPI_Ssend(to_send,size,MPI_CHAR,r,my_mpi_tag_send,MPI_COMM_WORLD);
                        total_duration += clock_end("send buffer sent");
                }
        }

        if (my_rank == 0) cout << "Average send duration : " << total_duration / (nrank-1) << " seconds." << endl;

        delete[] to_bcast;
        delete[] to_send;
        clock_start();
        MPI_Finalize();
        clock_end("MPI_Finalize");
}
