Hi,
Sorry for the delay, but I did some additional experiments to find out whether the problem was Open MPI or gcc!

Attached you will find the program that causes the problem mentioned before.
I compile the program with the following line:

$HOME/openmpi-1.3.2-gcc44/bin/mpicc -O3 -g -Wall -fmessage-length=0 -m64 bug.c -o bug

When I run the program using Open MPI 1.3.2 compiled with gcc 4.4 in the following way:

$HOME/openmpi-1.3.2-gcc44/bin/mpirun --mca btl self,sm --np 32 ./bug 1024

The program just hangs and never terminates! I am running on an SMP machine with 32 cores, actually a Sun Fire X4600 X2 (8 quad-core Barcelona AMD chips); the OS is CentOS 5 and the kernel is 2.6.18-92.el5.src-PAPI (patched with PAPI). I use an N of 1024, and if I print out the value of the iterator i, sometimes it stops around 165, other times around 520... and it doesn't make any sense.
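
As a side note, the printout itself can be misleading once the program hangs, since stdout from 32 ranks is buffered and interleaved. Here is a minimal sketch of a more robust variant of that printout, assuming the same loop and variables as in the attached bug.c (rank, top, bottom, row, A, status); it tags each line with the rank and flushes immediately, so the last iteration reached by every rank is still visible after the hang:

/* Sketch only: progress printout with a rank tag and an explicit flush,
 * reusing the loop and variables from the attached bug.c. */
for (i = 0; i < N - 1; i++) {
    fprintf(stderr, "rank %d: iteration %d\n", rank, i);
    fflush(stderr);
    if (i > 0)
        MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
}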

If I run the program with the mpirun from a different Open MPI build (and it's important to note that I don't recompile it, I just use the other mpirun), it works fine. I did some experiments during the weekend, and if I use openmpi-1.3.2 compiled with gcc 4.3.3, everything works fine.

So I really think the problem is strictly related to the use of gcc 4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs even when I use OpenMPI 1.3.1 compiled with gcc 4.4!

I hope everything is clear now.

regards, Simone

Eugene Loh wrote:
So far, I'm unable to reproduce this problem. I haven't exactly reproduced your test conditions, but then I can't. At a minimum, I don't have exactly the code you ran (and I'm not convinced I want to!). So:

*) Can you reproduce the problem with the stand-alone test case I sent out?
*) Does the problem correlate with OMPI version? (I.e., 1.3.1 versus 1.3.2.)
*) Does the problem occur at lower np?
*) Does the problem correlate with the compiler version? (I.e., GCC 4.4 versus 4.3.3.)
*) What is the failure rate? How many times should I expect to run to see failures?
*) How large is N?

Eugene Loh wrote:

Simone Pellegrini wrote:

Dear all,
I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun.

I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:

    MPI_Barrier(MPI_COMM_WORLD);
    total = MPI_Wtime();
    for(i=0; i<N-1; i++){
        // printf("%d\n", i);
        if(i>0)
            MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                         row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
        for(k=0; k<N; k++)
            A[i][k] = (A[i][k] + A[i+1][k] + row[k])/3;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    total = MPI_Wtime() - total;


Do you know if this kernel is sufficient to reproduce the problem? How large is N? Evidently, it's greater than 1600, but I'm still curious how big. What are top and bottom? Are they rank+1 and rank-1?
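
For reference, here is a minimal sketch (assuming np ranks on a periodic 1-D ring, with rank, np, top and bottom declared as ints) of the two equivalent ways those neighbors are computed elsewhere in this thread: explicit rank arithmetic with wraparound, as in the stand-alone test below, and MPI_Cart_shift on a periodic Cartesian communicator, as in the attached bug.c.

/* Sketch only: two equivalent ways to obtain the ring neighbors for np
 * ranks; both yield top = rank+1 and bottom = rank-1 with wraparound. */

/* (a) Explicit rank arithmetic, as in the stand-alone test below. */
top    = rank + 1;  if (top    >= np) top    -= np;
bottom = rank - 1;  if (bottom <  0 ) bottom += np;

/* (b) A periodic 1-D Cartesian topology, as in the attached bug.c. */
MPI_Comm cart;
int periodic = 1;
MPI_Cart_create(MPI_COMM_WORLD, 1, &np, &periodic, 0, &cart);
MPI_Cart_shift(cart, 0, 1, &bottom, &top);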

Sometimes the program terminates correctly, sometimes it doesn't!


Roughly, what fraction of runs hang?  50%?  1%?  <0.1%?

I am running the program using the shared memory module, because I am using just one multi-core machine, with the following command:

mpirun --mca btl self,sm --np 32 ./my_prog prob_size


Any idea if this fails at lower np?

If I print the index number during program execution I can see that the program stops running around index value 1600... but it actually doesn't crash. It just stops! :(

I ran the program under strace to see what's going on, and this is the output:
[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1)    = 1
futex(0x2aaaaafcf2b0, FUTEX_WAKE, 1)    = 0
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1)        = 1
futex(0x2aaaaafcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2aaaaafcf5e0, FUTEX_WAKE, 1)    = 0
writev(102, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39, 1000) = 1
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1)        = 1
writev(109, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1

and the program keeps printing this poll() call until I stop it!

The program runs perfectly with my old configuration, which was OpenMPI 1.3.1 compiled with Gcc-4.4. Actually, I see the same problem when I compile OpenMPI 1.3.1 with Gcc 4.4. Is there any conflict that arises when gcc-4.4 is used?


I don't understand this. It runs well with 1.3.1/4.4, but you see the same problem with 1.3.1/4.4? I'm confused: do you or don't you see the problem with 1.3.1/4.4? What do you think is the crucial factor here? OMPI rev or GCC rev?

I'm not sure I can replicate all of your test system (hardware, etc.), but some sanity tests on what I have so far have turned up clean. I run:

#include <stdio.h>
#include <mpi.h>

#define N 40000
#define M 40000

int main(int argc, char **argv) {
  int np, me, i, top, bottom;
  float sbuf[N], rbuf[N];
  MPI_Status status;

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&np);
  MPI_Comm_rank(MPI_COMM_WORLD,&me);

  top    = me + 1;   if ( top  >= np ) top    -= np;
  bottom = me - 1;   if ( bottom < 0 ) bottom += np;

  for ( i = 0; i < N; i++ ) sbuf[i] = 0;
  for ( i = 0; i < N; i++ ) rbuf[i] = 0;

  MPI_Barrier(MPI_COMM_WORLD);
  for ( i = 0; i < M - 1; i++ )
    MPI_Sendrecv(sbuf, N, MPI_FLOAT, top   , 0,
                 rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}


/*
 * bug.c
 *
 *  Created on: May 4, 2009
 *      Author: motonacciu
 */

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void create_matrix(float*** m, int N, int M){
	int i;
	float* matrix = malloc(sizeof(float) * N * M); // create the main matrix
	*m = (float**) malloc(sizeof(float*) * N); // create an index vector
	for(i=0; i<N; i++)
		(*m)[i] = &matrix[i*M]; // make the index vector point to the matrix rows
}

int main(int argc, char** argv){

	int N = atoi(argv[1]);
	int ntasks, rank;

	float **A;
	create_matrix(&A, N, N);

	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	MPI_Comm cart;
	int bottom, top;
	int periodic = 1;
	MPI_Cart_create(MPI_COMM_WORLD, 1, &ntasks, &periodic, 0, &cart);
	MPI_Cart_shift(cart, 0, 1, &bottom, &top);

	int i, k;
	MPI_Status status;
	double  total = 0.0;
	float *row = (float*) malloc(sizeof(float) * N);
	memset(row, 0, sizeof(float) * N);

	MPI_Barrier(MPI_COMM_WORLD);
	total = MPI_Wtime();
	for(i=0; i<N-1; i++){
		printf("%d\n", i);
		if(i>0)
			MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
	}
	MPI_Barrier(MPI_COMM_WORLD);
	total = MPI_Wtime() - total;

	if(rank==0)
		printf("%d, %d, %0.3f\n", ntasks, N, total);

	MPI_Finalize();
	free(*A);
	free(A);
	return 0;
}
