Dear all,
I have successfully compiled and installed Open MPI 1.3.2 on an 8-socket quad-core machine from Sun.

I used both GCC 4.4 and GCC 4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:

MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime();
for (i = 0; i < N - 1; i++) {
    // printf("%d\n", i);
    if (i > 0)
        /* exchange row i-1 with the top/bottom neighbours */
        MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
    for (k = 0; k < N; k++)
        A[i][k] = (A[i][k] + A[i+1][k] + row[k]) / 3;
}
MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime() - total;

Sometimes the program terminates correctly, sometimes it doesn't! I am running the program with the shared-memory (sm) BTL, since everything runs on a single multi-core machine, using the following command:

mpirun --mca btl self,sm --np 32 ./my_prog prob_size
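(For completeness: to check whether the hang is specific to the sm module, the same run can be forced over loopback TCP instead of shared memory. This is just a sketch of the same command with a different BTL selection, nothing I have measured yet.)

```shell
# Same program and arguments as above, but with the TCP BTL
# instead of the shared-memory BTL ("prob_size" as above)
mpirun --mca btl self,tcp --np 32 ./my_prog prob_size
```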

If I print the loop index during execution, I can see that the program stops making progress around index value 1600... but it doesn't actually crash. It just stops! :(

I ran the program under strace to see what's going on, and this is the output:
[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1)    = 1
futex(0x2aaaaafcf2b0, FUTEX_WAKE, 1)    = 0
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1)        = 1
futex(0x2aaaaafcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2aaaaafcf5e0, FUTEX_WAKE, 1)    = 0
writev(102, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39, 1000) = 1
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1)        = 1
writev(109, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1

and the program keeps repeating this poll() call until I stop it!

The program ran perfectly with my old configuration, which was Open MPI 1.3.1. In fact, I see the same problem when I compile Open MPI 1.3.1 with GCC 4.4. Is there some conflict that arises when GCC 4.4 is used?

Regards, Simone
