Dear all,
I have successfully compiled and installed Open MPI 1.3.2 on an 8-socket
quad-core machine from Sun. I used both GCC 4.4 and GCC 4.3.3 during the
compilation phase, but when I try to run simple MPI programs, the
processes hang. This is the kernel of the application I am trying to
run:
MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime();
for (i = 0; i < N-1; i++) {
    // printf("%d\n", i);
    if (i > 0)
        MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
    for (k = 0; k < N; k++)
        A[i][k] = (A[i][k] + A[i+1][k] + row[k]) / 3;
}
MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime() - total;
Sometimes the program terminates correctly, sometimes it doesn't! Since
I am running on just one multi-core machine, I use the shared-memory
module with the following command:
mpirun --mca btl self,sm --np 32 ./my_prog prob_size
If I print the loop index during execution, I can see that the program
stops running around index value 1600... but it doesn't actually crash.
It just stops! :(
I ran the program under strace to see what's going on, and this is the
output:
[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
futex(0x2aaaaafcf2b0, FUTEX_WAKE, 1) = 0
readv(100,
[{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0",
28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
futex(0x2aaaaafcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource
temporarily unavailable)
futex(0x2aaaaafcf5e0, FUTEX_WAKE, 1) = 0
writev(102,
[{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"...,
36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}],
2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27,
events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39,
events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50,
events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83,
events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94,
events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN},
{fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN,
revents=POLLIN}, ...], 39, 1000) = 1
readv(100,
[{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0",
28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
writev(109,
[{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"...,
36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}],
2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27,
events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39,
events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50,
events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83,
events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94,
events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN},
{fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN},
...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27,
events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39,
events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50,
events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83,
events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94,
events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN},
{fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN},
...], 39, 1000) = 1
and the program keeps repeating this poll() call until I stop it!
The program runs perfectly with my old configuration, which was Open MPI
1.3.1 compiled with GCC 4.4. Actually, I see the same problem when I
compile Open MPI 1.3.1 with GCC 4.4. Is there some conflict that arises
when GCC 4.4 is used?
Regards, Simone