Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
On Wed, 2009-11-18 at 01:28 -0800, Bill Broadley wrote:
> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.
>
> Which led me to this thread.

If you're investigating hangs in a parallel job, take a look at the tool linked below (padb); it should be able to give you a parallel stack trace and the message queues for the job.

http://padb.pittman.org.uk/full-report.html

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - a parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
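(For readers who haven't used padb: the page linked above documents its "full report" mode. From memory, and worth checking against the documentation for your version, the invocation is something along the lines of

  padb --full-report=<jobid>

which prints a parallel stack trace plus the MPI message queues for every process in the job.)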
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Vincent Loechner wrote:
>> A rather stable production code that has worked with various versions of MPI
>> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.
>
> Probably this bug:
> https://svn.open-mpi.org/trac/ompi/ticket/2043
>
> While waiting for a fix, try adding this option to mpirun:
> -mca btl_sm_num_fifos 5

Bill, I noticed you updated the ticket. Thank you.

I've been working on this in earnest. Something funny is going on as far as the "memory model" goes: values written to the shared-memory FIFOs go goofy. For example, a FIFO slot that was initialized to be free, and that still "should be" free, looks occupied when a writer checks it, yet it's empty immediately thereafter, even though presumably no one has accessed that location in between.

I almost have a stand-alone program (C only, no OMPI infrastructure) that demonstrates the problem, but I'm not quite there. Once I am, either it will become evident to me what's wrong, or I'll be able to show other people more easily why I think something is wrong. At this point, I really have no idea whether the problem is in GCC 4.4.x or in OMPI 1.3.x.
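To make the failure mode concrete, below is a minimal sketch, entirely an illustration rather than Eugene's pending test program, of the access pattern in question: two processes hand a shared-memory "FIFO slot" back and forth using nothing but plain loads and stores (the names FREE, OCCUPIED, ITERS, and slot are invented for this sketch). Whether such code behaves depends on what the compiler and hardware guarantee for the shared location; without the volatile qualifier, for instance, the compiler is free to hoist the load out of the spin loop and spin forever on a stale value.

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/wait.h>

  #define FREE     0
  #define OCCUPIED 1
  #define ITERS    1000000L

  int main(void)
  {
      long i;
      /* One shared "FIFO slot", like a slot in the sm BTL's circular buffer.
       * volatile forces a fresh load on every iteration; it does NOT order
       * this access with respect to other memory. */
      volatile int *slot = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (slot == MAP_FAILED) { perror("mmap"); return 1; }
      *slot = FREE;

      if (fork() == 0) {                  /* reader process */
          for (i = 0; i < ITERS; i++) {
              while (*slot != OCCUPIED)   /* spin until the writer fills it */
                  ;
              *slot = FREE;               /* drain: mark the slot free */
          }
          _exit(0);
      }

      for (i = 0; i < ITERS; i++) {       /* writer process */
          while (*slot != FREE)           /* spin until the reader drains it */
              ;
          *slot = OCCUPIED;               /* fill the slot */
      }
      wait(NULL);
      printf("completed %ld hand-offs\n", ITERS);
      return 0;
  }

In a real FIFO there is also payload data next to the flag, so store ordering between payload and flag matters as well; that is where a missing memory barrier, rather than a missing volatile, would bite.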
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Bill,

> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.

Probably this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2043

While waiting for a fix, try adding this option to mpirun:
-mca btl_sm_num_fifos 5

--Vincent
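(Combined with the launch line from the original report at the bottom of this thread, the workaround would look like:

  mpirun -mca btl_sm_num_fifos 5 --mca btl self,sm --np 32 ./my_prog prob_size

If I recall the sm BTL defaults correctly, btl_sm_num_fifos raises the number of receive FIFOs per process above the default of one, so that senders no longer all contend on the same FIFO, sidestepping the code path implicated in ticket 2043.)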
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
A rather stable production code that has worked with various versions of MPI on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3, which led me to this thread.

I made some very small changes to Eugene's code; here's the diff:

$ diff testorig.c billtest.c
3,5c3,4
<
< #define N 4
< #define M 4
---
> #define N 8000
> #define M 8000
17c16
<
---
>   fprintf (stderr, "Initialized\n");
32,33c31,39
<     MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0,
<                   rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
---
>     {
>       if ((me == 0) && (i % 100 == 0))
>         {
>           fprintf (stderr, "%d\n", i);
>         }
>       MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
>                     MPI_COMM_WORLD, &status);
>     }
>

Basically: print some occasional progress, and increase M and N.

I'm running on a new Intel dual-socket Nehalem system with CentOS 5.4. I compiled gcc-4.4.2 and openmpi myself with all the defaults, except that I had to point gcc at mpfr-2.4.1.

If I run:

$ mpirun -np 4 ./billtest

then about 1 time in 2 I get something like:

[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100

The next run worked; the run after that:

[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500

The next run hung at 7100. The run after that worked.

If I strace a hung job, I get something like:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)

If I run gdb on a hung job (compiled with -O4 -g):

(gdb) bt
#0  0x2ab3b34cb385 in ompi_request_default_wait () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#1  0x2ab3b34f0d48 in PMPI_Sendrecv () from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#2  0x00400b88 in main (argc=1, argv=0x7fff083fd298) at billtest.c:36
(gdb)

If I recompile with -O1 I get the same thing; even with plain -g I get the same thing. If I compile the application with gcc-4.3 but still use a gcc-4.4-compiled openmpi, I still get hangs. If I compile openmpi-1.3.3 with gcc-4.3 and the application with gcc-4.3 and run it 20 times, I get zero hangs. It seems that gcc-4.4 and openmpi-1.3.3 are incompatible.

In my production code I'd always get hung at MPI_Waitall, but the above is obviously inside MPI_Sendrecv. To be paranoid, I just reran the all-gcc-4.3 build 40 times without a hang.

Original code below.

Eugene Loh wrote:
...
> #include <stdio.h>
> #include <mpi.h>
>
> #define N 4
> #define M 4
>
> int main(int argc, char **argv) {
>   int np, me, i, top, bottom;
>   float sbuf[N], rbuf[N];
>   MPI_Status status;
>
>   MPI_Init(&argc,&argv);
>   MPI_Comm_size(MPI_COMM_WORLD,&np);
>   MPI_Comm_rank(MPI_COMM_WORLD,&me);
>
>   top    = me + 1; if ( top >= np ) top    -= np;
>   bottom = me - 1; if ( bottom < 0 ) bottom += np;
>
>   for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>   for ( i = 0; i < N; i++ ) rbuf[i] = 0;
>
>   MPI_Barrier(MPI_COMM_WORLD);
>   for ( i = 0; i < M - 1; i++ )
>     MPI_Sendrecv(sbuf, N, MPI_FLOAT, top, 0,
>                  rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>   MPI_Barrier(MPI_COMM_WORLD);
>
>   MPI_Finalize();
>   return 0;
> }
>
> Can you reproduce your problem with this test case?
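For anyone who wants to run Bill's version directly, here is billtest.c reconstructed by applying his diff to Eugene's program. The two #include lines were eaten by the archive; <mpi.h> is certain, and <stdio.h> is inferred because the diff adds fprintf calls without adding any include.

  #include <stdio.h>
  #include <mpi.h>

  #define N 8000
  #define M 8000

  int main(int argc, char **argv) {
    int np, me, i, top, bottom;
    float sbuf[N], rbuf[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    fprintf(stderr, "Initialized\n");

    /* ring neighbors: send up the ring, receive from below */
    top    = me + 1; if ( top >= np ) top    -= np;
    bottom = me - 1; if ( bottom < 0 ) bottom += np;

    for ( i = 0; i < N; i++ ) sbuf[i] = 0;
    for ( i = 0; i < N; i++ ) rbuf[i] = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    for ( i = 0; i < M - 1; i++ )
      {
        if ((me == 0) && (i % 100 == 0))
          {
            fprintf(stderr, "%d\n", i);   /* occasional progress report */
          }
        MPI_Sendrecv(sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
      }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
  }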
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Simone Pellegrini wrote:
> Sorry for the delay, but I did some additional experiments to find out whether the problem was openmpi or gcc!
>
> The program just hangs... and never terminates! I am running on an SMP machine with 32 cores; actually, it is a Sun Fire X4600 X2 (8 quad-core Barcelona AMD chips). The OS is CentOS 5 and the kernel is 2.6.18-92.el5.src-PAPI (patched with PAPI).
>
> I use an N of 1024, and if I print out the value of the iterator i, it sometimes stops around 165, other times around 520... and it doesn't make any sense.
>
> If I run the program with a different MPI installation (and it's important to note that I don't recompile it; I just use another mpirun from a different MPI version), the program works fine. I did some experiments during the weekend, and if I use openmpi-1.3.2 compiled with gcc-4.3.3, everything works fine. So I really think the problem is strictly related to the usage of gcc-4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs even when I use openmpi 1.3.1 compiled with gcc 4.4!

I finally got GCC 4.4, but was unable to reproduce the problem.

How small can you make np (the number of MPI processes) and still see the problem?

How reproducible is the problem?

When it hangs, can you get stack traces of all the processes? We're trying to hunt down some similar behavior, but I think yours is of a different flavor.
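When it comes to collecting those stack traces, one low-tech way on a single node is to attach gdb to each rank in batch mode. A sketch, assuming the hung binary is named ./bug as in Simone's mails and that you are allowed to ptrace the processes:

  for pid in $(pgrep -f ./bug); do
      echo "==== pid $pid ===="
      gdb -batch -p "$pid" -ex "bt"   # print one backtrace, then detach
  done

(padb, mentioned elsewhere in this thread, automates exactly this kind of collection.)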
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Hi,

sorry for the delay, but I did some additional experiments to find out whether the problem was openmpi or gcc!

Attached you will find the program that causes the problem mentioned before. I compile the program with the following line:

$HOME/openmpi-1.3.2-gcc44/bin/mpicc -O3 -g -Wall -fmessage-length=0 -m64 bug.c -o bug

When I run the program using openmpi 1.3.2 compiled with gcc44 in the following way:

$HOME/openmpi-1.3.2-gcc44/bin/mpirun --mca btl self,sm --np 32 ./bug 1024

the program just hangs... and never terminates!

I am running on an SMP machine with 32 cores; actually, it is a Sun Fire X4600 X2 (8 quad-core Barcelona AMD chips). The OS is CentOS 5 and the kernel is 2.6.18-92.el5.src-PAPI (patched with PAPI).

I use an N of 1024, and if I print out the value of the iterator i, it sometimes stops around 165, other times around 520... and it doesn't make any sense.

If I run the program with a different MPI installation (and it's important to note that I don't recompile it; I just use another mpirun from a different MPI version), the program works fine. I did some experiments during the weekend, and if I use openmpi-1.3.2 compiled with gcc-4.3.3, everything works fine. So I really think the problem is strictly related to the usage of gcc-4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs even when I use openmpi 1.3.1 compiled with gcc 4.4!

I hope everything is clear now.

regards, Simone

Eugene Loh wrote:
> So far, I'm unable to reproduce this problem. I haven't exactly reproduced your test conditions, but then I can't: at a minimum, I don't have exactly the code you ran (and I'm not convinced I want it!). So:
>
> *) Can you reproduce the problem with the stand-alone test case I sent out?
> *) Does the problem correlate with the OMPI version? (I.e., 1.3.1 versus 1.3.2.)
> *) Does the problem occur at lower np?
> *) Does the problem correlate with the compiler version? (I.e., GCC 4.4 versus 4.3.3.)
> *) What is the failure rate? How many times should I expect to run to see failures?
> *) How large is N?
>
> [earlier exchange, including the quoted kernel and strace output, trimmed; it appears in full in the messages below]
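For readers following along: the "same binary, different mpirun" comparison Simone describes would look something like the lines below. The gcc-4.3.3 install path is a guess, modeled on her gcc44 naming, and LD_LIBRARY_PATH has to point at the matching lib directory for the runtime swap to take effect.

  # hangs intermittently:
  $HOME/openmpi-1.3.2-gcc44/bin/mpirun --mca btl self,sm --np 32 ./bug 1024

  # same binary, no recompile, works:
  export LD_LIBRARY_PATH=$HOME/openmpi-1.3.2-gcc433/lib:$LD_LIBRARY_PATH
  $HOME/openmpi-1.3.2-gcc433/bin/mpirun --mca btl self,sm --np 32 ./bug 1024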
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
So far, I'm unable to reproduce this problem. I haven't exactly reproduced your test conditions, but then I can't: at a minimum, I don't have exactly the code you ran (and I'm not convinced I want it!). So:

*) Can you reproduce the problem with the stand-alone test case I sent out?
*) Does the problem correlate with the OMPI version? (I.e., 1.3.1 versus 1.3.2.)
*) Does the problem occur at lower np?
*) Does the problem correlate with the compiler version? (I.e., GCC 4.4 versus 4.3.3.)
*) What is the failure rate? How many times should I expect to run to see failures?
*) How large is N?

Eugene Loh wrote:
> Simone Pellegrini wrote:
>> Dear all, I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. [...]
> [the rest of the earlier exchange, including the quoted kernel and strace output, trimmed; it appears in full in the messages below]
Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
I'm responsible for some sm changes in 1.3.2, so I can try looking at this. Some questions below.

Simone Pellegrini wrote:
> Dear all,
>
> I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs, processes hang. Actually, this is the kernel of the application I am trying to run:
>
> MPI_Barrier(MPI_COMM_WORLD);
> total = MPI_Wtime();
> for(i=0; i<...; i++){
>     if(i>0)
>         MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
>                      row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>     for(k=0; k<...
> [the loop bounds and the rest of the kernel were eaten by the archive]

Do you know if this kernel is sufficient to reproduce the problem? How large is N? Evidently, it's greater than 1600, but I'm still curious how big. What are top and bottom? Are they rank+1 and rank-1?

> Sometimes the program terminates correctly, sometimes it doesn't!

Roughly, what fraction of runs hang? 50%? 1%? <0.1%?

> I am running the program using the shared-memory module, because I am using just one multi-core machine, with the following command:
>
> mpirun --mca btl self,sm --np 32 ./my_prog prob_size

Any idea if this fails at lower np?

> If I print the index number during the program execution, I can see that the program stops running around index value 1600... but it actually doesn't crash. It just stops! :(
>
> I ran the program under strace to see what's going on and this is the output:
>
> [...]
> [futex/readv/writev/poll output trimmed; it appears in full in the original message below]
> and the program keeps printing this poll() call till I stop it!
[OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Dear all,

I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs, processes hang. Actually, this is the kernel of the application I am trying to run:

MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime();
for(i=0; i<...; i++){
    if(i>0)
        MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
    for(k=0; k<...
[the loop bounds and the rest of the kernel were eaten by the archive]

Sometimes the program terminates correctly, sometimes it doesn't!

I am running the program using the shared-memory module, because I am using just one multi-core machine, with the following command:

mpirun --mca btl self,sm --np 32 ./my_prog prob_size

If I print the index number during the program execution, I can see that the program stops running around index value 1600... but it actually doesn't crash. It just stops! :(

I ran the program under strace to see what's going on and this is the output:

[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
futex(0x2afcf2b0, FUTEX_WAKE, 1) = 0
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
futex(0x2afcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2afcf5e0, FUTEX_WAKE, 1) = 0
writev(102, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39, 1000) = 1
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
writev(109, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1

and the program keeps printing this poll() call till I stop it!

The program runs perfectly with my old configuration, which was OpenMPI 1.3.1 compiled with Gcc-4.3.3. Actually, I see the same problem when I compile Openmpi-1.3.1 with Gcc-4.4. Is there any conflict that arises when gcc-4.4 is used?

Regards,
Simone
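(A note for later readers: rather than launching the whole job under strace as above, it is often easier to attach to a single already-hung rank. Something like

  strace -f -tt -p <pid>

attaches to that process, follows any threads it spawned (-f), and timestamps each syscall (-tt), which makes the poll()-every-second pattern above easy to spot.)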