Open MPI Users,

I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 using this configure string:

./configure --disable-vt --with-tm=/PBS --with-verbs --disable-wrapper-rpath \
    CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
    CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
    --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee configure.pgi15.7.log

It seemed to pass 'make check'.

I'm working on Pleiades at NAS, which has both Sandy Bridge nodes with GPUs (maia) and regular Sandy Bridge compute nodes without GPUs (hereafter called Sandy). To be extra careful (since PGI compiles for the architecture you build on), I built Open MPI on a Westmere node, just in case.

So, as I said, everything seems to work with a test. I now grab a maia node, maia1, from an allocation of 4 I had:

(102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(103) $ mpirun -np 2 ./helloWorld.x
Process 0 of 2 is on maia1
Process 1 of 2 is on maia1

Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, from an allocation of 8 I had:

(49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(50) $ mpirun -np 2 ./helloWorld.x
[r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

Hmm.

Let's try turning off tcp (often my first thought when on an InfiniBand system):

(51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
[r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

Now, the nodes reporting the issue seem to be the "other" nodes in the allocation, the ones in a different rack:

(52) $ cat $PBS_NODEFILE | uniq
r321i7n16
r321i7n17
r323i5n6
r323i5n7
r323i5n8
r323i5n9
r323i5n10
r323i5n11

Maybe that's a clue? I didn't think this would matter if I only ran two processes... and it works on the multi-node maia allocation.

I've tried searching the web, but the only place I've seen tcp_peer_send_blocking mentioned is a PDF that lists it as an error that can be seen:

http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf

Any ideas about what this error can mean?

--
Matt Thompson
Man Among Men
Fulcrum of History