Open MPI Users,

I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7
using this configure string:

 ./configure --disable-vt --with-tm=/PBS --with-verbs \
     --disable-wrapper-rpath \
     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
     CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
     --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 \
     |& tee configure.pgi15.7.log

It seemed to pass 'make check'.

I'm working on Pleiades at NAS, which has both Sandy Bridge nodes with GPUs
(maia) and regular Sandy Bridge compute nodes without GPUs (hereafter called
Sandy). To be extra careful (since PGI compiles to the architecture you
build on), I took a Westmere node and built Open MPI there just in case.

So, as I said, everything seems to work in a basic test. I grabbed a maia
node, maia1, from an allocation of 4 nodes I had:

(102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(103) $ mpirun -np 2 ./helloWorld.x
Process 0 of 2 is on maia1
Process 1 of 2 is on maia1

Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, from an
allocation of 8 nodes I had:

(49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
(50) $ mpirun -np 2 ./helloWorld.x
[r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

Hmm. Let's try turning off TCP (often my first thought when on an
InfiniBand system):

(51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
[r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
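
For anyone who wants to pull the reporting hosts out of output like this, here
is a minimal sketch. It assumes each error line begins with "[host:pid]" as in
the output above; the function name failing_hosts is just an illustrative
helper, not anything provided by Open MPI.

```shell
# List the unique hosts that reported tcp_peer_send_blocking failures.
# Reads mpirun output on stdin; assumes error lines start with "[host:pid]".
failing_hosts() {
    grep 'tcp_peer_send_blocking' \
        | sed -E 's/^\[([^:]+):.*/\1/' \
        | sort -u
}
```

It could be used as: mpirun -np 2 ./helloWorld.x 2>&1 | failing_hosts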

Now, the nodes reporting the issue seem to be the "other" nodes on the
allocation that are in a different rack:

(52) $ cat $PBS_NODEFILE | uniq
r321i7n16
r321i7n17
r323i5n6
r323i5n7
r323i5n8
r323i5n9
r323i5n10
r323i5n11
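
To make the rack split easier to see, the node file can be summarized per
rack. This is only a sketch: it assumes the leading "r<number>" in each node
name identifies the rack (my reading of the naming scheme, not something I
have confirmed), and rack_summary is a hypothetical helper name.

```shell
# Count distinct allocated nodes per rack.
# Assumes node names look like r<rack>i<...>n<...>; the leading r<rack> is
# taken as the rack ID. Names that do not match are passed through unchanged.
rack_summary() {
    sort -u "$1" \
        | sed -E 's/^(r[0-9]+)i.*/\1/' \
        | sort \
        | uniq -c
}
```

It could be run as: rack_summary "$PBS_NODEFILE"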

Maybe that's a clue? I didn't think this would matter if I only ran two
processes...and it works on the multi-node maia allocation.

I've tried searching the web, but the only mention of
tcp_peer_send_blocking I've found is in a PDF that simply lists it as an
error that can occur:

http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf

Any ideas what this error could mean?

-- 
Matt Thompson

Man Among Men
Fulcrum of History
