Starting with the 1.7 series, OMPI by default launches a daemon on every node in 
the allocation during startup. This is done so we can “probe” the topology of 
the nodes and use that info during the process-mapping procedure - e.g., if you 
want to map-by NUMA regions.
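For example (just a sketch - executable name and process count are placeholders):

   mpirun --map-by numa -np 8 ./a.out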

What is happening here is that some of the nodes in your allocation aren’t 
allowing those daemons to call back to mpirun - either a firewall is in the 
way, or something else is blocking the connection.

If you don’t want to launch on those other nodes, you could just add --novm to 
your cmd line, or use the --host option to restrict us to your local node. 
However, I imagine you got the bigger allocation so you could use it :-)
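For example (assuming r321i7n16 is your local node, per your nodefile):

   mpirun -novm -np 2 ./helloWorld.x
   mpirun --host r321i7n16 -np 2 ./helloWorld.x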

In which case, you need to remove the obstacle. You might check for a 
firewall, or check whether the non-maia nodes have multiple NICs (this can 
sometimes confuse things, especially if someone put the NICs on the same IP 
subnet).
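If multiple NICs do turn out to be the issue, you can also point the daemon 
callbacks at a specific interface or subnet with the oob_tcp_if_include MCA 
param - e.g. (the interface name here is just an example; substitute whatever 
those nodes actually have):

   mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x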

HTH
Ralph



> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
> 
> Open MPI Users,
> 
> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7 using 
> this configure string:
> 
>  ./configure --disable-vt --with-tm=/PBS --with-verbs --disable-wrapper-rpath \
>     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
>     CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
>     --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee configure.pgi15.7.log
> 
> It seemed to pass 'make check'. 
> 
> I'm working on Pleiades at NAS, which has both Sandy Bridge nodes with GPUs 
> (maia) and regular Sandy Bridge compute nodes without (hereafter called 
> Sandy). To be extra careful (since PGI compiles to the architecture you 
> build on), I built Open MPI on a Westmere node just in case.
> 
> So, as I said, all seems to work with a test. I now grab a maia node, maia1, 
> from an allocation of 4 nodes I had:
> 
> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
> (103) $ mpirun -np 2 ./helloWorld.x
> Process 0 of 2 is on maia1 
> Process 1 of 2 is on maia1 
> 
> Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, from an 
> allocation of 8 nodes I had:
> 
> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
> (50) $ mpirun -np 2 ./helloWorld.x
> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> 
> Hmm. Let's try turning off tcp (often my first thought when on an Infiniband 
> system):
> 
> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
> 
> Now, the nodes reporting the issue seem to be the "other" nodes on the 
> allocation that are in a different rack:
> 
> (52) $ cat $PBS_NODEFILE | uniq
> r321i7n16
> r321i7n17
> r323i5n6
> r323i5n7
> r323i5n8
> r323i5n9
> r323i5n10
> r323i5n11
> 
> Maybe that's a clue? I didn't think this would matter if I only ran two 
> processes...and it works on the multi-node maia allocation.
> 
> I've tried searching the web, but the only place I've found 
> tcp_peer_send_blocking mentioned is a PDF that simply lists it as an error 
> one might see:
> 
> http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf
> 
> Any ideas for what this error can mean?
> 
> -- 
> Matt Thompson
> Man Among Men
> Fulcrum of History
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/09/27669.php
