On Thu, Sep 24, 2015 at 12:10 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Ah, sorry - wrong param. It’s the out-of-band that is having the problem.
> Try adding --mca oob_tcp_if_include <foo>
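> (<foo> being a placeholder - it should, if I remember right, take either an
> interface name or a CIDR-style subnet, e.g. something along the lines of:
>
> mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
> mpirun --mca oob_tcp_if_include 10.1.0.0/16 -np 2 ./helloWorld.x
>
> with the subnet above being purely made up - use whatever your IB network
> really is.)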
>

Ooh. Okay. Look at this:

(13) $ mpirun --mca oob_tcp_if_include ib0 -np 2 ./helloWorld.x
Process 1 of 2 is on r509i2n17
Process 0 of 2 is on r509i2n17

So that is nice. Now, the spin-up when I have 8 or so nodes is rather...
slow, but at this point I'll take working over efficient. Quick startup can
come later.
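
As an aside, since I suspect I'll want this setting every time on these nodes,
my assumption is I can make it persistent instead of retyping it - either via
the environment:

  export OMPI_MCA_oob_tcp_if_include=ib0

or by putting

  oob_tcp_if_include = ib0

in $HOME/.openmpi/mca-params.conf. (That's my understanding of how MCA
parameters can be set, anyway - please correct me if I have that wrong.)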

Matt



>
>
> On Sep 24, 2015, at 8:56 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> Ralph,
>
> I believe these nodes might have both an Ethernet and an InfiniBand port,
> where the Ethernet port is not the one to use. Is there a way to tell Open
> MPI to ignore any Ethernet devices it sees? I've tried:
>
> --mca btl sm,openib,self
>
> and (based on the advice of the much more intelligent support at NAS):
>
> --mca btl openib,self --mca btl_openib_if_include mlx4_0,mlx4_1
>
> But neither worked.
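> (I also wondered about excluding the Ethernet device explicitly with
> something like --mca btl_tcp_if_exclude eth0, but since the tcp BTL isn't
> even in my btl list above, I doubted that was the right knob - just noting
> it in case it matters.)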
>
> Matt
>
>
> On Thu, Sep 24, 2015 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Starting in the 1.7 series, OMPI by default launches daemons on all nodes
>> in the allocation during startup. This is done so we can “probe” the
>> topology of the nodes and use that info during the process mapping
>> procedure - e.g., if you want to map-by NUMA regions.
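>> (Just as an illustration of what that enables, something like:
>>
>> mpirun --map-by numa --report-bindings -np 16 ./a.out
>>
>> needs that topology info to lay the processes out and bind them sensibly.)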
>>
>> What is happening here is that some of the nodes in your allocation
>> aren’t allowing those daemons to call back to mpirun. Either a firewall is
>> in the way, or something else is blocking the connection.
>>
>> If you don’t want to launch on those other nodes, you could just add
>> --novm to your cmd line, or use the --host option to restrict us to your
>> local node. However, I imagine you got the bigger allocation so you could
>> use it :-)
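>> (Roughly speaking, either of
>>
>> mpirun --novm -np 2 ./helloWorld.x
>> mpirun --host <the node mpirun is on> -np 2 ./helloWorld.x
>>
>> would avoid touching the other nodes for a 2-process run.)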
>>
>> In which case, you need to remove the obstacle. You might check for a
>> firewall, or check to see whether there are multiple NICs on the non-maia
>> nodes (this can sometimes confuse things, especially if someone put the
>> NICs on the same IP subnet).
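>> (If it helps, a couple of quick things to look at on one of the failing
>> nodes - just a sketch of the sort of checks I mean:
>>
>> ip -4 addr show                                         # how many NICs are up, and on what subnets
>> mpirun --mca oob_base_verbose 10 -np 2 ./helloWorld.x   # more detail from the OOB layer
>>
>> The verbose output should show which addresses the daemons are trying to
>> call back on.)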
>>
>> HTH
>> Ralph
>>
>>
>>
>> On Sep 24, 2015, at 8:18 AM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> Open MPI Users,
>>
>> I'm hoping someone here can help. I built Open MPI 1.10.0 with PGI 15.7
>> using this configure string:
>>
>>  ./configure --disable-vt --with-tm=/PBS --with-verbs
>> --disable-wrapper-rpath \
>>     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 CFLAGS='-fpic -m64' \
>>     CXXFLAGS='-fpic -m64' FCFLAGS='-fpic -m64' FFLAGS='-fpic -m64' \
>>     --prefix=/nobackup/gmao_SIteam/MPI/pgi_15.7-openmpi_1.10.0 |& tee
>> configure.pgi15.7.log
>>
>> It seemed to pass 'make check'.
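>> (I haven't dug much deeper than that. I suppose something like
>>
>> ompi_info | grep -E 'openib|: tm'
>>
>> would confirm the verbs and Torque components actually got built, if that
>> matters.)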
>>
>> I'm working on Pleiades at NAS, which has both Sandy Bridge nodes with
>> GPUs (maia) and regular Sandy Bridge compute nodes (hereafter called
>> Sandy) without GPUs. To be extra careful (since PGI compiles to the
>> architecture you build on), I took a Westmere node and built Open MPI
>> there just in case.
>>
>> So, as I said, all seems to work with a test. I now grab a maia node,
>> maia1, from an allocation of 4 that I had:
>>
>> (102) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>> (103) $ mpirun -np 2 ./helloWorld.x
>> Process 0 of 2 is on maia1
>> Process 1 of 2 is on maia1
>>
>> Good. Now, let's go to a Sandy Bridge (non-GPU) node, r321i7n16, from an
>> allocation of 8 that I had:
>>
>> (49) $ mpicc -tp=px-64 -o helloWorld.x helloWorld.c
>> (50) $ mpirun -np 2 ./helloWorld.x
>> [r323i5n11:13063] [[62995,0],7] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>> [r323i5n6:57417] [[62995,0],2] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n7:67287] [[62995,0],3] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n8:57429] [[62995,0],4] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n10:35329] [[62995,0],6] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>> [r323i5n9:13456] [[62995,0],5] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>>
>> Hmm. Let's try turning off TCP (often my first thought when on an
>> InfiniBand system):
>>
>> (51) $ mpirun --mca btl sm,openib,self -np 2 ./helloWorld.x
>> [r323i5n6:57420] [[62996,0],2] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n9:13459] [[62996,0],5] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n8:57432] [[62996,0],4] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n7:67290] [[62996,0],3] tcp_peer_send_blocking: send() to socket 9
>> failed: Broken pipe (32)
>> [r323i5n11:13066] [[62996,0],7] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>> [r323i5n10:35332] [[62996,0],6] tcp_peer_send_blocking: send() to socket
>> 9 failed: Broken pipe (32)
>>
>> Now, the nodes reporting the issue seem to be the "other" nodes in the
>> allocation, which are in a different rack:
>>
>> (52) $ cat $PBS_NODEFILE | uniq
>> r321i7n16
>> r321i7n17
>> r323i5n6
>> r323i5n7
>> r323i5n8
>> r323i5n9
>> r323i5n10
>> r323i5n11
>>
>> Maybe that's a clue? I didn't think this would matter if I only ran two
>> processes...and it works on the multi-node maia allocation.
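>> (If it would help, I could run something like
>>
>> ssh r323i5n6 ip -4 addr show
>>
>> from r321i7n16 and compare the interfaces/subnets across the two racks -
>> just guessing at the kind of check that might be useful.)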
>>
>> I've tried searching the web, but the only place I've seen
>> tcp_peer_send_blocking mentioned is in a PDF that simply lists it as an
>> error one might encounter:
>>
>>
>> http://www.hpc.mcgill.ca/downloads/checkpointing_workshop/20150326%20-%20McGill%20-%20Checkpointing%20Techniques.pdf
>>
>> Any idea what this error might mean?
>>
>> --
>> Matt Thompson
>>
>> Man Among Men
>> Fulcrum of History
>>
>>
>
>
>
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
>



-- 
Matt Thompson

Man Among Men
Fulcrum of History
