Hello,

I think that your problem is that your nodes are virtual and that network address translation
is not configured.

See my answers below.


On 2012-05-21 19:32, Keith Robison wrote:
Apologies -- haven't had a chance to post these.

NetworkTest1 (seems to hang up)
http://pastebin.com/RpXsSGUR


The last line is:

[actinode04.cluster:24490] btl: tcp: attempting to connect() to address 192.168.122.1 on port 16644

This may indicate a problem with the network.

According to ifconfig (tests 5 and 6), both nodes have the same address, 192.168.122.1, on the
network interface virbr0. virbr0 is usually a virtual bridge interface created by virtualization
software (typically libvirt, which can manage hypervisors such as Xen or KVM).
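
A quick way to check this from the saved ifconfig output is to list the IPv4 addresses that show up on both nodes. This is only a sketch: the helper names list_addrs and shared_addrs are made up for illustration, and it assumes the files written by tests 5 and 6 (NetworkTest5.txt and NetworkTest6.txt) are in the current directory.

```shell
# list_addrs FILE: print the unique IPv4 addresses found in an ifconfig dump.
# Accepts both "inet addr:1.2.3.4" (older net-tools) and "inet 1.2.3.4".
list_addrs() {
    sed -n 's/^[[:space:]]*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' "$1" | sort -u
}

# shared_addrs FILE1 FILE2: print non-loopback addresses present in both dumps.
# Each list is already de-duplicated, so a repeated line means "in both files".
shared_addrs() {
    { list_addrs "$1"; list_addrs "$2"; } | sort | uniq -d | sed '/^127\./d'
}
```

Usage: shared_addrs NetworkTest5.txt NetworkTest6.txt. Any address it prints is configured on both nodes; if the reading above is right, 192.168.122.1 should be among them.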

NetworkTest2 (also seems to hang)
http://pastebin.com/WCx821Ae


The last line is

received unexpected process identifier [[33498,1],25]

I don't know what that means in your context.

NetworkTest3
http://pastebin.com/FL0aCdZX

OK, you don't have InfiniBand.


NetworkTest4
http://pastebin.com/C4wr5iAA

Your main node has 2 network interfaces.


NetworkTest5
http://pastebin.com/BbCmUQP4


You have a virtual network interface: virbr0


NetworkTest6
http://pastebin.com/NE86GmdB



You have a virtual network interface: virbr0


Your two nodes, actinode03 and actinode04, have exactly the same addresses.

http://pastebin.com/BbCmUQP4
http://pastebin.com/NE86GmdB


Is it possible that your nodes are actually virtual nodes and that network address translation is not
configured at all?
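
If the duplicate virbr0 address is what confuses Open MPI's TCP transport, one workaround that may be worth trying is to tell the TCP BTL to ignore that interface with the btl_tcp_if_exclude MCA parameter. This is a sketch, not something tested on this cluster: the helper names exclude_list and print_network_test and the output name NetworkTest7 are made up for illustration. The loopback interface is listed explicitly because setting the parameter replaces its default value.

```shell
# exclude_list FILE: build a comma-separated interface list for
# btl_tcp_if_exclude from an ifconfig dump: "lo" plus every virbr* interface.
exclude_list() {
    awk 'BEGIN { list = "lo" }
         /^virbr/ { name = $1; sub(/:$/, "", name); list = list "," name }
         END { print list }' "$1"
}

# print_network_test FILE: dry run, only prints the mpiexec command line
# so it can be reviewed before anything is launched.
print_network_test() {
    echo mpiexec -n 48 -hostfile hostfile.actinode34 \
        --mca btl self,tcp \
        --mca btl_tcp_if_exclude "$(exclude_list "$1")" \
        /home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
        -test-network-only -o NetworkTest7
}
```

Usage: print_network_test NetworkTest5.txt prints the command; copy and run it if the exclude list looks right. This only hides virbr0 from Open MPI and does not fix the underlying network configuration.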



Running Ray on virtual machines will work if these virtual machines are properly configured.

I have successfully used Ray on Amazon EC2, which is heavily virtualized.



At this point, I think that your local technical support will be more efficient at solving this
network problem.


                       Sébastien



Thanks again for your help on this!

Keith Robison

On Tue, May 15, 2012 at 10:31 AM, Sébastien Boisvert <[email protected]> wrote:

    Hi,


    I think the output of mpiexec -n 48 -hostfile hostfile.actinode34 date
    indicates that processes are launched on the two nodes.

    But there is probably a problem with the way messages are sent
    over your network.

    Can you try these network tests, paste the results on http://pastebin.com/
    (one paste per test) and link them in your reply?

    Each of these tests will take a few seconds to run.


    Test 1

    mpiexec -n 48 -hostfile hostfile.actinode34 \
    --mca mca_verbose 9999999 \
    --mca btl_base_verbose 9999999 \
    --mca btl_openib_verbose 9999999 \
    /home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
    -test-network-only -o NetworkTest1 &> NetworkTest1.txt


    Test 2

    mpiexec -n 48  -hostfile hostfile.actinode34 \
    --mca mca_verbose 9999999 \
    --mca btl_base_verbose 9999999 \
    --mca btl_openib_verbose 9999999 \
    --mca btl self,tcp \
    /home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
    -test-network-only -o NetworkTest2 &> NetworkTest2.txt


    Test 3

    mpiexec -n 48  -hostfile hostfile.actinode34 \
    --mca mca_verbose 9999999 \
    --mca btl_base_verbose 9999999 \
    --mca btl_openib_verbose 9999999 \
    --mca btl self,openib \
    /home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
    -test-network-only -o NetworkTest3 &> NetworkTest3.txt


    Test 4


    /sbin/ifconfig -a &> NetworkTest4.txt


    Test 5

    ssh actinode03 /sbin/ifconfig -a &> NetworkTest5.txt


    Test 6

    ssh actinode04 /sbin/ifconfig -a &> NetworkTest6.txt



                     Sébastien



    On 2012-05-15 09:40, Keith Robison wrote:
    Apologies for replying to my own message; something is amiss with
    my subscriptions & I didn't see Sebastien's helpful reply [I went
    to the archives to get it]

    Sebastien: On which machine are you when launching mpirun/mpiexec ?

    I'm launching the jobs from the head node (actinode)

    Sebastien suggested I try pinging one node from another, which
    failed -- so that is a clue:

    ssh actinode03 ping actinode04
    ping: icmp open socket: Operation not permitted
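
The "icmp open socket: Operation not permitted" message means that ping was not allowed to open a raw ICMP socket for this user (typically the binary is missing its setuid bit or the equivalent capability), so by itself it says nothing about whether actinode04 is reachable. A TCP connection attempt needs no special privilege. The following is a sketch under assumptions: tcp_reachable is a made-up helper name, bash and the coreutils timeout command are assumed to be installed on the node, and port 22 is assumed to be where sshd listens.

```shell
# tcp_reachable HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no ICMP privileges are required.
tcp_reachable() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Usage, after defining the function in a shell on actinode03: tcp_reachable actinode04 22 && echo reachable || echo unreachable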

    He also suggested I try
     mpiexec -n 48 -hostfile hostfile.actinode34 date

    Which prints out the date 48 times -- so that works

    Sebastien also suggested I run

    ompi_info -a

    Which gives a lot of output


    Thanks for being so helpful!  I'm feeling like I don't even know
    the right questions to ask, so getting any direction is really a
    boost.

    Keith R.



    On Sat, May 12, 2012 at 5:58 PM, Keith Robison <[email protected]> wrote:

        Hello!  I've run into a roadblock.

        If I run the following command in the background, the
        assembler seems to stall, with the last output being the
        citation for the assembler

        mpirun -hostfile hostfile.actinode34 -np 48 -stdin /dev/null \
        /home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
        -i part.8.fasta -o ray.part.8.actinode34.c \
        1> ray.part.8.actinode34.c.out 2> ray.part.8.actinode34.c.err

        Where hostfile.actinode34 reads:

        actinode03 slots=24
        actinode04 slots=24


        If I instead run with a hostfile that lists only one host (either
        one of them) and -np 24, but otherwise the same command line,
        the assembler seems to be off and running.

        My .bashrc has

        export PATH=$PATH:/act/openmpi/gnu/bin
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/act/openmpi/gnu/lib

        (the cluster vendor put the code in /act)
        Any suggestions for what might be triggering this behavior?




    
------------------------------------------------------------------------------
    Live Security Virtual Conference
    Exclusive live event will cover all the ways today's security and
    threat landscape has changed and how IT managers can respond.
    Discussions
    will include endpoint security, mobile security and the latest in
    malware
    threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
    _______________________________________________
    Denovoassembler-users mailing list
    [email protected]
    <mailto:[email protected]>
    https://lists.sourceforge.net/lists/listinfo/denovoassembler-users


