Hello,
I think the problem is that your nodes are virtual and that network
address translation is not configured.
See my answers below.
On 2012-05-21 19:32, Keith Robison wrote:
Apologies -- haven't had a chance to post these.
NetworkTest1 (seems to hang up)
http://pastebin.com/RpXsSGUR
The last line is:
[actinode04.cluster:24490] btl: tcp: attempting to connect() to address
192.168.122.1 on port 16644
This may indicate a problem with the network.
According to ifconfig (tests 5 and 6), both nodes have the same address,
192.168.122.1, for the network interface virbr0. virbr0 is usually a
virtual bridge interface created by libvirt for a hypervisor such as
KVM or Xen.
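If the virtual network cannot be reconfigured right away, it may be worth trying to keep Open MPI's TCP transport off virbr0 with the btl_tcp_if_exclude MCA parameter. The sketch below only prints a variant of test 2 so that it can be reviewed before being run by hand. The function name and the NetworkTest7 output name are made up for illustration, and it assumes that lo and virbr0 are the only interfaces to avoid.

```shell
# Hypothetical helper: print (do not run) a variant of test 2 that keeps
# Open MPI's TCP transport away from the interfaces listed in $1.
# "lo" is listed explicitly because setting btl_tcp_if_exclude replaces
# Open MPI's default exclusion list, which covers the loopback interface.
print_test_command() {
    excluded_ifaces=$1
    echo "mpiexec -n 48 -hostfile hostfile.actinode34" \
         "--mca btl self,tcp" \
         "--mca btl_tcp_if_exclude $excluded_ifaces" \
         "/home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray" \
         "-test-network-only -o NetworkTest7"
}

print_test_command lo,virbr0
```

Whether this helps depends on the nodes being able to reach each other through their other interfaces.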
NetworkTest2 (also seems to hang)
http://pastebin.com/WCx821Ae
The last line is
received unexpected process identifier [[33498,1],25]
I don't know what that means in your context.
NetworkTest3
http://pastebin.com/FL0aCdZX
OK, you don't have InfiniBand.
NetworkTest4
http://pastebin.com/C4wr5iAA
Your main node has 2 network interfaces.
NetworkTest5
http://pastebin.com/BbCmUQP4
You have a virtual network interface: virbr0
NetworkTest6
http://pastebin.com/NE86GmdB
You have a virtual network interface: virbr0
Your two nodes, actinode03 and actinode04 have exactly the same addresses.
http://pastebin.com/BbCmUQP4
http://pastebin.com/NE86GmdB
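The same comparison can be done mechanically on the saved ifconfig outputs. This is a minimal sketch, assuming that NetworkTest5.txt and NetworkTest6.txt from tests 5 and 6 are in the current directory; the function names are made up for illustration.

```shell
# Hypothetical helpers for comparing two saved "ifconfig -a" outputs.

# Print the unique non-loopback IPv4 addresses found in one ifconfig dump.
# Handles both "inet addr:192.168.122.1" and "inet 192.168.122.1" layouts.
list_inet() {
    sed -n 's/^[[:space:]]*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' "$1" \
        | grep -v '^127\.' \
        | sort -u
}

# Print the addresses that appear in both dumps. Each list is already
# de-duplicated, so any repeated line in the merged list is shared.
shared_addresses() {
    { list_inet "$1"; list_inet "$2"; } | sort | uniq -d
}

# Example call:
#   shared_addresses NetworkTest5.txt NetworkTest6.txt
```

Any address printed is configured on both nodes, so it cannot be used to reach one node from the other.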
Is it possible that your nodes are actually virtual nodes and that
network address translation is not configured at all?
Running Ray on virtual machines will work if these virtual machines are
properly configured.
I have successfully used Ray on Amazon EC2, which is heavily virtualized.
At this point, I think that your local technical support will be more
efficient at solving this network problem.
Sébastien
Thanks again for your help on this!
Keith Robison
On Tue, May 15, 2012 at 10:31 AM, Sébastien Boisvert
<[email protected]> wrote:
Hi,
I think the result of "mpiexec -n 48 -hostfile hostfile.actinode34 date"
indicates that processes are launched on the two nodes.
But there is probably a problem with the way messages are sent over
your network.
Can you try these network tests, paste the results on http://pastebin.com/
(one paste per test), and link to them in your reply?
Each of these tests will take a few seconds to run.
Test 1
mpiexec -n 48 -hostfile hostfile.actinode34 \
--mca mca_verbose 9999999 \
--mca btl_base_verbose 9999999 \
--mca btl_openib_verbose 9999999 \
/home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
-test-network-only -o NetworkTest1 &> NetworkTest1.txt
Test 2
mpiexec -n 48 -hostfile hostfile.actinode34 \
--mca mca_verbose 9999999 \
--mca btl_base_verbose 9999999 \
--mca btl_openib_verbose 9999999 \
--mca btl self,tcp \
/home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
-test-network-only -o NetworkTest2 &> NetworkTest2.txt
Test 3
mpiexec -n 48 -hostfile hostfile.actinode34 \
--mca mca_verbose 9999999 \
--mca btl_base_verbose 9999999 \
--mca btl_openib_verbose 9999999 \
--mca btl self,openib \
/home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
-test-network-only -o NetworkTest3 &> NetworkTest3.txt
Test 4
/sbin/ifconfig -a &> NetworkTest4.txt
Test 5
ssh actinode03 /sbin/ifconfig -a &> NetworkTest5.txt
Test 6
ssh actinode04 /sbin/ifconfig -a &> NetworkTest6.txt
Sébastien
On 2012-05-15 09:40, Keith Robison wrote:
Apologies for replying to my own message; something is amiss with
my subscriptions & I didn't see Sebastien's helpful reply [I went
to the archives to get it]
Sebastien: On which machine are you when launching mpirun/mpiexec?
I'm launching the jobs from the head node (actinode)
Sebastien suggested I try pinging one node from another, which
failed -- so that is a clue:
ssh actinode03 ping actinode04
ping: icmp open socket: Operation not permitted
He also suggested I try
mpiexec -n 48 -hostfile hostfile.actinode34 date
Which prints out the date 48 times -- so that works
Sebastien also suggested I run
ompi_info -a
Which gives a lot of output
Thanks for being so helpful! I'm feeling like I don't even know
the right questions to ask, so getting any direction is really a
boost.
Keith R.
On Sat, May 12, 2012 at 5:58 PM, Keith Robison
<[email protected]> wrote:
Hello! I've run into a roadblock.
If I run the following command in the background, the
assembler seems to stall, with the last output being the
citation for the assembler
mpirun -hostfile hostfile.actinode34 -np 48 -stdin /dev/null \
/home/krobison/packages/Ray-v2.0-ReleaseCandidate5/Ray \
-i part.8.fasta -o ray.part.8.actinode34.c \
1> ray.part.8.actinode34.c.out 2> ray.part.8.actinode34.c.err
Where hostfile.actinode34 reads:
actinode03 slots=24
actinode04 slots=24
If I instead run with a hostfile containing only one host (either
one of them) and -np 24, but otherwise the same command line,
the assembler seems to be off and running.
My .bashrc has
export PATH=$PATH:/act/openmpi/gnu/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/act/openmpi/gnu/lib
(the cluster vendor put the code in /act)
Any suggestions for what might be triggering this behavior?
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond.
Discussions
will include endpoint security, mobile security and the latest in
malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Denovoassembler-users mailing list
[email protected]
<mailto:[email protected]>
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users