Joe Landman wrote:
3) using btl to turn off sm and openib, generates lots of these messages:
[c1-8][0,1,4][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[...]
No route to host at -e line 1.
This is wrong, all the nodes are visible from all the other nodes on a
private subnet. For example:
OK, fixed this. It turns out we have IPoIB running, and one adapter needed
to be brought down and back up. Now the TCP version appears to be
running, though I do get the strange hangs after a random (never the
same) number of iterations.
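
For anyone hitting the same message: on Linux, errno=113 is EHOSTUNREACH ("No route to host"), which points at routing or interface state rather than at the MPI layer. Below is a minimal sketch, assuming plain Python with only the standard library, for decoding an errno value and for checking whether a TCP connect to a given address succeeds. The helper names describe_errno and try_connect are hypothetical, made up for this illustration, and are not part of Open MPI.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno numbering is platform specific; 113 is EHOSTUNREACH on Linux.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def try_connect(host, port, timeout=2.0):
    """Attempt a TCP connect; return 0 on success, else an errno value.

    Pass a numeric IPv4 address to avoid name-resolution errors, which
    connect_ex() raises as exceptions instead of returning a code.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        return sock.connect_ex((host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    name, message = describe_errno(errno.EHOSTUNREACH)
    print(name, message)
```

Running try_connect from each node against the address of each interface on the other nodes (Ethernet and IPoIB separately) should show which interface is returning the error.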
The hangs are random and don't appear to happen at the same time step,
but they do occur at a similar place in the code. That suggests to me that
something may be amiss in the MPI_Waitsome function. Possibly a completion
was posted and, due to buffer sizes, fell off the scoreboard.
Any thoughts?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615