Joe Landman wrote:


3) Using btl to turn off sm and openib generates lots of these messages:

[c1-8][0,1,4][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

[...]

No route to host at -e line 1.

This is wrong: all the nodes are visible from all the other nodes on a private subnet. For example:

OK, fixed this. It turns out we have IPoIB running, and one adapter needed to be brought down and back up. Now the TCP version appears to be running, though I do get the strange hangs after a random (never the same) number of iterations.
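For context on the message quoted above: on most Linux systems errno 113 is EHOSTUNREACH, whose text is "No route to host", which matches the output. The value is platform-dependent, so the following is only a minimal sketch for checking it on a given node. The names describe_errno and print_host_unreachable are hypothetical helpers, not part of Open MPI.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the C library's text for an errno value. */
const char *describe_errno(int err)
{
    return strerror(err);
}

/* Print the numeric value and text of EHOSTUNREACH on this platform.
 * On most Linux systems the number is 113, but that is not guaranteed. */
void print_host_unreachable(void)
{
    printf("EHOSTUNREACH = %d: %s\n",
           EHOSTUNREACH, describe_errno(EHOSTUNREACH));
}
```

Calling print_host_unreachable() on a compute node shows whether the errno in the btl_tcp_endpoint.c message is the same "No route to host" condition.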

The hangs are random: they don't happen at the same time step, but they do occur at a similar place in the code. That suggests to me that something may be amiss in the MPI_Waitsome function. Possibly a completion was posted and, due to buffer sizes, fell off the scoreboard.

Any thoughts?

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: land...@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
