On Aug 12, 2009, at 19:09 PM, Ralph Castain wrote:

Hmmm...well, I'm going to ask our TCP friends for some help here.

Meantime, I do see one thing that stands out. Port 4 is an awfully low port number that usually sits in the reserved range. I checked the /etc/services file on my Mac, and it was commented out as unassigned, which should mean it was okay.

Still, that is an unusual number. The default minimum port number is 1024, so I'm puzzled how you wound up down there. Of course, it could just be an error in the print statement, but let's move it to be safe. Set

-mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32
and see what happens.
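For readers unfamiliar with these two parameters, here is a rough Python sketch of the port arithmetic. It assumes btl_tcp_port_range_v4 counts ports upward from btl_tcp_port_min_v4, and that ports below 1024 are the traditionally privileged range that normally requires root to bind on Unix-like systems. port_window and is_privileged are hypothetical helper names for illustration, not part of Open MPI.

```python
# Sketch only: assumes btl_tcp_port_range_v4 is a count of ports
# starting at btl_tcp_port_min_v4 (hypothetical helpers, not Open MPI code).
PRIVILEGED_LIMIT = 1024  # ports below this traditionally need root to bind


def port_window(port_min: int, port_range: int) -> range:
    """Return the candidate ports [port_min, port_min + port_range)."""
    if port_min < 1 or port_range < 1 or port_min + port_range - 1 > 65535:
        raise ValueError("port window must lie within 1..65535")
    return range(port_min, port_min + port_range)


def is_privileged(port: int) -> bool:
    """True if the port falls in the traditionally reserved range."""
    return port < PRIVILEGED_LIMIT


if __name__ == "__main__":
    window = port_window(36900, 32)
    print(window[0], window[-1])                    # 36900 36931
    print(any(is_privileged(p) for p in window))    # False
    print(is_privileged(4))                         # True
```

With the settings suggested above, the candidate ports would be 36900 through 36931, well clear of the reserved range that port 4 falls into.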

What happens is that everything works now! Both connectivity_c and the MITgcm. I haven't tried under Torque yet, but let's declare an Open MPI victory at this point.

On Aug 13, 2009, at 8:28 AM, Jeff Squyres wrote:

Agreed -- ports 4 and 260 should be in the reserved ports range. Are you running as root, perchance?

Errrr, no, but yes. My user account has admin privileges, a sloppy workstation OS X habit I now regret propagating to my cluster. I'm sorry I didn't mention it earlier as possibly relevant.

As a suggestion, btl_base_verbose could be mentioned as a good debugging tool in the troubleshooting section of the FAQ. It's on the TCP page, which I admit I should have read as soon as I realized there was a communication issue, but having it in the troubleshooting section would be helpful too. For example, something like a more erudite version of:

Checking connections between nodes:

Sometimes the configuration of a cluster makes it impossible for nodes to communicate properly. To debug this, it helps to add --mca btl_base_verbose 30 to the command line (see http://www.open-mpi.org/faq/?category=tcp for more information). The program examples/connectivity_c.c is also a useful minimal program for testing communication across the cluster.
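To make concrete what an all-pairs connectivity test has to cover, here is a small Python sketch that enumerates the rank pairs. It assumes, as the name suggests, that connectivity_c exchanges a message between every pair of processes; connectivity_pairs is a hypothetical helper, not code from the Open MPI examples directory.

```python
from itertools import combinations


def connectivity_pairs(num_ranks: int) -> list[tuple[int, int]]:
    """List every unordered rank pair (i, j), i < j, that an
    all-pairs test would need to exercise (hypothetical helper)."""
    return list(combinations(range(num_ranks), 2))


if __name__ == "__main__":
    pairs = connectivity_pairs(4)
    print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(len(pairs))  # 6, i.e. 4 * 3 // 2
```

The number of links grows as N(N-1)/2: 4 ranks give 6 pairs, 10 ranks give 45, and a bad port or firewall setting on any one of them is enough to hang a real application.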

Thanks again for everyone's help, particularly Ralph, Jeff and Gus.

Cheers,  Jody
