Hello John--

> I am experiencing some difficulties in my attempts to distribute runs
> over multiple machines using the '-parallel -machines machines.list'
> approach. In brief, the approach works as expected if the entries in
> the machines.list file are all the same machine (even if that machine
> is not the machine on which the run is launched), but not if more than
> one machine appears in the machines.list file. I have tried a bit to
> understand where things are going wrong, and there are some clues
> (please see more detailed description below), but thought I would see
> if anyone could point me in the right direction before I continue to
> poke around. I don't think the problem is related to our particular
> machine set-up, but am not totally sure.

This behavior likely indicates that the processes running remotely from the master (process 0) are not able to establish a connection back to it, which could well be due to firewall rules. In your case, ssh will be used to start the jobs, but immediately after start, all processes try to communicate back to process 0, and this network activity is unencrypted, over a high port.

If you look at a log file from one of the failed processes, you will probably see lines like:

    connect attempt #0 to remoteHost:10017...
    connection fault: timed out

This indicates that a connection is attempted using port 10017, but it is likely blocked. This port can be specified using the -pport option or the XPLOR_PPORT environment variable, and it must be opened in order for this sort of parallel calculation to work.

best regards--
Charles
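
P.S. If it helps to test the firewall theory independently of Xplor-NIH, a small probe along the following lines could be run from one of the remote machines. This is only a rough sketch: the function name `probe_port` is made up for illustration, and the host and port in the commented call are simply the ones from the example log lines above. It assumes something is listening on that port on the process 0 machine at the time of the test. A "timeout" result would be consistent with a firewall silently dropping packets, while "refused" would suggest the port is reachable but nothing was listening.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a plain TCP connection and classify the outcome.

    Returns one of "open", "timeout", "refused", or "error".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typical when a firewall drops the packets
        return "timeout"
    except ConnectionRefusedError:
        # Host answered, but no process is listening on that port
        return "refused"
    except OSError:
        # Anything else, e.g. name resolution failure or unreachable network
        return "error"


# Example call, using the placeholder host and port from the log excerpt:
# print(probe_port("remoteHost", 10017))
```

Running it once from each machine in machines.list against the machine that hosts process 0 should show quickly whether only the cross-machine connections fail.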
