Dear all,
I am experiencing some difficulties in my attempts to distribute runs over multiple machines using
the '-parallel -machines machines.list' approach. In brief, the approach works as expected if the
entries in the machines.list file are all the same machine (even if that machine is not the machine
on which the run is launched), but not if more than one machine appears in the machines.list file. I
have tried a bit to understand where things are going wrong, and there are some clues (please see
more detailed description below), but thought I would see if anyone could point me in the right
direction before I continue to poke around. I don't think the problem is related to our particular
machine set-up, but am not totally sure.
Any suggestions would be much appreciated.
Thanks very much,
John
----------------------
For the test runs, the protocol was set to calculate an ensemble of 8 structures.
1. A simple '-smp' run on the local machine with 4 CPUs works as expected.
Command:
xplor -py -smp 4 anneal.py
2. The same effective run but using the '-parallel -machines machines.list' approach also works.
Command:
xplor -py -parallel -machines machines.list anneal.py
This works with a machines.list file with four 'localhost' entries and also with a machines.list
file with four '<local-machine-hostname>' entries.
3. A run launched on the local machine but with the machines.list file containing four entries all
specifying the hostname of a remote machine (remote machine #1) also works as expected; the xplor
processes for the structure-calculation jobs all run on remote machine #1.
4. A run launched on the local machine with a machines.list file whose first two entries are the
hostname of the local machine and second two entries are the hostname of remote machine #1 does not
work as expected. The structure-calculation jobs on remote machine #1 do not run, and in the end
all the 8 structures are calculated on the local machine. If the order of the entries in the
machines.list file is reversed (so that the two entries for remote machine #1 come first), then the
behaviour is also reversed: the structure-calculation jobs on the local machine fail to run, and all
the structures end up being calculated on remote machine #1.
5. A similar behaviour is observed if the machines.list file contains two entries for remote machine
#1 and two entries for remote machine #2: the jobs on remote machine #2 do not run, and all
structures are calculated on remote machine #1.
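For reference, the machines.list variants used in cases 2 to 5 can be written out with a short
sketch like the one below. It assumes the file simply holds one hostname per line, with a host
repeated once for each process that should run there; the helper name and the hostnames
'local-host' and 'remote1' are placeholders of mine, not anything defined by Xplor-NIH.

```python
def machines_list_text(slots):
    """Build machines.list content from (hostname, count) pairs.

    Assumption: one hostname per line, repeated once per process to run there.
    """
    lines = []
    for host, count in slots:
        lines.extend([host] * count)
    return "\n".join(lines) + "\n"


# Placeholder hostnames for the mixed case (case 4): two local entries, then two remote.
case4 = machines_list_text([("local-host", 2), ("remote1", 2)])
print(case4, end="")
```

Reversing the order of the pairs gives the reversed file described in case 4.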
There do not seem to be any problems with the SSH connectivity between the different machines. As
far as I can see, the xplor processes on the second machine in the machines.list file fail to
establish the TCP connections that should allow them to communicate with the parent process. These
connections are attempted between two different machines (the parent process always seems to run on
the first machine in the machines.list file, even if that is not the machine on which the run was
launched), and they go directly between two arbitrary ports rather than being tunnelled over SSH, so
presumably they get blocked by the firewall.
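One way to test the firewall hypothesis independently of xplor would be a plain TCP check between
the two machines. The following is only a minimal sketch using the Python standard library; the
function names are my own, and the port would have to be chosen by hand, since I do not know which
ports xplor actually picks.

```python
import socket


def open_listener(port=0, bind_addr="0.0.0.0"):
    """Listen on a TCP port (0 lets the OS choose one) and return the socket."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable or unresolved hosts.
        return False
```

The idea would be to call open_listener() on the first machine in machines.list, read the port from
srv.getsockname()[1], and then call can_connect() with that hostname and port from the second
machine. If the same check succeeds between two processes on one machine but fails across machines
while SSH still works, that would point to the firewall rather than to xplor itself.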
If more information would be useful, I have the xplor log-files and also logs of continually
updating 'ps' and 'netstat' commands recorded during these runs.
Thanks,
John