Ralph Castain wrote:
> Hmmm....well, the problem is as I suspected. The system doesn't see any
> allocation of nodes to your job, and so it aborts with a crummy error
> message that doesn't really tell you the problem. We are working on
> improving them.
> How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain
> info on the nodes that are to be used?
BEOWULF_JOB_MAP is a colon-separated list of integers that holds the
node mapping: one entry per process, giving the node that process runs
on. The easiest way to explain it is with an example:
BEOWULF_JOB_MAP=0:0
This is a two process job, with each process running on node 0.
BEOWULF_JOB_MAP=0:1:1
A three process job with the first process on node 0, and the next two
on node 1.
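
To make the format concrete, here is a minimal Python sketch of how such
a value could be decoded. It is not taken from Scyld or Open MPI; the
function names parse_job_map and procs_per_node are made up for
illustration, and the only thing assumed about the format is what the two
examples above show (entry i is the node for process i).

```python
from collections import Counter


def parse_job_map(value):
    """Return the node number for each process, in process order.

    Assumes the colon-separated integer format shown in the examples,
    e.g. "0:1:1" -> [0, 1, 1].
    """
    return [int(field) for field in value.split(":")]


def procs_per_node(value):
    """Count how many processes the map places on each node."""
    return dict(Counter(parse_job_map(value)))


if __name__ == "__main__":
    for job_map in ("0:0", "0:1:1"):
        for rank, node in enumerate(parse_job_map(job_map)):
            print(f"BEOWULF_JOB_MAP={job_map}: process {rank} -> node {node}")
```

The number of entries doubles as the process count, so a launcher would
not need a separate -np value if it trusted the map.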
All said, this is of little consequence right now, and we/I can worry
about adding support for this later.
> One of the biggest headaches with bproc is that there is no adhered-to
> standard for describing the node allocation. What we implemented will
> support LSF+Bproc (since that is what was being used here) and BJS. It
> sounds like you are using something different - true?
Understood. We aren't using BJS, and have long deprecated it in favor
of bundling TORQUE with Scyld instead, though legacy support for envars
such as NP, NO_LOCAL, and BEOWULF_JOB_MAP is still present in the MPICH
extensions we've put together.
> If so, we can work around it by just mapping enviro variables to what the
> system is seeking. Or, IIRC, we could use the hostfile option (have to check
> on that one).
Exactly, but for now, if I make sure the NODES envar is set up
correctly, make sure Open MPI is NFS mounted, and copy out the mca
libraries by hand (libcache doesn't seem to work), I actually end up
with something running!
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.005377
Process 1 on n0
Hangup
It seems the -H option and a hostfile aren't honored with BProc,
correct? So the only thing I can use to derive the host mapping with
BProc support is the BJS RAS MCA component (via the NODES envar)?
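
If NODES really is the only way in, a stopgap would be to translate the
existing map into it before calling mpirun. The Python sketch below is
only that: it ASSUMES NODES is a comma-separated list of node numbers
with each node listed once, which I have not confirmed against what the
BJS RAS component actually parses, and job_map_to_nodes is a made-up
helper name.

```python
import os


def job_map_to_nodes(job_map):
    """Turn "0:1:1" into "0,1".

    ASSUMPTION: NODES is a comma-separated list of distinct node
    numbers. Nodes are kept in order of first appearance in the map.
    """
    seen = []
    for field in job_map.split(":"):
        node = int(field)
        if node not in seen:
            seen.append(node)
    return ",".join(str(node) for node in seen)


if __name__ == "__main__":
    job_map = os.environ.get("BEOWULF_JOB_MAP")
    if job_map:
        # Print the value so a wrapper script can export it before mpirun.
        print(job_map_to_nodes(job_map))
```

Note that this drops the per-process placement, so the process count
would still have to be passed with -np.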
-Josh