Ralph Castain wrote:
Hmmm....well, the problem is as I suspected. The system doesn't see any
allocation of nodes to your job, and so it aborts with a crummy error
message that doesn't really tell you the problem. We are working on
improving them.

How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain
info on the nodes that are to be used?

BEOWULF_JOB_MAP is a colon-separated array of integers that contains node mapping information. The easiest way to explain it is with examples:


This is a two process job, with each process running on node 0.


A three process job with the first process on node 0, and the next two on node 1.
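
For anyone who wants to read the variable programmatically, here is a minimal Python sketch. It assumes only what is described above: position in the list is the process rank, and the integer at that position is the node number. The helper names parse_job_map and procs_per_node are hypothetical, and the sample strings "0:0" and "0:1:1" in the comments are my own reading of the two cases described, not output copied from a real system.

```python
import os
from collections import Counter


def parse_job_map(value):
    """Turn a colon-separated map into a list: index = rank, entry = node.

    Assumed format, e.g. "0:1:1" -> [0, 1, 1]. An empty or unset
    value yields an empty list.
    """
    if not value:
        return []
    return [int(field) for field in value.split(":")]


def procs_per_node(job_map):
    """Count how many processes land on each node."""
    return dict(Counter(job_map))


if __name__ == "__main__":
    # Read the variable from the environment; nothing is printed if unset.
    job_map = parse_job_map(os.environ.get("BEOWULF_JOB_MAP", ""))
    for rank, node in enumerate(job_map):
        print(f"process {rank} -> node {node}")
```

Under that assumption, "0:0" would parse to two processes on node 0, and "0:1:1" to one process on node 0 and two on node 1.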

All that said, this is of little consequence right now, and we/I can worry about adding support for this later.

One of the biggest headaches with bproc is that there is no adhered-to
standard for describing the node allocation. What we implemented will
support LSF+Bproc (since that is what was being used here) and BJS. It
sounds like you are using something different - true?

Understood. We aren't using BJS; we deprecated it long ago in favor of bundling TORQUE with Scyld. Legacy support for envars such as NP, NO_LOCAL, and BEOWULF_JOB_MAP is still present in the MPICH extensions we've put together.

If so, we can work around it by just mapping enviro variables to what the
system is seeking. Or, IIRC, we could use the hostfile option (have to check
on that one).

Exactly. For now, though, if I make sure the NODES envar is set up correctly, make sure the OpenMPI install is NFS mounted, and copy out the MCA libraries by hand (libcache doesn't seem to work), I actually end up with something running!

[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.005377
Process 1 on n0

It seems the -H option and hostfiles aren't honored with BProc, correct? So the only thing I can use to derive the host mapping with BProc support is the BJS RAS MCA component (via the NODES envar)?

