Ralph Castain wrote:
Had a chance to think about how this might be done, and looked at it for a while after getting home. I -think- I found a way to do it
Okay, terrific.  Here is my high-level point of view:

*) From a usability point of view, I think having to specify both rankfile and hostfile is awkward.  From a user's point of view, saying "bind this process to that core, this one to that, etc." is complete information.  It's eyebrow-raising to have to specify a subset of this information redundantly in a different file.

*) Plus, the error message one gets when one uses only a rankfile is rather confusing.

*) But fixing all this within the current OMPI framework can be difficult and therefore might have a lower priority than other pressing issues.

*) Plus, even if we were to start from scratch, the "big picture" about how we want to approach all this remains fuzzy (to me).  If all users wanted the same thing, we could provide that.  Realistically, users will want millions of variations of somewhat related functionality.

So, I'll settle for an improved failure mode:  a better error message for the user who specifies rankfile without independent allocation information.  Whatever you can do to accommodate such a user better would be icing on the cake.
but there are a couple of caveats:

1. Len's point about oversubscribing without warning would definitely hold true - this would positively be a "user beware" option
I'm okay with that.  A rankfile gives rather specific information.  Give them what they ask for!  Having to maintain information in multiple places is hardly an elegant safeguard.
On Fri, Jun 19, 2009 at 2:21 PM, Eugene Loh <eugene....@sun.com> wrote:

   % cat rankfile
   rank 0=node0 slot=0
   rank 1=node1 slot=0
   % mpirun -np 2 -rf rankfile ./a.out
   --------------------------------------------------------------------------
   Rankfile claimed host node1 that was not allocated or
   oversubscribed it's slots:

   --------------------------------------------------------------------------
   [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
   rmaps_rank_file.c at line 107
   [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
   base/rmaps_base_map_job.c at line 86
   [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
   base/plm_base_launch_support.c at line 86
   [node0:14611] [[61560,0],0] ORTE_ERROR_LOG: Bad parameter in file
   plm_rsh_module.c at line 1016
   % mpirun -np 2 -host node0,node1 -rf rankfile ./a.out
   0 on node0
   1 on node1
   done
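
Until the rankfile alone is enough, the redundant -host list in the second command above can be derived from the rankfile itself rather than maintained by hand. The following is only a rough sketch of that workaround, not anything that exists in OMPI: the helper names hosts_from_rankfile and mpirun_command are made up for illustration, and the parser assumes only the simple "rank N=host slot=S" line form shown in the transcript.

```python
import re

# Matches lines such as "rank 0=node0 slot=0" (assumed format, taken from the
# transcript above). The slot text is matched but not interpreted.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)\s*$")


def hosts_from_rankfile(text):
    """Return the distinct hosts named in rankfile text, in first-seen order."""
    hosts = []
    for line in text.splitlines():
        m = RANK_LINE.match(line)
        if m and m.group(2) not in hosts:
            hosts.append(m.group(2))
    return hosts


def mpirun_command(text, rankfile="rankfile", program="./a.out"):
    """Build an mpirun argument list that repeats the rankfile's hosts in -host."""
    hosts = hosts_from_rankfile(text)
    # One process per recognized "rank" line.
    nranks = sum(1 for line in text.splitlines() if RANK_LINE.match(line))
    return ["mpirun", "-np", str(nranks), "-host", ",".join(hosts),
            "-rf", rankfile, program]


example = "rank 0=node0 slot=0\nrank 1=node1 slot=0\n"
print(" ".join(mpirun_command(example)))
```

For the two-line rankfile shown earlier, this prints the same command line as the second mpirun invocation in the transcript. It does nothing about oversubscription, so the "user beware" caveat still applies.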
