On Feb 24, 2011, at 2:59 PM, Henderson, Brent wrote:

> [snip]
> They really can't be all SLURM_PROCID=0 - that is supposed to be unique for 
> the job - right?  It appears that the SLURM_PROCID is inherited from the 
> orted parent - which makes a fair amount of sense given how things are 
> launched.  

That's correct, and I can agree with your sentiment.  

However, our design goals were to provide a consistent *Open MPI* experience 
across different launchers. Providing native access to the actual underlying 
launcher was a secondary goal.  Balancing those two, you can see why we chose 
the model we did: our orted provides (nearly) the same functionality across 
all environments.  

In SLURM's case, we propagate [seemingly] nonsensical SLURM_PROCID values to 
the individual processes, but they only look nonsensical if you are making an 
assumption about how Open MPI is using SLURM's launcher.

More specifically, our goal is to provide consistent *Open MPI information* 
(e.g., through the OMPI_COMM_WORLD* env variables) -- not emulate what SLURM 
would have done if MPI processes had been launched individually through srun.  
Even more specifically: we don't think that the exact underlying launching 
mechanism that OMPI uses is of interest to most users; we encourage them to use 
our portable mechanisms that work even if they move to another cluster with a 
different launcher.  Admittedly, that does make it a little more challenging if 
you have to support multiple MPI implementations, and although that's an 
important consideration to us, it's not our first priority.
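
As a rough illustration of the portable approach (a sketch only, not code 
taken from Open MPI), a process can read its rank and job size from the 
OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE env variables instead of 
SLURM_PROCID.  The helper name ompi_rank_and_size is hypothetical, and the 
sketch assumes the process was started by Open MPI's mpirun:

```python
import os


def ompi_rank_and_size(env=os.environ):
    """Return (rank, size) from Open MPI's env variables, or None if unset.

    Hypothetical helper for illustration. It deliberately ignores
    SLURM_PROCID, which is inherited from the orted and so is not
    unique per MPI process.
    """
    rank = env.get("OMPI_COMM_WORLD_RANK")
    size = env.get("OMPI_COMM_WORLD_SIZE")
    if rank is None or size is None:
        return None  # probably not launched by Open MPI's mpirun
    return int(rank), int(size)


if __name__ == "__main__":
    info = ompi_rank_and_size()
    if info is None:
        print("Open MPI env variables not found")
    else:
        print("rank %d of %d" % info)
```

Inside an MPI application itself, calling MPI_Comm_rank and MPI_Comm_size 
is the more direct route; the env variables are mainly useful in wrapper 
scripts that run before MPI_Init.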

> Now to answer the other question - why are there some variables missing.  It 
> appears that when the orted processes are launched - via srun but only one 
> per node, it is a subset of the main allocation and thus some of the 
> environment variables are not the same (or missing entirely) as compared to 
> launching them directly with srun on the full allocation.  This also makes 
> sense to me at some level, so I'm at peace with it now.  :)

Ah, good.

> Last thing before I go.  Please let me apologize for not being clear on what 
> I disagreed with Ralph about in my last note.  Clearly he nailed the orted 
> launching process and spelled it out very clearly, but I don't believe that 
> HP-MPI is not doing anything special to copy/fix up the SLURM environment 
> variables.  Hopefully that was clear by the body of that message.  

No worries; you were perfectly clear.  Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

