I'm afraid I don't know how to advise you on this - you may need to talk to the 
Slurm folks. When you start your application with mpirun, we use "srun" to 
start our own daemons on the job's nodes. The application processes, however, 
are subsequently started by those daemons using our own infrastructure - i.e., 
Slurm is not involved in starting the application itself. This has always been 
true, so I don't know why Slurm's behavior would differ across OMPI versions.

Comparing the srun cmd line for starting the daemons between the two versions 
you cite, I don't see any difference in them. They should be identical. You can 
check for yourself by adding "--mca plm_base_verbose 5" to your mpirun cmd line 
(be sure you configured OMPI with --enable-debug).
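For illustration, a hedged sketch of such an invocation (the process count and application name here are placeholders, not from the original report):

```shell
# Raise the PLM (process lifecycle management) verbosity so mpirun prints
# the exact srun command line it uses to launch its daemons on the job's
# nodes. "./my_app" is a placeholder application.
mpirun --mca plm_base_verbose 5 -np 4 ./my_app
```

Running this under both OMPI versions lets you diff the two srun command lines directly.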

Given that the srun cmd lines are the same, I have no idea why they would 
invoke different spank plugins.


> On Nov 1, 2019, at 8:10 AM, Jordi A. Gómez via users 
> <users@lists.open-mpi.org> wrote:
> 
> Good day,
> 
> We have a cluster with several MPI distributions and SLURM serving as the 
> queue manager. We also have a SLURM Spank Plugin; it is simple, you just 
> define some functions in a library and SLURM loads and calls them at the 
> appropriate points. 
> 
> The issue arises with OpenMPI 4.0.1 (and possibly later) and the MPIRUN 
> command. As far as I know, no other combination produces it. If your job has 
> a reservation of 2 or more nodes, and you launch an SBATCH script with 
> MPIRUNs inside, the node reading the sbatch script doesn't follow the normal 
> Spank Plugin call pipeline. It misses the function "slurm_spank_user_init" 
> after an MPIRUN:
> 
> This is the OpenMPI 3.1.4 version:
> srun: function slurm_spank_init
> srun: function slurm_spank_init_post_opt
> srun: function slurm_spank_local_user_init
> remote: function slurm_spank_user_init
> srun: slurm_spank_exit
> 
> This is the OpenMPI 4.0.1 version:
> srun: function slurm_spank_init
> srun: function slurm_spank_init_post_opt
> srun: function slurm_spank_local_user_init
> srun: slurm_spank_exit
> 
> It is as if it says "ok, I'm currently reading the SBATCH script, I don't 
> have to initialize the user again". This is causing problems for us because 
> we rely on that specific function.
> 
> Also, in this node application instance, we are missing some SLURM's 
> environment variables such as SLURM_STEP_ID, SLURM_STEP_NODELIST, 
> SLURM_STEP_NUM_NODES, SLURM_STEP_NUM_TASKS, SLURM_STEP_TASKS_PER_NODE...
> 
> I would like to know more about this, because if it is perfectly normal 
> behavior and will remain, I will have to make changes to the plugin.
> 
> Thank you,
> Jordi.

