I'm not entirely comfortable with the solution, as the problem truly is that we 
are doing what you asked - i.e., if you tell Slurm to bind tasks to a single 
core, then we live within it. The problem with your proposed fix is that we 
override whatever the user may have actually wanted - e.g., if the user told 
Slurm to bind us to 4 cores, then we override that constraint.

If you can come up with a way that we can launch the orteds in a manner that 
respects whatever directive was given, while still providing added flexibility, 
then great. Otherwise, I would say the right solution is for users not to set 
TaskAffinity when using mpirun.
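
One direction that might work is to relax the binding only when no explicit
directive was given. Purely as a sketch, and assuming SLURM advertises an
explicit user binding request through an environment variable such as
SLURM_CPU_BIND (an assumption that would need to be verified), the decision
could look something like this:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Hypothetical marker of an explicit user binding request. */
    const char *user_bind = getenv("SLURM_CPU_BIND");
    const char *base = "srun --nodes=1 --ntasks=1 --kill-on-bad-exit";

    if (user_bind != NULL && *user_bind != '\0') {
        /* The user told SLURM how to bind: leave the srun line alone so
         * the orted inherits exactly what was requested. */
        printf("%s orted ...\n", base);
    } else {
        /* No explicit directive: relax the binding so the orted's children
         * can spread over all cores available inside the allocation. */
        printf("%s --cpu_bind=none orted ...\n", base);
    }
    return 0;
}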


On Feb 12, 2014, at 2:42 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hello
> 
> I found that SLURM installations that use the cgroup plugin and have 
> TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all processes on 
> non-launch nodes are bound to a single core, which leads to quite poor 
> performance.
> The problem appears only when mpirun is used to start the parallel application 
> from a batch script, for example: mpirun ./mympi
> When srun with PMI is used instead, affinity is set properly: srun ./mympi.
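> 
> The wrong binding is easy to see with a small diagnostic program that prints 
> each rank's affinity mask. Just a sketch (mympi above stands for any MPI 
> program):
> 
> #define _GNU_SOURCE          /* for sched_getaffinity() and CPU_COUNT() */
> #include <sched.h>
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, len;
>     char host[MPI_MAX_PROCESSOR_NAME];
>     cpu_set_t mask;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(host, &len);
> 
>     CPU_ZERO(&mask);
>     sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = this process */
> 
>     /* Under mpirun with TaskAffinity=yes every rank on a non-launch node
>      * reports the same single allowed core. */
>     printf("rank %d on %s: %d allowed core(s)\n", rank, host, CPU_COUNT(&mask));
> 
>     MPI_Finalize();
>     return 0;
> }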
> 
> A closer look shows that the reason lies in the way Open MPI uses srun to 
> launch the ORTE daemons. Here is an example of the command line:
> srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02 orted -mca ess 
> slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid 
>  
> Passing --nodes=1 --ntasks=1 to SLURM means that you want to start one task, 
> and (with TaskAffinity=yes) it will be bound to one core. The orted then uses 
> this affinity as the base for all of the branch processes it spawns. If I 
> understand correctly, the problem with going through srun is that saying srun 
> --nodes=1 --ntasks=4 would make SLURM spawn 4 independent orted processes 
> bound to different cores, which is not what we need either.
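> 
> The inheritance itself is just normal POSIX behaviour; a toy example (not 
> Open MPI code) shows how a forked child starts with its parent's mask:
> 
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> int main(void)
> {
>     cpu_set_t mask;
> 
>     /* Pretend SLURM bound this process to core 0 (TaskAffinity=yes). */
>     CPU_ZERO(&mask);
>     CPU_SET(0, &mask);
>     sched_setaffinity(0, sizeof(mask), &mask);
> 
>     if (fork() == 0) {                        /* the "branch process" */
>         CPU_ZERO(&mask);
>         sched_getaffinity(0, sizeof(mask), &mask);
>         printf("child is allowed %d core(s)\n", CPU_COUNT(&mask));  /* 1 */
>         return 0;
>     }
>     wait(NULL);
>     return 0;
> }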
> 
> I found that disabling CPU binding works well as a quick hack for the cgroup 
> plugin. Since the job runs inside a cgroup with core-access restrictions, the 
> spawned branch processes run under the node scheduler's control on all 
> allocated cores. The command line looks like this:
> srun --cpu_bind=none --nodes=1 --ntasks=1 --kill-on-bad-exit 
> --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca 
> orte_ess_vpid 
> 
> This solution probably won't work with the SLURM task/affinity plugin. It may 
> also be a bad idea when strong affinity is desirable.
> 
> My patch against the stable Open MPI version (1.6.5) is attached to this 
> e-mail. I will try to come up with a more reliable solution, but I need more 
> time and would first like to hear the opinions of the Open MPI developers.
> 
> -- 
> Best regards, Artem Y. Polyakov (Поляков Артем Юрьевич)
> <affinity.patch>
