I'm not entirely comfortable with that solution, as the real problem is that we are doing what you asked - i.e., if you tell Slurm to bind tasks to a single core, then we live within it. The problem with your proposed fix is that it overrides whatever the user may actually have wanted - e.g., if the user told Slurm to bind us to 4 cores, that constraint would be lost.
If you can come up with a way to launch the orteds that respects whatever directive was given, while still providing the added flexibility, then great. Otherwise, I would say the right solution is for users not to set TaskAffinity when using mpirun.

On Feb 12, 2014, at 2:42 AM, Artem Polyakov <artpo...@gmail.com> wrote:

> Hello
>
> I found that SLURM installations that use the cgroup plugin and have
> TaskAffinity=yes in cgroup.conf have problems with Open MPI: all processes on
> a non-launch node are assigned to one core. This leads to quite poor
> performance.
> The problem appears only when mpirun is used to start the parallel
> application in a batch script, for example: mpirun ./mympi
> When using srun with PMI, affinity is set properly: srun ./mympi
>
> A close look shows that the reason lies in the way Open MPI uses srun to
> launch the ORTE daemons. Here is an example of the command line:
> srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02 orted -mca ess
> slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid
>
> Passing --nodes=1 --ntasks=1 to SLURM means that you want to start one task,
> and (with TaskAffinity=yes) it will be bound to one core. The orted then uses
> this affinity as the base for all the branch processes it spawns. If I
> understand correctly, the problem with using srun is that if you say
> srun --nodes=1 --ntasks=4, SLURM will spawn 4 independent orted processes
> bound to different cores, which is not what we really need.
>
> I found that disabling cpu binding works well as a quick hack for the cgroup
> plugin. Since the job runs inside a cgroup that restricts core access, the
> spawned branch processes run under the node scheduler's control on all
> allocated cores. The command line looks like this:
> srun --cpu_bind=none --nodes=1 --ntasks=1 --kill-on-bad-exit
> --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
> orte_ess_vpid
>
> This solution probably won't work with the SLURM task/affinity plugin, and it
> may be a bad idea when strong affinity is desirable.
>
> My patch against the stable Open MPI version (1.6.5) is attached to this
> e-mail. I will try to come up with a more robust solution, but I need more
> time and would first like to hear the opinion of the Open MPI developers.
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> <affinity.patch>
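
For anyone who wants to observe the behaviour described above, here is a minimal batch-script sketch; the script name show_affinity.sh and the node/task counts are illustrative assumptions, not taken from the original messages. Each rank simply prints the CPU set the kernel allows it to run on, so the srun and mpirun launch paths can be compared directly.

    #!/bin/bash
    #SBATCH --nodes=2 --ntasks-per-node=4

    # Helper that reports this process's host, rank, and allowed CPU list.
    cat > show_affinity.sh <<'EOF'
    #!/bin/sh
    rank=${OMPI_COMM_WORLD_RANK:-$SLURM_PROCID}
    echo "$(hostname) rank=$rank: $(grep Cpus_allowed_list /proc/self/status)"
    EOF
    chmod +x show_affinity.sh

    # Launched with srun/PMI, each rank is bound by the task plugin itself.
    srun ./show_affinity.sh

    # Launched with mpirun, ranks on a non-launch node would be expected to
    # report the single core inherited from their orted when TaskAffinity=yes.
    mpirun ./show_affinity.sh

With the --cpu_bind=none workaround from the attached patch, the mpirun case should instead report the full set of cores allocated to the job's cgroup.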