[slurm-users] why is the performance of the mpi allreduce job lower than the expected level?

2020-12-24 Thread hu...@sugon.com
Dear all,
We tested an MPI allreduce job in three modes (srun-dtcp, mpirun-slurm, 
mpirun-ssh) and found that the job running time in the mpirun-ssh mode is 
shorter than in the other two modes.
We have set the following parameters:
/usr/lib/systemd/system/slurmd.service:
LimitMEMLOCK=infinity
LimitSTACK=infinity
/etc/sysconfig/slurmd:
ulimit -l unlimited
ulimit -s unlimited
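A quick way to confirm that these limits actually reach job steps on the
compute nodes (a hedged check, not part of the original configuration) is
something like:

srun -N1 -n1 bash -c 'ulimit -l; ulimit -s'
# both should print "unlimited" if the slurmd limits are inherited by the step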
We want to know if this is normal. Will features such as cgroup and 
pam_slurm_adopt limit job performance? And how can we improve the efficiency 
of Slurm jobs?
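One way to see which of those plugins are actually active (a minimal check,
assuming access to scontrol on a login node) is:

scontrol show config | grep -Ei 'proctracktype|taskplugin|jobacctgather'
# shows whether cgroup-based process tracking, task/CPU binding and job
# accounting gathering are enabled in this cluster's configuration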
Here are the test results:
size        srun-dtcp       mpirun-slurm    mpirun-ssh
00.050.060.05
42551.83355.67281.02
81469.322419.97139
1667.41151.871000.15
3273.31143.15126.22
64107.14111.6126.3
12877.12473.62344.36
25646.39417.9565.09
51292.84260.6990.5
102497.9137.13112.3
2048286.27233.21169.59
4096155.69343.59160.9
8192261.02465.78151.29
1638412518.0413363.713869.58
3276822071.611398.214842.32
655366041.2.953368.58
13107210436.1118071.5310909.76
26214413802.2224728.788263.53
52428813086.2616394.724825.51
104857628193.0815943.266490.29
209715263277.7324411.5815361.7
419430458538.0560516.1533955.49

(1)srun-dtcp job:
#!/bin/bash
#SBATCH -J test
#SBATCH -N 32
#SBATCH --ntasks-per-node=30
#SBATCH -p seperate

NP=$SLURM_NTASKS
srun --mpi=pmix_v3 /public/software/benchmark/imb/hpcx/2017/IMB-MPI1 -npmin $NP Allreduce

(2)mpirun-slurm job:
#!/bin/bash
#SBATCH -J test
#SBATCH -N 32
#SBATCH --ntasks-per-node=30
#SBATCH -p seperate

NP=$SLURM_NTASKS
mpirun /public/software/benchmark/imb/hpcx/2017/IMB-MPI1 -npmin $NP Allreduce

(3)mpirun-ssh job:
#!/bin/bash
#SBATCH -J test
#SBATCH -N 32
#SBATCH --ntasks-per-node=30
#SBATCH -p seperate

env | grep SLURM > env.log 
scontrol show hostname > nd.$SLURM_JOBID
NNODE=$SLURM_NNODES
NP=$SLURM_NTASKS

mpirun -np $NP -machinefile nd.$SLURM_JOBID -map-by ppr:30:node \
-mca plm rsh -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent $NNODE \
/public/software/benchmark/imb/hpcx/2017/IMB-MPI1 -npmin $NP Allreduce



Best wishes!
menglong





Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Paul Edmon
We are the same way, though we tend to keep pace with minor releases.  
We typically wait until the .1 release of a new major release before 
considering an upgrade so that many of the bugs are worked out.  We then 
have a test cluster on which we install the release and run a few test jobs 
to make sure things are working, usually MPI jobs, as they tend to hit 
most of the features of the scheduler.


We also like to stay current with releases as there are new features we 
want, or features we didn't know we wanted but our users find and start 
using.  So our general methodology is to upgrade to the latest minor 
release at our next monthly maintenance.  For major releases we will 
upgrade at our next monthly maintenance after the .1 release is out 
unless there is a show-stopping bug that we run into in our own testing, 
at which point we file a bug with SchedMD and get a patch.


-Paul Edmon-

On 12/24/2020 1:57 AM, Chris Samuel wrote:

On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote:


Thanks to several helpful members on this list, I think I have a much better
handle on how to upgrade Slurm. Now my question is, do most of you upgrade
with each major release?

We do, though not immediately and not without a degree of testing on our test
systems.  One of the big reasons for us upgrading is that we've usually paid
for features in Slurm for our needs (for example in 20.11 that includes
scrontab so users won't be tied to favourite login nodes, as well as  the
experimental RPC queue code due to the large numbers of RPCs our systems need
to cope with).

I also keep an eye out for discussions of what other sites find with new
releases, so I'm following the current concerns about 20.11 and the change
in behaviour for job steps that do the following (expanding NVIDIA's example
slightly):

#SBATCH --exclusive
#SBATCH -N2
srun --ntasks-per-node=1 python multi_node_launch.py

which (if I'm reading the bugs correctly) fails in 20.11 as that srun no
longer gets all the allocated resources, instead just getting the default of
--cpus-per-task=1, which also affects things like mpirun in OpenMPI built
with Slurm support (as it effectively calls "srun orted" and that "orted"
launches the MPI ranks, so in 20.11 it only has access to a single core for
them all to fight over).  Again - if I'm interpreting the bugs correctly!
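A sketch of one possible workaround under that reading of the bugs (an
assumption on my part, not something we have tested; SLURM_CPUS_ON_NODE is
used here just to hand the step an explicit CPU count):

#!/bin/bash
#SBATCH --exclusive
#SBATCH -N2
# explicitly request the CPUs for the step rather than relying on the old
# implicit "whole allocation" default
srun --ntasks-per-node=1 --cpus-per-task=$SLURM_CPUS_ON_NODE python multi_node_launch.py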

I don't currently have a test system that's free to try 20.11 on, but
hopefully early in the new year I'll be able to test this out to see how much
of an impact this is going to have and how we will manage it.

https://bugs.schedmd.com/show_bug.cgi?id=10383
https://bugs.schedmd.com/show_bug.cgi?id=10489

All the best,
Chris




Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Chris Samuel

On 24/12/20 6:24 am, Paul Edmon wrote:

We then have a test cluster on which we install the release and run a few 
test jobs to make sure things are working, usually MPI jobs as they tend 
to hit most of the features of the scheduler.


One thing I meant to mention last night was that we use ReFrame from 
CSCS as the test framework for our systems.  Our user support folks 
maintain our local tests, as they're best placed to understand the user 
requirements that need coverage, and we feed our system-facing 
requirements to them so they can add tests for that side too.


https://reframe-hpc.readthedocs.io/
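A typical invocation looks something like this (a hedged sketch; the site
configuration file and check directory names are assumptions, not our
actual setup):

reframe -C site-config.py -c checks/ -r
# -C selects the site configuration (systems, partitions, scheduler),
# -c the directory of regression checks, -r actually runs them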

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] trying to add gres

2020-12-24 Thread Erik Bryer
Hello List,

I am trying to change:
NodeName=saga-test02 CPUS=2 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 
RealMemory=1800 State=UNKNOWN
to
NodeName=saga-test02 CPUS=2 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 
RealMemory=1800 State=UNKNOWN Gres=gpu:foolsgold:4
But I get this error once per second:
Dec 24 16:08:32 saga-test03 slurmctld[115409]: error: 
_slurm_rpc_node_registration node=saga-test02: Invalid argument

I made sure my slurm.conf is synchronized across machines. My intention is to 
add some arbitrary gres for testing purposes.

Thanks,
Erik


Re: [slurm-users] trying to add gres

2020-12-24 Thread Chris Samuel

On 24/12/20 4:42 pm, Erik Bryer wrote:

I made sure my slurm.conf is synchronized across machines. My intention 
is to add some arbitrary gres for testing purposes.


Did you update your gres.conf on all the nodes to match?
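For a purely artificial gres with no real devices behind it, the matching
gres.conf entry might look something like this (a sketch only; whether a gpu
gres is accepted without File= lines depends on the Slurm version and
configuration):

# gres.conf, kept in sync on all nodes just like slurm.conf
NodeName=saga-test02 Name=gpu Type=foolsgold Count=4

# slurm.conf also needs the gres type declared, e.g.
# GresTypes=gpu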

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA