Hello,

A colleague intimated that larger jobs tend to get starved out on our Slurm cluster. It's not a busy time at the moment, so it's difficult to test this properly, but back in November it was not unusual for a larger job to have to wait up to a week to start.
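For what it's worth, once things pick up again I was planning to quantify the wait from the accounting records with something along these lines (dates are placeholders; the gap between Submit and Start gives the queue wait per job):

  sacct -a -X -S <start> -E <end> --format=JobID,NNodes,Submit,Start,State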
I've extracted the key scheduling configuration from our slurm.conf (below) and would appreciate your comments, please. Even at the busiest of times we notice many small compute jobs executing on the cluster, started either by the main scheduler or by backfill. Looking at the scheduling configuration, do you think I'm favouring small jobs too much? That is, for example, should I increase PriorityWeightJobSize to encourage larger jobs to run? I was very keen not to starve out small and medium jobs, but perhaps there is too much emphasis on them in our setup.

My colleague is from a Moab background, and in that respect he was surprised not to see nodes being reserved for pending jobs. It could be that Slurm works differently, making efficient use of the cluster by backfilling more aggressively than Moab; certainly we see a great deal of backfill activity. Does anyone understand the mechanism Slurm uses to reserve nodes/resources for jobs, or know where to look for that type of information?

The relevant settings are:

SchedulerType=sched/backfill
SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
PriorityFavorSmall=NO
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=1000000
PriorityWeightAge=100000
PriorityWeightPartition=0
PriorityWeightJobSize=100000
PriorityWeightQOS=10000
PriorityMaxAge=7-0

Best regards,
David
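P.S. To make the weights above easier to compare: as I understand the multifactor priority plugin, a job's priority is roughly the weighted sum

  Job_priority =  PriorityWeightAge       * age_factor
                + PriorityWeightFairshare * fairshare_factor
                + PriorityWeightJobSize   * jobsize_factor
                + PriorityWeightPartition * partition_factor
                + PriorityWeightQOS       * qos_factor

with each factor normalised to the range 0.0-1.0, and 'sprio -l' shows the per-job breakdown of these terms, which is how I've been comparing the weights.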