Hello, 

We are running a 10-node cluster in our lab and we are experiencing lag in job
allocation.

srun commands wait up to one minute for resource allocation even when several
nodes are idle. The same happens with sbatch scripts: jobs wait about one
minute for resource allocation even though idle nodes are available.

Our ControlMachine is on a virtual node. Compute nodes are all physical 
machines. 

In our config file we set these values:
FastSchedule=1 
SchedulerType=sched/backfill 

It seems that right after a reboot of the whole cluster, jobs are scheduled
quickly, and after a few weeks of uptime job scheduling slows down (at the
moment the ControlMachine uptime is 25 days). I'm not sure whether the two are
related.

Everything else looks in order, and there are no errors in the log files.

I'd be grateful for any hint or advice.

Thanks, 
Vladimir 






