[slurm-users] update node config while jobs are running

2020-03-09 Thread Rundall, Jacob D
I need to update the configuration for the nodes in a cluster and I’d like to let jobs keep running while I do so. Specifically, I need to add RealMemory= to the node definitions (NodeName=). Is it safe to do this for nodes where jobs are currently running? Or do I need to make sure nodes are

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-09 Thread Chris Samuel
On 9/3/20 7:44 am, mike tie wrote: Specifically, how is slurmd -C getting that info?  Maybe this is a kernel issue, but other than lscpu and /proc/cpuinfo, I don't know where to look.  Maybe I should be looking at the slurmd source? It would be worth looking at what something like "lstopo"

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread MrBr @ GMail
> Ah. Looks like the --reboot option is telling slurmctld to put them in the CF state and wait for them to come back up. Slurmctld then waits for them to 'disconnect' and come back. Since they never reboot (therefore never disconnect), slurmctld keeps them in the CF state until the timeout

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
Ah. Looks like the --reboot option is telling slurmctld to put them in the CF state and wait for them to come back up. Slurmctld then waits for them to 'disconnect' and come back. Since they never reboot (therefore never disconnect), slurmctld keeps them in the CF state until the timeout

Re: [slurm-users] Preemption within same QOS

2020-03-09 Thread Relu Patrascu
We received no replies, so we solved the problem in house by writing a simple plugin based on the qos priority plugin. On Wed, Jan 22, 2020 at 2:50 PM Relu Patrascu wrote: > We're having a bit of a problem setting up slurm to achieve this: > > 1. Two QOSs, 'high' and 'normal'. > 2. Preemption

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread MrBr @ GMail
Hi Brian, The nodes work with Slurm without any issues until I try the "--reboot" option. I can successfully allocate the nodes or perform any other Slurm-related operation. > You may want to double check that the node is actually rebooting and that slurmd is set to start on boot. That's the problem, they

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
You may want to double check that the node is actually rebooting and that slurmd is set to start on boot. ResumeTimeoutReached, in a nutshell, means slurmd isn't talking to slurmctld. Are you able to log onto the node itself and see that it has rebooted? If so, try doing something like

Re: [slurm-users] slurmd -C showing incorrect core count

2020-03-09 Thread mike tie
Interesting. I'm still confused about where slurmd -C is getting the data. When I think of where the kernel stores info about the processor, I normally think of /proc/cpuinfo. (By the way, I am running CentOS 7 in the VM. The VM hypervisor is VMware.) /proc/cpuinfo does show 16 cores. I
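
For readers who want to cross-check what /proc/cpuinfo reports against the output of slurmd -C, here is a minimal Python sketch. It is not a description of how slurmd gathers its data (the thread leaves that open). The function name `summarize_cpuinfo` and the `CpuSummary` type are hypothetical, and the sketch assumes the x86 layout of /proc/cpuinfo, where each logical CPU is a blank-line-separated block with `processor`, `physical id`, and `core id` fields.

```python
from collections import namedtuple

# Hypothetical container for the three counts derived below.
CpuSummary = namedtuple("CpuSummary", "logical_cpus sockets physical_cores")


def summarize_cpuinfo(text):
    """Count logical CPUs, sockets, and physical cores in /proc/cpuinfo text.

    Assumes the x86 format: one block per logical CPU, blocks separated by
    blank lines, each line of the form "key<tab>: value".
    """
    logical_cpus = 0
    sockets = set()
    cores = set()
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a per-CPU block
        logical_cpus += 1
        physical_id = fields.get("physical id")
        core_id = fields.get("core id")
        if physical_id is not None:
            sockets.add(physical_id)
            if core_id is not None:
                # A physical core is identified by (socket, core id);
                # hyperthread siblings share the same pair.
                cores.add((physical_id, core_id))
    return CpuSummary(logical_cpus, len(sockets), len(cores))
```

On a Linux node this could be run as `summarize_cpuinfo(open("/proc/cpuinfo").read())`; a mismatch between these counts and what slurmd -C prints would suggest the two are reading the topology from different sources.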

[slurm-users] srun --reboot option is not working

2020-03-09 Thread MrBr @ GMail
Hi all, I'm trying to use the --reboot option of srun to reboot the nodes before allocation. However, the nodes are not being rebooted. The node gets stuck in the allocated# state as shown by sinfo, or CF as shown by squeue. The logs of slurmctld and slurmd show no relevant information, debug levels at