Véronique,

This is not what I expected; I was thinking slurmd -C would return TmpDisk=204000,
or more probably 129186 as seen in the slurmctld log.
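(If I remember correctly, slurmd derives TmpDisk from the size, in MB, of the filesystem mounted at the TmpFS path, so assuming TmpFS=/local/scratch is really in effect on that node, something like

df -m /local/scratch

should report roughly 204800 1M-blocks there.)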

I suppose that you already checked the slurmd logs on tars-XXX?
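(Where the slurmd log ends up depends on SlurmdLogFile in your configuration; assuming a typical location such as /var/log/slurm/slurmd.log, something like

grep -i tmp /var/log/slurm/slurmd.log

on tars-XXX might show whether slurmd logged anything about the temporary storage at startup.)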

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 2:09 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low 
TmpDisk

Hello Pierre-Marie,

First, thank you for your hint.
I just tried.

>slurmd -C
NodeName=tars-XXX CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 
ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=0-20:50:54

The value for TmpDisk is erroneous. I do not know what could be causing this,
since the operating system's df command gives the right values.

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.6G  1.9G  47% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M   84K  500M   1% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch


Could slurmd be mixing up the tmpfs on /tmp with /local/scratch?
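(One way to compare the two candidates in the same unit slurmd uses, MB, would be something like:

df -m /tmp /local/scratch

The 500 reported by slurmd -C does match the 500M /tmp tmpfs above rather than /local/scratch.)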

I tried the same thing on another similar node (tars-XXX-1) and got:

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.7G  1.8G  49% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M  5.7M  495M   2% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch

and

slurmd -C
NodeName=tars-XXX-1 CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 
ThreadsPerCore=1 RealMemory=258373 TmpDisk=500
UpTime=101-21:34:14


So, slurmd -C gives exactly the same answer, but this node doesn't go into DRAIN
state; it works perfectly.
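(It might also be worth comparing what the controller has recorded for the healthy node, e.g.:

scontrol show node tars-XXX-1 | grep TmpDisk

against the output for tars-XXX shown below.)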

Thank you again for your help.

Regards,

Véronique



--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03


From: "Le Biot, Pierre-Marie" 
<pierre-marie.leb...@hpe.com<mailto:pierre-marie.leb...@hpe.com>>
Reply-To: slurm-dev <slurm-dev@schedmd.com<mailto:slurm-dev@schedmd.com>>
Date: Tuesday, 10 October 2017 at 13:53
To: slurm-dev <slurm-dev@schedmd.com<mailto:slurm-dev@schedmd.com>>
Subject: [slurm-dev] RE: Node always going to DRAIN state with reason=Low 
TmpDisk

Hi Véronique,

Did you check the result of slurmd -C on tars-XXX ?

Regards,
Pierre-Marie Le Biot

From: Véronique LEGRAND [mailto:veronique.legr...@pasteur.fr]
Sent: Tuesday, October 10, 2017 12:02 PM
To: slurm-dev <slurm-dev@schedmd.com<mailto:slurm-dev@schedmd.com>>
Subject: [slurm-dev] Node always going to DRAIN state with reason=Low TmpDisk

Hello,

I have a problem with one node in our cluster. It is configured exactly like all the
other nodes (200 GB of temporary storage).

Here is what I have in slurm.conf:

# COMPUTES
TmpFS=/local/scratch

# NODES
GresTypes=disk,gpu
ReturnToService=2
NodeName=DEFAULT State=UNKNOWN Gres=disk:204000,gpu:0 TmpDisk=204000
NodeName=tars-[XXX-YYY] Sockets=2 CoresPerSocket=6 RealMemory=254373 
Feature=ram256,cpu,fast,normal,long,specific,admin Weight=20
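
(For reference, the TmpFS value the running slurmctld actually has in effect can be checked with something like

scontrol show config | grep -i tmpfs

which is handy for confirming the daemons picked up this setting.)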

The node that has the trouble is tars-XXX.

Here is what I have in gres.conf:

# Local disk space in MB (/local/scratch)
NodeName=tars-[ZZZ-UUU] Name=disk Count=204000

XXX is in the range [ZZZ-UUU], so tars-XXX is covered by this gres.conf line.

If I ssh to tars-XXX, here is what I get:

-sh-4.1$ df -hl
Filesystem      Size  Used Avail Use% Mounted on
slash_root      3.5G  1.6G  1.9G  47% /
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           500M   84K  500M   1% /tmp
/dev/sda1       200G   33M  200G   1% /local/scratch

/local/scratch is the directory for temporary storage.

The problem is that when I run

scontrol show node tars-XXX

I get:

NodeName=tars-XXX Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.00
   AvailableFeatures=ram256,cpu,fast,normal,long,specific,admin
   ActiveFeatures=ram256,cpu,fast,normal,long,specific,admin
   Gres=disk:204000,gpu:0
   NodeAddr=tars-XXX NodeHostName=tars-XXX Version=16.05
   OS=Linux RealMemory=254373 AllocMem=0 FreeMem=255087 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=204000 Weight=20 Owner=N/A 
MCS_label=N/A
   BootTime=2017-10-09T17:08:43 SlurmdStartTime=2017-10-09T17:09:57
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low TmpDisk [slurm@2017-10-10T11:25:04]


And in the slurmctld logs, I get the error message:
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: Node tars-XXX 
has low tmp_disk size (129186 < 204000)
2017-10-10T08:35:57+02:00 tars-master slurmctld[120352]: error: 
_slurm_rpc_node_registration node=tars-XXX: Invalid argument
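
(For what it's worth, if tars-XXX can briefly be taken out of production, stopping slurmd there and running it in the foreground with extra verbosity, e.g.

slurmd -D -vvv

might show what it computes for the temporary disk space when it registers with the controller.)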

I tried rebooting tars-XXX yesterday, but the problem is still there.
I also tried:
scontrol update NodeName=ClusterNode0 State=Resume
but the state went back to DRAIN after a while…

Does anyone have an idea of what could be causing the problem? My configuration
files seem correct, and there really are 200 GB free in /local/scratch on tars-XXX…

I thank you in advance for any help.

Regards,


Véronique







--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
