Re: [slurm-users] (no subject)

2021-07-30 Thread Chris Samuel
On Friday, 30 July 2021 11:21:19 AM PDT Soichi Hayashi wrote:

> I am running slurm-wlm 17.11.2

You are on a truly ancient version of Slurm there, I'm afraid (there have been
4 major releases and over 13,000 commits since that was tagged in January 2018).
I would strongly recommend you try to get to a more recent release to pick up
those bug fixes and improvements. A quick scan of the NEWS file shows a number
that are cloud-related: https://github.com/SchedMD/slurm/blob/slurm-20.11/NEWS

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] [External] Re: Down nodes

2021-07-30 Thread Soichi Hayashi
Brian,

Yes, slurmd is not running on that node because the node itself is not
there anymore (the whole VM is gone!). When the node is no longer in use,
Slurm automatically runs the slurm_suspend.sh script, which removes the whole
node (VM) by running "openstack server delete $host". There is no server/VM,
no IP address, no DNS name, nothing. "slurm4-compute9" only exists as a
hypothetical node that can be launched in the future in case there are more
jobs to run. That's how the "cloud" partition works, right?

[slurm.conf]
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
SuspendTime=600 #time in seconds before an idle node is suspended
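
Roughly, the suspend path looks like this (a simplified sketch pieced together
from the commands mentioned in this thread, not the actual script; the NodeAddr
reset at the end is an assumption about what the controller expects, not what
our current script does):

#!/bin/bash
# slurm_suspend.sh <hostlist> -- sketch only
# Expand the hostlist Slurm passes in, e.g. slurm4-compute[2-3].
for host in $(scontrol show hostnames "$1"); do
    # Tear down the cloud VM backing this node.
    openstack server delete "$host"
    # Assumption: point NodeAddr/NodeHostname back at the node name
    # instead of "(null)", so the controller stops pinging a stale address.
    scontrol update nodename="$host" nodeaddr="$host" nodehostname="$host"
done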

I am wondering... maybe something went wrong when Slurm ran slurm_suspend.sh,
so Slurm *thinks* the node is still there, tries to ping it, fails
(obviously...), and marks it as DOWN?
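
One quick way to see what the controller still records for such a node
(a diagnostic sketch, using slurm4-compute9 as the example):

scontrol show node slurm4-compute9
# Check the NodeAddr, NodeHostName and State fields -- if NodeAddr still
# points at the deleted VM, the ping agent will keep failing against it.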

I don't know whether my theory is right, but just to get our cluster
going again, is there a way to force Slurm to forget about a node that it
"suspended" earlier? Is there a command like "scontrol forcesuspend
node=$id"?

Thank you for your help!

-soichi


Re: [slurm-users] Down nodes

2021-07-30 Thread Brian Andrus

That 'not responding' is the issue and usually means one of two things (quick checks for both are sketched below):

1) slurmd is not running on the node
2) something on the network is stopping the communication between the
node and the master (firewall, SELinux, congestion, bad NIC, routes, etc.)
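
Quick ways to check both (a sketch; <node> and <controller> are placeholders,
and 6817/6818 are the default Slurm ports, matching the config later in this
thread):

# 1) Is slurmd actually running on the node?
ssh <node> systemctl status slurmd

# 2) Can the node and the controller reach each other on the Slurm ports?
#    (assumes nc/netcat is available)
nc -zv <controller> 6817   # run on the compute node
nc -zv <node> 6818         # run on the controller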


Brian Andrus


Re: [slurm-users] Down nodes

2021-07-30 Thread Soichi Hayashi
Brian,

Thank you for your reply, and thanks for setting the email subject. I forgot
to edit it before I sent it!

I am not sure how to reply to your reply properly, but I hope this makes it
to the right place.

I've updated slurm.conf to increase the controller debug level
> SlurmctldDebug=5
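
(As an aside, the debug level can usually also be bumped on the fly without
editing slurm.conf -- a sketch, assuming scontrol is run on the controller:)

scontrol setdebug debug    # or a numeric level, e.g. scontrol setdebug 5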

I now see additional log output (debug).

[2021-07-30T22:42:05.255] debug:  Spawning ping agent for slurm4-compute[2-6,10,12-14]
[2021-07-30T22:42:05.256] error: Nodes slurm4-compute[9,15,19-22,30] not responding, setting DOWN

It's still very sparse, but it looks like Slurm is trying to ping nodes
that have already been removed (they don't exist anymore, since they are
removed by the slurm_suspend.sh script).

I tried sinfo -R but it doesn't really give much info..

$ sinfo -R
REASON           USER   TIMESTAMP            NODELIST
Not responding   slurm  2021-07-30T22:42:05  slurm4-compute[9,15,19-22,30]

These machines are gone, so of course they do not respond.

$ ping slurm4-compute9
ping: slurm4-compute9: Name or service not known

This is expected.

Why does Slurm keep trying to contact a node that has already been removed?
slurm_suspend.sh does the following to "remove" the node from the partition:
> scontrol update nodename=${host} nodeaddr="(null)"
Maybe this isn't the correct way to do it? Is there a way to force Slurm to
forget about the node? I tried "scontrol update node=$node state=idle", but
this only works for a few minutes, until Slurm's ping agent kicks in and
marks them down again.
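
One thing that might make the controller stop chasing the stale address
(untested here, and the idea of resetting the address back to the bare node
name rather than "(null)" is an assumption on my part) would be:

scontrol update nodename=slurm4-compute9 nodeaddr=slurm4-compute9 nodehostname=slurm4-compute9
scontrol update nodename=slurm4-compute9 state=resume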

Thanks!!
Soichi





Re: [slurm-users] History of pending jobs

2021-07-30 Thread Ole Holm Nielsen

On 30-07-2021 20:42, Glenn (Gedaliah) Wolosh wrote:
> I'm interested in getting an idea of how long jobs were pending in a
> particular partition. Is there any magic to sreport or sacct that can
> generate this info?
>
> I could also use something like "sreport cluster utilization" broken
> down by partition.


The "topreports" and "slurmacct" tools let you specify partitions, and 
also report the average waiting time in the queue, for users and groups:

https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
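
For a quick ad-hoc look without extra tooling, something along these lines may
also do (a sketch; the partition name and time window are placeholders, and the
pending time is simply Start minus Submit):

sacct -a -X --partition=<partition> \
      --starttime=2021-07-01 --endtime=2021-07-31 \
      --format=JobID,Partition,Submit,Start,Elapsed,State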

I hope this helps.

/Ole



Re: [slurm-users] History of pending jobs

2021-07-30 Thread Fulcomer, Samuel
XDMoD can do that for you, but bear in mind that wait/pending time by
itself may not be particularly useful.

Consider the extreme scenario in which a user is only allowed to use one
node at a time but submits a thousand one-day jobs. Without any other
competition for resources, the average wait/pending time would be about five
hundred days: the k-th job waits k-1 days before it can start, so the mean
wait is (0 + 1 + ... + 999)/1000 ≈ 500 days, even though nothing is wrong.

On Fri, Jul 30, 2021 at 2:44 PM Glenn (Gedaliah) Wolosh 
wrote:

> I'm interested in getting an idea of how long jobs were pending in a
> particular partition. Is there any magic to sreport or sacct that can
> generate this info?
>
> I could also use something like "sreport cluster utilization" broken down
> by partition.
>
> Any help would be appreciated.
>
>
>
> *Glenn (Gedaliah) Wolosh, Ph.D.*
> Ass't Director Research Software and Cloud Computing
> Acad & Research Computing Systems
> gwol...@njit.edu • (973) 596-5437
>
> A Top 100 National University
> *U.S. News & World Report*


[slurm-users] History of pending jobs

2021-07-30 Thread Glenn (Gedaliah) Wolosh
I'm interested in getting an idea of how long jobs were pending in a particular
partition. Is there any magic to sreport or sacct that can generate this info?

I could also use something like "sreport cluster utilization" broken down by
partition.

Any help would be appreciated.



Glenn (Gedaliah) Wolosh, Ph.D.
Ass't Director Research Software and Cloud Computing
Acad & Research Computing Systems
gwol...@njit.edu • (973) 596-5437

A Top 100 National University
U.S. News & World Report







[slurm-users] (no subject)

2021-07-30 Thread Soichi Hayashi
Hello. I need help with troubleshooting our Slurm cluster.

I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public cloud
infrastructure (Jetstream) using an elastic computing mechanism (
https://slurm.schedmd.com/elastic_computing.html). Our cluster works for
the most part, but for some reason a few of our nodes constantly go into
the "down" state.

PARTITION AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all     10     idle~  slurm9-compute[1-5,10,12-15]
cloud*    up     2-00:00:00  1-infinite  no    YES:4     all      5     down   slurm9-compute[6-9,11]

The only thing I see in the Slurm log is this:

[2021-07-30T15:10:55.889] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING to=RESUME
[2021-07-30T15:21:37.339] Invalid node state transition requested for node slurm9-compute6 from=COMPLETING* to=RESUME
[2021-07-30T15:27:30.039] update_node: node slurm9-compute6 reason set to: completing
[2021-07-30T15:27:30.040] update_node: node slurm9-compute6 state set to DOWN
[2021-07-30T15:27:40.830] update_node: node slurm9-compute6 state set to IDLE
..
[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN

With elastic computing, any unused nodes are automatically removed
(by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So nodes are
*expected* to not respond once they are removed, but they should not be
marked as DOWN; they should simply be set to "idle".

To work around this issue, I am running the following cron job.

0 0 * * * scontrol update node=slurm9-compute[1-30] state=resume

This "works" somewhat.. but our nodes go to "DOWN" state so often that
running this every hour is not enough.
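
A slightly more targeted variant of that workaround (a sketch -- it assumes
every node sinfo reports as "down" in this partition is really just a
torn-down cloud node and is safe to resume):

#!/bin/bash
# resume_down_cloud_nodes.sh -- sketch; run from cron every few minutes
down=$(sinfo -h -p cloud -t down -o "%N")
if [ -n "$down" ]; then
    scontrol update nodename="$down" state=resume
fi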

Here is the full content of our slurm.conf

root@slurm9:~# cat /etc/slurm-llnl/slurm.conf
ClusterName=slurm9
ControlMachine=slurm9

SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=1
Prolog=/usr/local/sbin/slurm_prolog.sh

#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
#make slurm a little more tolerant here
MessageTimeout=30
TCPTimeout=15
BatchStartTimeout=20
GetEnvTimeout=20
InactiveLimit=0
MinJobAge=604800
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#FastSchedule=0

# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log

#CLOUD CONFIGURATION
PrivateData=cloud
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeRate=1 #number of nodes per minute that can be created; 0 means no limit
ResumeTimeout=900 #max time in seconds between ResumeProgram running and when the node is ready for use
SuspendRate=1 #number of nodes per minute that can be suspended/destroyed
SuspendTime=600 #time in seconds before an idle node is suspended
SuspendTimeout=300 #time between running SuspendProgram and the node being completely down
TreeWidth=30

NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24 RealMemory=60388
PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15] Default=YES MaxTime=48:00:00 State=UP Shared=YES

I appreciate your assistance!

Soichi Hayashi
Indiana University