Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Chris Samuel
On Friday, 13 December 2019 7:01:48 AM PST Christopher Benjamin Coffey wrote:

> Maybe that setting is just not included in the default list of
> settings shown? That seems counterintuitive given this in the man page for
> sacctmgr:
> 
> show  []
>   Display information about the specified entity.  By default,
> all entries are displayed, you can narrow results by specifying SPECS in
> your query.  Identical to the list command.
> 
> Thoughts? Thanks!

I _suspect_ what that's saying is that it has a default list of columns that you 
can narrow, not that specifying a column there will make it appear if it's not 
part of the default list.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Chris Samuel

On 13/12/19 12:19 pm, Christopher Benjamin Coffey wrote:


error: persistent connection experienced an error


Looking at the source code that comes from here:

if (ufds.revents & POLLERR) {
error("persistent connection experienced an error");
return false;
}

So your TCP/IP stack reported a problem with an existing connection.

That's very odd if you're on the same box.

If you are on a large system or are putting a lot of small jobs through 
quickly, then it's worth checking out the Slurm HTC guide for networking:


https://slurm.schedmd.com/high_throughput.html
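
For what it's worth, the kernel-side tuning that guide discusses looks roughly 
like the sketch below (parameters and values are illustrative only; check the 
guide and your own workload before changing anything):

# example sysctl knobs of the sort the HTC guide covers
sysctl -w net.core.somaxconn=2048
sysctl -w net.ipv4.tcp_max_syn_backlog=4096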

Good luck.

Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Christopher Benjamin Coffey
Hi All,

I wonder if any of you have seen these errors in slurmdbd.log

error: persistent connection experienced an error

When we see these errors, we also see job errors that appear related to 
accounting in Slurm, like:

slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should 
never happen
slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should 
never happen
srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource 
temporarily unavailable

I haven't been able to figure out what makes slurmdbd get into this condition. 
The Slurm controller and slurmdbd are on the same box, so it's especially odd 
that slurmdbd can't communicate with slurmctld. While we figure this out, we 
have begun restarting slurmctld and slurmdbd every day to try to keep them 
"in sync".

Has anyone seen this? Any thoughts? Maybe the single port shown by:

sacctmgr list cluster

becomes overwhelmed at times? We have a range of ports for the controller to be 
contacted on. Maybe the DB should try another port if that’s the issue?

SlurmctldPort=6900-6950

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 



Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Brian W. Johanson
If those sruns are wrapped in an salloc allocation, they work correctly. The 
first srun can be eliminated by configuring SallocDefaultCommand for salloc 
(disabled in this example with --no-shell):
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty 
$SHELL"




[user@login005 ~]$ salloc -p GPU --gres=gpu:p100:1 --no-shell
salloc: Good day
salloc: Pending job allocation 7052366
salloc: job 7052366 queued and waiting for resources
salloc: job 7052366 has been allocated resources
salloc: Granted job allocation 7052366
[user@login005 ~]$ srun --jobid 7052366 --gres=gpu:0 --pty bash
[user@gpu045 ~]$ nvidia-smi
No devices were found
[user@gpu045 ~]$ srun nvidia-smi
Fri Dec 13 14:19:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[user@gpu045 ~]$ exit
exit
[user@login005 ~]$ scancel 7052366
[user@login005 ~]$








On 12/13/19 11:48 AM, Kraus, Sebastian wrote:

Dear Valantis,
thanks for the explanation. But I have to correct you about the second 
alternative approach:
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il
srun --gres=gpu:1 -l hostname

Naturally, this does not work, and as a consequence the "inner" srun job step 
throws an error about the generic resource not being available/allocatable:
user@frontend02#-bash_4.2:~:[2]$ srun -pgpu -N1 -n4 --time=00:30:00 --mem=5G 
--gres=gpu:0 -Jjobname --pty /bin/bash -il
user@gpu006#bash_4.2:~:[1]$  srun --gres=gpu:1 hostname
srun: error: Unable to create step for job 18044554: Invalid generic resource 
(gres) specification

Test it yourself. ;-)

Best
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin


Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kr...@tu-berlin.de



From: Chrysovalantis Paschoulas 
Sent: Friday, December 13, 2019 13:05
To: Kraus, Sebastian
Subject: Re: [slurm-users] srun: job steps and generic resources

Hi Sebastian,

the first srun uses the gres you requested and the second waits for it
to be available again.

You have to do either
```
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il

srun --gres=gpu:0 -l hostname
```

or
```
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il

srun --gres=gpu:1 -l hostname
```

Best Regards,
Valantis


On 13.12.19 12:44, Kraus, Sebastian wrote:

Dear all,
I am facing the following nasty problem.
I usually start interactive batch jobs via:
srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
Then, explicitly starting a job step within such a session via:
srun -l hostname
works fine.
But as soon as I add a generic resource to the job allocation, as with:
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty 
/bin/bash -il
an explicit job step launched as above via:
srun -l hostname
infinitely stalls/blocks.
I hope someone out there can explain this behavior to me.

Thanks and best
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Email: sebastian.kr...@tu-berlin.de





[slurm-users] Efficiency of the profile influxdb plugin for graphing live job stats

2019-12-13 Thread Lech Nieroda
Hi,


I’ve been tinkering with the acct_gather_profile/influxdb plugin a bit in order 
to visualize the CPU and memory usage of live jobs.
Both the influxdb backend and Grafana dashboards seem like a perfect fit for 
our needs.

I’ve run into an issue though and made a crude workaround for it; maybe someone 
knows a better way?

A few words about InfluxDB and the influxdb plugin:
InfluxDB is a NoSQL database that organizes its data in „series“: unique 
combinations of a „measurement“ and a set of „tags“, which correspond roughly 
to tables and their indexed fields in relational-DB terms.
A single „series“ can reference a multitude of timestamped records, described 
further by non-indexed „fields“.
The acct_gather_profile/influxdb plugin defines its data points for each job 
task/step as follows:

Measurement: CPUTime          Tags: job, host, step, task   Fields: value   Timestamp
Measurement: CPUUtilization   Tags: job, host, step, task   Fields: value   Timestamp
…

e.g. a single record would look like:
CPUTime,job=12465711,step=0,task=3,host=node20307 value=20.80 1576054517

The default „Task“ Profile contains 8 such characteristics:  CPUTime, 
CPUUtilization, CPUFrequency, RSS, VMSize, Pages, ReadMB, WriteMB

This data structure means that for each job step, task or host, 8 unique 
„series“ are created, e.g. „CPUTime, job, step, task, host“, „CPUUtilization, 
job, step, task, host“, ...
Those „series“ then reference the timestamped values of the respective 
measurements. The „tags“ can be used to „group by“ in queries, e.g. the 
performance of a single job on a specified host.


OK, so what’s the problem?
There are two: the number of created „series“ and data redundancy.
InfluxDB limits the number of „series“ by default to 1 million, and for good 
reason: each „series“ increases RAM usage since it’s used as an index. 
The number of „series“, or „series cardinality“, is one of the most important 
factors determining memory usage; the influxdb manual considers a cardinality 
above 10 million as „probably infeasible“.
When you consider that each combination of a new job/host/step/task creates 8 
„series“, the default limit can be reached relatively quickly. Performance 
problems follow.
As to data redundancy: for each timestamp a large part of the same data is 
stored multiple times under different „measurements“.
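
To put rough numbers on it (purely illustrative): 10,000 jobs with on average 
a dozen step/task/host combinations each already give 10,000 x 12 x 8 = 
960,000 „series“, i.e. right at the default 1-million limit.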

The current workaround: store the 8 characteristics as „fields“ rather than 
„measurements“, thus creating 1 series per job/step/task/host rather than 8. It 
also reduces data redundancy, saving roughly 70%.

So a single „series“ would be:
Measurement: acct_gather_profile_task   Tags: job, step, task, host
Fields: CPUTime, CPUUtilization, CPUFrequency, RSS, VMSize, Pages, ReadMB, WriteMB   Timestamp
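
Under that schema a single record would then look roughly like this (the field 
values after CPUTime are made up for illustration, same line protocol as the 
example above):

acct_gather_profile_task,job=12465711,step=0,task=3,host=node20307 CPUTime=20.80,CPUUtilization=97.5,CPUFrequency=2301,RSS=1048576,VMSize=2097152,Pages=0,ReadMB=12.5,WriteMB=3.2 1576054517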

Another benefit is that identical „measurement“ names, e.g. „WriteMB“, which 
are used by both the task and the lustre/fs profile plugins, can be 
differentiated.

Further Ideas?

Kind regards,
Lech




Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Kraus, Sebastian
Dear Valantis,
thanks for the explanation. But I have to correct you about the second 
alternative approach:
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il
srun --gres=gpu:1 -l hostname

Naturally, this does not work, and as a consequence the "inner" srun job step 
throws an error about the generic resource not being available/allocatable:
user@frontend02#-bash_4.2:~:[2]$ srun -pgpu -N1 -n4 --time=00:30:00 --mem=5G 
--gres=gpu:0 -Jjobname --pty /bin/bash -il
user@gpu006#bash_4.2:~:[1]$  srun --gres=gpu:1 hostname
srun: error: Unable to create step for job 18044554: Invalid generic resource 
(gres) specification

Test it yourself. ;-)

Best
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin


Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kr...@tu-berlin.de



From: Chrysovalantis Paschoulas 
Sent: Friday, December 13, 2019 13:05
To: Kraus, Sebastian
Subject: Re: [slurm-users] srun: job steps and generic resources

Hi Sebastian,

the first srun uses the gres you requested and the second waits for it
to be available again.

You have to do either
```
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il

srun --gres=gpu:0 -l hostname
```

or
```
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il

srun --gres=gpu:1 -l hostname
```

Best Regards,
Valantis


On 13.12.19 12:44, Kraus, Sebastian wrote:
> Dear all,
> I am facing the following nasty problem.
> I usually start interactive batch jobs via:
> srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash 
> -il
> Then, explicitly starting a job step within such a session via:
> srun -l hostname
> works fine.
> But as soon as I add a generic resource to the job allocation, as with:
> srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname 
> --pty /bin/bash -il
> an explicit job step launched as above via:
> srun -l hostname
> infinitely stalls/blocks.
> I hope someone out there can explain this behavior to me.
>
> Thanks and best
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
>
> Email: sebastian.kr...@tu-berlin.de



Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Christopher Benjamin Coffey
Hey Chris,

Thanks! Yeah, my QOS name is billybob for testing. I believe I was setting it 
right, but I wasn't able to confirm it correctly.

sacctmgr update qos name=billybob set maxjobsaccrueperuser=8 -i

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob 
format=MaxJobsAccruePerUser
MaxJobsAccruePU 
--- 
  8

I guess it's getting set right, but I wonder why it's not shown by:

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob
  Name   Priority  GraceTimePreempt   PreemptExemptTime PreemptMode 
   Flags UsageThres UsageFactor   GrpTRES   
GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall   MaxTRES 
MaxTRESPerNode   MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU
 MaxTRESPA MaxJobsPA MaxSubmitPA   MinTRES 
-- -- -- -- --- --- 
 -- --- - 
- - --- - --- - 
-- - --- - - --- 
- - --- - 
  billybob  0   00:00:00 explorato+ cluster 
   1.00 


 

[ddd@radar ~ ]$ sacctmgr show qos where name=billybob 
format=maxjobsaccrueperuser
MaxJobsAccruePU 
--- 
  8

Maybe that setting is just not included in the default list of settings 
shown? That seems counterintuitive given this in the man page for sacctmgr:

show  []
  Display information about the specified entity.  By default, all 
entries are displayed, you can narrow results by specifying
  SPECS in your query.  Identical to the list command.

Thoughts? Thanks!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 12/12/19, 10:46 PM, "slurm-users on behalf of Chris Samuel" 
 wrote:

Hi Chris,

On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote:

> What am I missing?

It's just a setting on the QOS, not the user:

csamuel@cori01:~> sacctmgr show qos where name=regular_1 
format=MaxJobsAccruePerUser
MaxJobsAccruePU
---
   2

So any user in that QOS can only have 2 jobs ageing at any one time.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA





Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Brian W. Johanson
The gres resource is allocated by the first srun; the second srun is waiting for 
the gres allocation to be free.


If you were to replace that second srun with 'srun -l --gres=gpu:0 hostname' it 
will complete, but it will not have access to the GPUs.


You can use salloc instead of srun to create the allocation and then issue an 
'srun --gres=gpu:0 --pty bash'; the second srun will not hang, as the gres 
resource is available.
But you will not have access to the GPUs within that shell, as they are not 
allocated to that srun instance.


A workaround is to 'export SLURM_GRES=gpu:0' in the shell where the srun is 
hanging.
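
A rough sketch of the salloc variant (partition name and gres are illustrative):

salloc -p GPU --gres=gpu:1 --time=00:30:00
# shell step that holds no gres, so later steps that want the GPU can start
srun --gres=gpu:0 --pty bash
# run from inside that shell; this step gets the allocated GPU
srun nvidia-smi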


-b


On 12/13/19 6:44 AM, Kraus, Sebastian wrote:

Dear all,
I am facing the following nasty problem.
I usually start interactive batch jobs via:
srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
Then, explicitly starting a job step within such a session via:
srun -l hostname
works fine.
But as soon as I add a generic resource to the job allocation, as with:
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty 
/bin/bash -il
an explicit job step launched as above via:
srun -l hostname
infinitely stalls/blocks.
I hope someone out there can explain this behavior to me.

Thanks and best
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Email: sebastian.kr...@tu-berlin.de





[slurm-users] srun: job steps and generic resources

2019-12-13 Thread Kraus, Sebastian
Dear all,
I am facing the following nasty problem.
I usually start interactive batch jobs via:
srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
Then, explicitly starting a job step within such a session via:
srun -l hostname
works fine.
But as soon as I add a generic resource to the job allocation, as with:
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty 
/bin/bash -il
an explicit job step launched as above via:
srun -l hostname
infinitely stalls/blocks.
I hope someone out there can explain this behavior to me.

Thanks and best
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Email: sebastian.kr...@tu-berlin.de


Re: [slurm-users] sched

2019-12-13 Thread Steve Brasier
Thanks Alex - that is mostly how I understand it too. However, my
understanding from the docs (and the GCP example, actually) is that the
cluster isn't reconfigured in the sense of rewriting slurm.conf and
restarting the daemons (i.e. how you might manually resize a cluster); it's
just that nodes are marked by slurmctld as "powered down", even if the actual
instances are released back to the cloud. So my query still stands, I think.
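
For reference, by "power saving interface" I mean the usual slurm.conf hooks,
roughly like the sketch below (paths and timings are purely illustrative, not
taken from the slurm-gcp example):

SuspendProgram=/path/to/suspend.py   # called with the list of idle nodes to power down
ResumeProgram=/path/to/resume.py     # called with the list of nodes to bring back
SuspendTime=300                      # idle seconds before a node is powered down
ResumeTimeout=600                    # seconds allowed for a resumed node to respond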

regards
Steve

On Thu, 12 Dec 2019 at 17:08, Alex Chekholko  wrote:

> Hey Steve,
>
> I think it doesn't just "power down" the nodes but deletes the instances.
> So then when you need a new node, it creates one, then provisions the
> config, then updates the slurm cluster config...
>
> That's how I understand it, but I haven't tried running it myself.
>
> Regards,
> Alex
>
> On Thu, Dec 12, 2019 at 1:20 AM Steve Brasier  wrote:
>
>> Hi, I'm hoping someone can shed some light on the SchedMD-provided
>> example here https://github.com/SchedMD/slurm-gcp for an autoscaling
>> cluster on Google Cloud Platform (GCP).
>>
>> I understand that slurm autoscaling uses the power saving interface to
>> create/remove nodes, and the example suspend.py and resume.py scripts there
>> seem pretty clear and in line with the slurm docs. However I don't
>> understand why the additional slurm-gcp-sync.py script is required. It
>> seems to compare the states of nodes as seen by google compute and slurm
>> and then on the GCP side either start instances or shut them down, and on
>> the slurm side mark them as in RESUME or DOWN states. I don't see why this
>> is necessary though; my understanding from the slurm docs is that e.g. the
>> suspend script simply has to "power down" the nodes, and slurmctld will
>> then mark them as in power saving mode - marking nodes down would seem to
>> prevent jobs being scheduled on them, which isn't what we want. Similarly,
>> I would have thought the resume.py script could mark nodes as in RESUME
>> state itself, (once it's tested that the node is up and slurmd is running
>> etc).
>>
>> thanks for any help
>> Steve
>>
>