Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-28 Thread Pär Lundö
Hi all,
First off, thank you all for all of your quick replies and suggestions on how 
to solve this problem of mine.
With some additional help from Tina Friedrich, I did a test she proposed:

First step:  ”ssh -X ”.
Second step: ”ssh -X localhost”.
Third step: ”srun --x11 ”

(It should read a double ”-”, but the mail software on this device doesn't let me type that, sorry!)
This actually worked, and the X application was displayed on my local X server/display.

So to summarize, I had to SSH to the node and, once connected, SSH once more to localhost (i.e. the localhost on the node) and then start a ”srun” session. This did work: the X11 forwarding worked and the X application was sent back to my available X server/screen.
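For reference, the full sequence looks roughly like this (the node name ”node01” and the test client ”xterm” are just placeholders):

ssh -X node01        # first hop: log in to the compute node with X11 forwarding
ssh -X localhost     # second hop: X11-forwarded ssh to localhost on the node itself
srun --x11 xterm     # srun now forwards X11 back to the original display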

This isn't really the way I expected srun or X11 forwarding to work. I thought that I could run the srun command with X11 forwarding from an sbatch job-array script and get the X11 forwarding to my display. Yes, I know that having 4000 nodes would pose a problem with a lot of X11 forwarding being sent, but that will not be a problem for me with only a handful of nodes and utilizing e.g. Xvfb.
I’ll work more on this matter now that I know a way of how to get it to work.

Thank you!
Best regards,
Pär Lundö


From: "slurm-users" 
Sent: 28 feb. 2020 17:52
To: "wag...@itc.rwth-aachen.de" 
Cc: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Marcus,

You are correct, but it doesn't show anything regarding the X11 forwarding.

Thank you for your input!

Best regards,
Pär Lundö

From: "slurm-users" 
Sent: 28 feb. 2020 15:57
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Pär,

yes, you can use -v or e.g. -vvv together with srun. I'm not sure, though, if that 
shows anything X11-related, but you might try.
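For example, something along these lines (xterm is just an assumed test client):

srun -vvv --x11 xterm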


Best
Marcus

On 2/28/20 3:45 PM, Pär Lundö wrote:

Hi everyone

Thank you for your support.
I've made a few changes and done some further testing, but it has not solved my 
problem.
Regardless of the settings for the sshd, I can't get it to work via srun.
I am able to SSH to the node directly with the ”-X” argument and get the 
X11 forwarding to work.
Is there any debug output from srun to retrieve during the connection to the node?
(Surely the simple answer to that is yes! I'll try to locate that 
debug option, or can someone help with that?)

Best regards,
Pär Lundö



From: "slurm-users" 

Sent: 27 feb. 2020 12:29
To: "Slurm User Community List" 

Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Sean,

Thank you for your reply.
I will test it asap.

Best regards,
Pär Lundö
From: slurm-users 

 On Behalf Of Sean Crosby
Sent: den 27 februari 2020 10:26
To: Slurm User Community List 

Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

I remember that we had to add this to our /etc/ssh/sshd_config to get X11 to 
work with Slurm 19.05

X11UseLocalhost no

We added this to our login nodes (where users ssh to), and then restarted the 
ssh server. You would then need to log out and log back in with X11 forwarding 
again.
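A minimal sketch of that change (the restart command assumes a systemd-based login node; adjust for your distribution):

# in /etc/ssh/sshd_config on the login node
X11UseLocalhost no

# then restart the SSH daemon so it picks up the change
sudo systemctl restart sshd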

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:
Hi,

Thank you for your quick replies.
Please bear with me as I am a newbie regarding Slurm and Linux.
My hostname is not an FQDN and I'm running Slurm on a local node (slurmctld and 
slurmuser are the same) just to verify that the X11 forwarding is working 
(amongst other things).
The output of "xauth list" is as follows:
"
parlun-E6540/unix:  MIT-MAGIC-COOKIE-1  bcf263421d143e6ba3297dfe926c68b8
##7061726c756e2d4536353430#:  MIT-MAGIC-COOKIE-1  
bcf263421d143e6ba3297dfe926c68b8
"
I seem to be missing any ":0", i.e. to wich display or Xserver my user 
(parlun-E6540) uses?
I´ve tried adding the user "slurm", who is designated slurmuser, via "xhost 
+SI:localuser:slurm" but I still get the error stating that the magic cookie 
could not be retrieved.

You are correct, X11Parameters only holds one option "home_xauthority".
Does this setting automatically search the "/home/slurm" dir for the 
Xauthority file? (Given that the slurmuser is "slurm".)
Trying to locate where the xauthority file is stored, I do not find it in my 
home dir; instead it is in some other OS dir.

Searching for additional help, I came across a Debian package (xvfb) for virtual 
framebuffers, i.e. video/GUI output is sent to a virtual buffer instead of a physical 
display. Has anyone tried that?
I haven't installed it yet, but am curious whether that's a way to go.
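If it helps, a rough sketch of what using it might look like (xvfb-run ships with the Debian xvfb package; "my_gui_app" is a made-up application name):

xvfb-run -a ./my_gui_app          # runs the app against a temporary virtual X server
# or, managing the virtual display by hand:
Xvfb :99 -screen 0 1280x1024x24 &
DISPLAY=:99 ./my_gui_app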

Best regards,
Pär Lundö

-Original Messa

[slurm-users] Hybrid compiling options

2020-02-28 Thread Brian Andrus

All,

Wanted to reach out for input on how folks compile Slurm when they have a 
hybrid cluster.


Scenario:

you have 4 node types:

A) CPU only
B) GPU Only
C) CPU+IB
D) GPU+IB

So, you can compile slurm with/without IB support and/or with/without 
GPU support.

Including either option creates a dependency when packaging (RPM based).

So, do you compile different versions for the different node types, or 
install the dependent packages on nodes that have no use for them (nvidia in 
particular here)?


Generally, I have always added the superfluous packages, but wondered 
what the thoughts on that are.


Brian Andrus




Re: [slurm-users] Nodelist dependent environment setup ?

2020-02-28 Thread Sajid Ali
Hi Ole,

Thanks a lot for sharing the resource!

Our biggest concern is the case where a user asks for 2 nodes and one of
those is a Cascade Lake node and the other one is a Haswell node. In that
case, the environment has to provide modules that work on both, hence my
preference for reading the Slurm nodelist to set the appropriate MODULEPATH.
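A rough sketch of the kind of job-script logic I have in mind (the node-name prefixes and module-tree paths are made up for illustration):

#!/bin/bash
#SBATCH --nodes=2

# Walk the allocated nodes and drop to the lowest common architecture.
# The "has*" (Haswell) prefix is hypothetical; adapt to real node names.
arch=cascadelake
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
  case "$node" in
    has*) arch=haswell ;;   # any Haswell node forces the Haswell builds
  esac
done

# Prepend the matching (hypothetical) module tree and load as usual.
export MODULEPATH="/opt/sw/modules/$arch:$MODULEPATH"
module load mylibrary       # hypothetical module name

srun ./my_program           # hypothetical executable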

PS: I'm trying to do the same thing with spack!


-- 
Sajid Ali | PhD Candidate
Applied Physics
Northwestern University
s-sajid-ali.github.io


Re: [slurm-users] Nodelist dependent environment setup ?

2020-02-28 Thread Ole Holm Nielsen

On 28-02-2020 19:44, Sajid Ali wrote:
If I install multiple versions of a software library, each optimized for 
a different partition based on CPU architecture, how would I 
automatically load the version of software based on the nodes allocated 
to the job ?


Ideally I'd want to store the modules for each cpu arch at a different 
location and set the `MODULEPATH` at job startup to the lowest CPU-arch. 
(since our cluster only runs Intel CPU's, the software for lower arch 
would run on a higher arch).


Could someone point out how Slurm initializes the job environment at 
startup? Based on this, I'm hoping that it will be a relatively simple 
task to add a small script to determine the nodelist and prepend the 
MODULEPATH env var.


Alternatively, if someone could point out how they do this at their 
sites it would be useful as well.


This is how we NFS automount different module trees for different CPU 
architectures:


https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules#automounting-the-cpu-architecture-dependent-modules-directory

The advantage is that the automounter's map file uses a system variable 
$CPU_ARCH to mount the correct tree for each system.
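As a very rough illustration of that approach (mount point, map file, and NFS server below are made up; the real recipe is on the Wiki page above):

# /etc/auto.master entry (hypothetical mount point and map file)
/sw  /etc/auto.sw

# /etc/auto.sw -- $CPU_ARCH must be defined in the automounter's
# environment, e.g. via OPTIONS="-DCPU_ARCH=..." in /etc/sysconfig/autofs
modules  -fstype=nfs,ro  nfsserver:/export/modules/$CPU_ARCH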


The Wiki page has other possibly useful information, for example
https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules#setting-the-cpu-hardware-architecture

/Ole



[slurm-users] Nodelist dependent environment setup ?

2020-02-28 Thread Sajid Ali
Hi Slurm-developers/users,

If I install multiple versions of a software library, each optimized for a
different partition based on CPU architecture, how would I automatically
load the version of software based on the nodes allocated to the job ?

Ideally I'd want to store the modules for each cpu arch at a different
location and set the `MODULEPATH` at job startup to the lowest CPU-arch.
(since our cluster only runs Intel CPU's, the software for lower arch would
run on a higher arch).

Could someone point out how Slurm initializes the job environment at
startup? Based on this, I'm hoping that it will be a relatively simple task
to add a small script to determine the nodelist and prepend the MODULEPATH
env var.

Alternatively, if someone could point out how they do this at their sites
it would be useful as well.

Thanks in advance for the advice!

--
Sajid Ali | PhD Candidate
Applied Physics
Northwestern University
s-sajid-ali.github.io


[slurm-users] Question about determining pre-empted jobs

2020-02-28 Thread Jeffrey R. Lang
I need your help.

We have had a request to generate a report showing the number of pre-empted jobs 
by date. We used sacct to try to gather the data, but we only found a few jobs 
with the state "PREEMPTED".
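For concreteness, the sort of query we ran looks roughly like this (the date range and format fields are just an example):

sacct -a --state=PREEMPTED -S 2020-02-01 -E 2020-02-29 \
      --format=JobID,User,Partition,State,Start,End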

Scanning the slurmd logs, we find there are a lot of jobs that show as pre-empted.

What is the best way to gather or discover this data?

Thanks
Jeff


Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-28 Thread Pär Lundö
Hi Marcus,

You are correct, but it doesn't show anything regarding the X11 forwarding.

Thank you for your input!

Best regards,
Pär Lundö

From: "slurm-users" 
Sent: 28 feb. 2020 15:57
To: "slurm-users@lists.schedmd.com" 
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Pär,

yes, you can use -v or e.g. -vvv together with srun. I'm not sure, though, if that 
shows anything X11-related, but you might try.


Best
Marcus

On 2/28/20 3:45 PM, Pär Lundö wrote:

Hi everyone

Thank you for your support.
I've made a few changes and done some further testing, but it has not solved my 
problem.
Regardless of the settings for the sshd, I can't get it to work via srun.
I am able to SSH to the node directly with the ”-X” argument and get the 
X11 forwarding to work.
Is there any debug output from srun to retrieve during the connection to the node?
(Surely the simple answer to that is yes! I'll try to locate that 
debug option, or can someone help with that?)

Best regards,
Pär Lundö



From: "slurm-users" 

Sent: 27 feb. 2020 12:29
To: "Slurm User Community List" 

Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Sean,

Thank you for your reply.
I will test it asap.

Best regards,
Pär Lundö
From: slurm-users 

 On Behalf Of Sean Crosby
Sent: den 27 februari 2020 10:26
To: Slurm User Community List 

Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

I remember that we had to add this to our /etc/ssh/sshd_config to get X11 to 
work with Slurm 19.05

X11UseLocalhost no

We added this to our login nodes (where users ssh to), and then restarted the 
ssh server. You would then need to log out and log back in with X11 forwarding 
again.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:
Hi,

Thank you for your quick replies.
Please bear with me as I am a newbie regarding Slurm and Linux.
My hostname is not an FQDN and I'm running Slurm on a local node (slurmctld and 
slurmuser are the same) just to verify that the X11 forwarding is working 
(amongst other things).
The output of "xauth list" is as follows:
"
parlun-E6540/unix:  MIT-MAGIC-COOKIE-1  bcf263421d143e6ba3297dfe926c68b8
##7061726c756e2d4536353430#:  MIT-MAGIC-COOKIE-1  
bcf263421d143e6ba3297dfe926c68b8
"
I seem to be missing any ":0", i.e. to wich display or Xserver my user 
(parlun-E6540) uses?
I´ve tried adding the user "slurm", who is designated slurmuser, via "xhost 
+SI:localuser:slurm" but I still get the error stating that the magic cookie 
could not be retrieved.

You are correct, X11Parameters only holds one option "home_xauthority".
Does this setting automatically search the "/home/slurm" dir for the 
Xauthority file? (Given that the slurmuser is "slurm".)
Trying to locate where the xauthority file is stored, I do not find it in my 
home dir; instead it is in some other OS dir.

Searching for additional help, I came across a Debian package (xvfb) for virtual 
framebuffers, i.e. video/GUI output is sent to a virtual buffer instead of a physical 
display. Has anyone tried that?
I haven't installed it yet, but am curious whether that's a way to go.

Best regards,
Pär Lundö

-Original Message-
From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 On Behalf Of Ryan Novosielski
Sent: den 25 februari 2020 21:23
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I seem to remember there being a config option to specify rewriting the 
hostname as well. I thought it was part of X11Parameters, but I only see one 
option there:

https://slurm.schedmd.com/archive/slurm-19.05-latest/slurm.conf.html

On 2/25/20 10:55 AM, Tina Friedrich wrote:
> I remember having issues when I set up X forwarding that had to do
> with how the host names were set on the nodes. I had them set (CentOS
> default) to the fully qualified hostname, and that didn't work - with
> an error message very similar to what you're getting, if memory serves
> right. 'Fixed' it by setting the hostnames on all my cluster nodes to
> the short version.
>
> What does 'xauth list' give you on your nodes (and the machine you're
> coming from)?
>
> This is/was SLURM 18.08 though, not sure if that makes a difference.
>
> Tina
>
> On 25/02/2020 04:55, Pär Lundö wrote:
>> Hi,
>>
>> Thank you for your reply Patrick. I´ve tried that but I still get the
>> error stating that the magic cookie could not be retrieved.
>> Reading Tim´s answer, this bug should have been fixed in a release
>> following 18.08, but I´m using 19.05 thus it should have been taken
>> c

Re: [slurm-users] How to show state of CLOUD nodes

2020-02-28 Thread Carter, Allan
Thanks, that was very useful. The key takeaways for me are:

Set “PrivateData=cloud”. The documentation states that the default is that 
everything is public and that those options make things private, apparently 
except for this case, which allows regular users to see nodes that are powered 
down.

Set “ReturnToService=2”. One of the problems I’m having is nodes taking too 
long to boot up due to initialization I was doing. This option will bring the 
nodes up even after the resume timeout causes them to be marked down.

Use strigger to take a custom action. I will use this to alarm when nodes go 
down, etc.

Thanks for the help. I think it will solve the issues I’m having.

From: Kirill 'kkm' Katsnelson [mailto:k...@pobox.com]
Sent: Friday, February 28, 2020 5:56 AM
To: Slurm User Community List 
Cc: Carter, Allan 
Subject: Re: [slurm-users] How to show state of CLOUD nodes

I'm running clusters entirely in Google Cloud. I'm not sure I'm understanding 
the issue--do the nodes disappear from view entirely only when they fail to 
power up by ResumeTimeout? Failures of this kind are happening in GCE when 
resources are momentarily unavailable, but the nodes are still there, only 
shown as DOWN. FWIW, I'm currently using 19.05.4-1.

I have a trigger on the controller to catch and return these nodes back to 
POWER_SAVE. The offset of 20s lets all the moving parts settle; in any case, 
Slurm batches trigger runs internally, on a 15s schedule IIRC, so it's not 
precise. --flags=PERM makes the trigger permanent, so you need to install it 
once and for all:

strigger --set --down --flags=PERM --offset=20 --program=$script

and the $script points to the full path (on the controller) of the following 
script. I'm copying the Log function and the _logname gymnastics from a file 
which is dot-sourced by the main program in my setup, as it's part of a larger 
set of scripts; it's more complex than it has to be for your case, but I did 
not want to introduce a bug by hastily paring it down. You'll do that if you 
want.

  8<8<8<
#!/bin/bash

set -u

# Tidy up name for logging: '.../slurm_resume.sh' => 'slurm-resume'
_logname=$(basename "$0")
_logname=${_logname%%.*}
_logname=${_logname//_/-}

Log() {
  local level=$1; shift;
  [[ $level == *.* ]] || level=daemon.$level  # So we can use e.g. auth.notice.
  logger -p $level -t $_logname -- "$@"
}

reason=recovery

for n; do
  Log notice "Recovering failed node(s) '$n'"
  scontrol update nodename="$n" reason="$reason" state=DRAIN &&
  scontrol update nodename="$n" reason="$reason" state=POWER_DOWN ||
Log alert "The command 'scontrol update nodename=$n' failed." \
  "Is scontrol on PATH?"
done

exit 0
  8<8<8<

The sequence of DRAIN first, then POWER_DOWN, is magic left over from v18; see 
if POWER_DOWN alone does the trick. Or don't, as long as it works :)

Also make sure you have (some of) the following in slurm.conf, assuming EC2 
provides DNS name resolution--GCE does.

# Important for cloud: do not assume the nodes will retain their IP
# addresses, and do not cache name-to-IP mapping.
CommunicationParameters=NoAddrCache
SlurmctldParameters=cloud_dns,idle_on_node_suspend
PrivateData=cloud   # Always show cloud nodes.
ReturnToService=2   # When a DOWN node boots, it becomes available.

Hope this might help.

 -kkm

On Thu, Feb 27, 2020 at 4:11 PM Carter, Allan 
mailto:carta...@amazon.com>> wrote:
I’m setting up an EC2 SLURM cluster and when an instance doesn’t resume fast 
enough I get an error like:

node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and 
power_save

I keep running into issues where my cloud nodes do not show up in sinfo and I 
can’t display their information with scontrol. This makes it difficult to know 
which of my CLOUD nodes are available for scheduling and which are down for 
some reason and can’t be used. I haven’t figured out when slurm will show a 
cloud node and when it won’t, and this makes it pretty hard to manage the cluster.

Would I be better off just removing the CLOUD attribute on my EC2 nodes? What 
is the advantage of making them CLOUD nodes if it just makes it more difficult 
to manage the cluster?



Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-28 Thread Marcus Wagner

Hi Pär,

yes, you can use -v or e.g. -vvv together with srun. I'm not sure, though, 
if that shows anything X11-related, but you might try.



Best
Marcus

On 2/28/20 3:45 PM, Pär Lundö wrote:


Hi everyone

Thank you for your support.
I've made a few changes and done some further testing, but it has not 
solved my problem.

Regardless of the settings for the sshd, I can't get it to work via srun.
I am able to SSH to the node directly with the ”-X” argument and 
get the X11 forwarding to work.
Is there any debug output from srun to retrieve during the connection to 
the node?
(Surely the simple answer to that is yes! I'll try to locate that 
debug option, or can someone help with that?)


Best regards,
Pär Lundö



*From:* "slurm-users" 
*Sent:* 27 feb. 2020 12:29
*To:* "Slurm User Community List" 
*Subject:* Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Sean,

Thank you for your reply.

I will test it asap.

Best regards,

Pär Lundö

*From:*slurm-users  *On Behalf 
Of *Sean Crosby

*Sent:* den 27 februari 2020 10:26
*To:* Slurm User Community List 
*Subject:* Re: [slurm-users] Slurm 19.05 X11-forwarding

I remember that we had to add this to our /etc/ssh/sshd_config to get 
X11 to work with Slurm 19.05


X11UseLocalhost no

We added this to our login nodes (where users ssh to), and then 
restarted the ssh server. You would then need to log out and log back 
in with X11 forwarding again.


Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:


Hi,

Thank you for your quick replies.
Please bear with me as I am a newbie regarding Slurm and Linux.
My hostname is not an FQDN and I'm running Slurm on a local node
(slurmctld and slurmuser are the same) just to verify that the
X11 forwarding is working (amongst other things).
The output of "xauth list" is as follows:
"
parlun-E6540/unix:  MIT-MAGIC-COOKIE-1
bcf263421d143e6ba3297dfe926c68b8
##7061726c756e2d4536353430#:  MIT-MAGIC-COOKIE-1
bcf263421d143e6ba3297dfe926c68b8
"
I seem to be missing any ":0", i.e. to wich display or Xserver my
user (parlun-E6540) uses?
I´ve tried adding the user "slurm", who is designated slurmuser,
via "xhost +SI:localuser:slurm" but I still get the error stating
that the magic cookie could not be retrieved.

You are correct, X11Parameters only holds one option
"home_xauthority".
Does this setting automatically search in the "/home/slurm"-dir
for the Xauthority-file? (Given that the slurmuser is "slurm").
Trying to locate the storage where xauthoriy is present, I do not
find it my home-dir instead it is in some other OS-dir.

Searching for additional help, I came across a Debian package
(xvfb) for virtual framebuffers, i.e. video/GUI output is sent to a
virtual buffer instead of a physical display. Has anyone tried that?
I haven't installed it yet, but am curious whether that's a way to go.

Best regards,
Pär Lundö

-Original Message-
From: slurm-users mailto:slurm-users-boun...@lists.schedmd.com>> On Behalf Of Ryan
Novosielski
Sent: den 25 februari 2020 21:23
To: slurm-users@lists.schedmd.com

Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I seem to remember there being a config option to specify
rewriting the hostname as well. I thought it was part of
X11Parameters, but I only see one option there:

https://slurm.schedmd.com/archive/slurm-19.05-latest/slurm.conf.html

On 2/25/20 10:55 AM, Tina Friedrich wrote:
> I remember having issues when I set up X forwarding that had to do
> with how the host names were set on the nodes. I had them set
(CentOS
> default) to the fully qualified hostname, and that didn't work -
with
> an error message very similar to what you're getting, if memory
serves
> right. 'Fixed' it by setting the hostnames on all my cluster
nodes to
> the short version.
>
> What does 'xauth list' give you on your nodes (and the machine
you're
> coming from)?
>
> This is/was SLURM 18.08 though, not sure if that makes a difference.
>
> Tina
>
> On 25/02/2020 04:55, Pär Lundö wrote:
>> Hi,
>>
>> Thank you for your reply Patrick. I´ve tried that but I still
get the
>> error stating that the magic cookie could not be retrieved.
>> Reading Tim´s answer, this bug should have been fixed in a release
>> following 18.08, but I´m using 19.05 thus it should have been
taken
>> care of?
>>
>> Best regards, Pär Lundö
>>
>> -Original Message- From: slurm-users
>> mailto:slurm-users-boun

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-28 Thread Pär Lundö

Hi everyone

Thank you for your support.
I've made a few changes and done some further testing, but it has not solved my 
problem.
Regardless of the settings for the sshd, I can't get it to work via srun.
I am able to SSH to the node directly with the ”-X” argument and get the 
X11 forwarding to work.
Is there any debug output from srun to retrieve during the connection to the node?
(Surely the simple answer to that is yes! I'll try to locate that 
debug option, or can someone help with that?)

Best regards,
Pär Lundö



From: "slurm-users" 
Sent: 27 feb. 2020 12:29
To: "Slurm User Community List" 
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

Hi Sean,

Thank you for your reply.
I will test it asap.

Best regards,
Pär Lundö
From: slurm-users  On Behalf Of Sean 
Crosby
Sent: den 27 februari 2020 10:26
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

I remember that we had to add this to our /etc/ssh/sshd_config to get X11 to 
work with Slurm 19.05

X11UseLocalhost no

We added this to our login nodes (where users ssh to), and then restarted the 
ssh server. You would then need to log out and log back in with X11 forwarding 
again.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:
Hi,

Thank you for your quick replies.
Please bear with me as I am a newbie regarding Slurm and Linux.
My hostname is not an FQDN and I'm running Slurm on a local node (slurmctld and 
slurmuser are the same) just to verify that the X11 forwarding is working 
(amongst other things).
The output of "xauth list" is as follows:
"
parlun-E6540/unix:  MIT-MAGIC-COOKIE-1  bcf263421d143e6ba3297dfe926c68b8
##7061726c756e2d4536353430#:  MIT-MAGIC-COOKIE-1  
bcf263421d143e6ba3297dfe926c68b8
"
I seem to be missing any ":0", i.e. to wich display or Xserver my user 
(parlun-E6540) uses?
I´ve tried adding the user "slurm", who is designated slurmuser, via "xhost 
+SI:localuser:slurm" but I still get the error stating that the magic cookie 
could not be retrieved.

You are correct, X11Parameters only holds one option "home_xauthority".
Does this setting automatically search the "/home/slurm" dir for the 
Xauthority file? (Given that the slurmuser is "slurm".)
Trying to locate where the xauthority file is stored, I do not find it in my 
home dir; instead it is in some other OS dir.

Searching for additional help, I came across a Debian package (xvfb) for virtual 
framebuffers, i.e. video/GUI output is sent to a virtual buffer instead of a physical 
display. Has anyone tried that?
I haven't installed it yet, but am curious whether that's a way to go.

Best regards,
Pär Lundö

-Original Message-
From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 On Behalf Of Ryan Novosielski
Sent: den 25 februari 2020 21:23
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I seem to remember there being a config option to specify rewriting the 
hostname as well. I thought it was part of X11Parameters, but I only see one 
option there:

https://slurm.schedmd.com/archive/slurm-19.05-latest/slurm.conf.html

On 2/25/20 10:55 AM, Tina Friedrich wrote:
> I remember having issues when I set up X forwarding that had to do
> with how the host names were set on the nodes. I had them set (CentOS
> default) to the fully qualified hostname, and that didn't work - with
> an error message very similar to what you're getting, if memory serves
> right. 'Fixed' it by setting the hostnames on all my cluster nodes to
> the short version.
>
> What does 'xauth list' give you on your nodes (and the machine you're
> coming from)?
>
> This is/was SLURM 18.08 though, not sure if that makes a difference.
>
> Tina
>
> On 25/02/2020 04:55, Pär Lundö wrote:
>> Hi,
>>
>> Thank you for your reply Patrick. I´ve tried that but I still get the
>> error stating that the magic cookie could not be retrieved.
>> Reading Tim´s answer, this bug should have been fixed in a release
>> following 18.08, but I´m using 19.05 thus it should have been taken
>> care of?
>>
>> Best regards, Pär Lundö
>>
>> -Original Message- From: slurm-users
>> mailto:slurm-users-boun...@lists.schedmd.com>>
>>  On Behalf Of Patrick Goetz
>> Sent: den 24 februari 2020 21:38 To:
>> slurm-users@lists.schedmd.com Subject: 
>> Re: [slurm-users] Slurm
>> 19.05 X11-forwarding
>>
>> This bug report appears to address the issue you're seeing:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=5868
>>
>>
>>
>> On 2/24/20 4:46 AM, Pär Lundö wrote:
>>> Dear all,
>>>
>>> I started testing and evaluating Slurm roughly a year ago and used
>>> it succesfully with MPI-programs. I have now identified that I need
>

[slurm-users] slurm 18.08.3 on CentOS 6.18: error: _slurm_cgroup_destroy

2020-02-28 Thread AMU

Hello,
on an old CentOS 6.10 machine I've installed Slurm 18.08.3 from 
sources and tried to set up a simple configuration (attached 
slurm.conf).
After starting slurmctld and slurmd, sinfo shows everything okay, but at 
the first submission with sbatch I got errors and the node becomes "drain":
[2020-02-28T14:44:57.883] [2.batch] error: _slurm_cgroup_destroy: Unable 
to move pid 10322 to root cgroup
[2020-02-28T14:44:57.883] [2.batch] error: proctrack_g_create: No such 
file or directory
[2020-02-28T14:44:57.883] [2.batch] error: job_manager exiting 
abnormally, rc = 4014
[2020-02-28T14:44:57.883] [2.batch] sending 
REQUEST_COMPLETE_BATCH_SCRIPT, error:4014 status 0


I'm not very confident with cgroups, and don't understand where the 
problem is.

I read in the archives that people have had success with CentOS 6 and Slurm 18.
Can anybody help?

Thanks in advance,

Gérard

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=tramel
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm/accounting
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/SlurmdLogFile.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=tramel CPUs=64 Boards=2 SocketsPerBoard=4 CoresPerSocket=8 
ThreadsPerCore=1 RealMemory=1033892
PartitionName=hipe Nodes=tramel Default=YES MaxTime=INFINITE State=UP




Re: [slurm-users] Problem with configuration CPU/GPU partitions

2020-02-28 Thread Renfro, Michael
When I made similar queues, and only wanted my GPU jobs to use up to 8 cores 
per GPU, I set Cores=0-7 and 8-15 for each of the two GPU devices in gres.conf. 
Have you tried reducing those values to Cores=0 and Cores=20?
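That is, something along these lines in gres.conf (device files and GPU type taken from your original post):

Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
Name=gpu Type=v100 File=/dev/nvidia1 Cores=20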

> On Feb 27, 2020, at 9:51 PM, Pavel Vashchenkov  wrote:
> 
> External Email Warning
> 
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> 
> 
> 
> Hello,
> 
> I have a hybrid cluster with 2 GPUs and 2 20-cores CPUs on each node.
> 
> I created two partitions: - "cpu" for CPU-only jobs which are allowed to
> allocate up to 38 cores per node - "gpu" for GPU-only jobs which are
> allowed to allocate up to 2 GPUs and 2 CPU cores.
> 
> Respective sections in slurm.conf:
> 
> # NODES
> NodeName=node[01-06] Sockets=2 CoresPerSocket=20 ThreadsPerCore=1
> Gres=gpu:2(S:0-1) RealMemory=257433
> 
> # PARTITIONS
> PartitionName=cpu Default=YES Nodes=node[01-06] MaxNodes=6 MinNodes=0
> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=38
> PartitionName=gpu Nodes=node[01-06] MaxNodes=6 MinNodes=0
> DefaultTime=04:00:00 MaxTime=14-00:00:00 MaxCPUsPerNode=2
> 
> and in gres.conf:
> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
> Name=gpu Type=v100 File=/dev/nvidia1 Cores=20-39
> 
> However, it seems to be not working properly. If I first submit GPU job
> using all available in "gpu" partition resources and then CPU job
> allocating the rest of the CPU cores (i.e. 38 cores per node) in "cpu"
> partition, it works perfectly fine. Both jobs start running. But if I
> change the submission order and start CPU-job before GPU-job,  the "cpu"
> job starts running while the "gpu" job stays in queue with PENDING
> status and RESOURCES reason.
> 
> My first guess was that "cpu" job allocates cores assigned to respective
> GPUs in gres.conf and prevents the GPU devices from running. However, it
> seems not to be the case, because 37 cores job per node instead of 38
> solves the problem.
> 
> Another thought was it has something to do with the specialized cores
> reservation, but I tried to change CoreSpecCount option without success.
> 
> So, any ideas how to fix this behavior and where should look?
> 
> Thanks!
> 




Re: [slurm-users] How to show state of CLOUD nodes

2020-02-28 Thread Kirill 'kkm' Katsnelson
I'm running clusters entirely in Google Cloud. I'm not sure I'm
understanding the issue--do the nodes disappear from view entirely only
when they fail to power up by ResumeTimeout? Failures of this kind are
happening in GCE when resources are momentarily unavailable, but the nodes
are still there, only shown as DOWN. FWIW, I'm currently using 19.05.4-1.

I have a trigger on the controller to catch and return these nodes back to
POWER_SAVE. The offset of 20s lets all the moving parts settle; in any case,
Slurm batches trigger runs internally, on a 15s schedule IIRC, so it's not
precise. --flags=PERM makes the trigger permanent, so you need to install
it once and for all:

strigger --set --down --flags=PERM --offset=20 --program=$script

and the $script points to the full path (on the controller) of the
following script. I'm copying the Log function and the _logname gymnastics
from a file which is dot-sourced by the main program in my setup, as it's
part of a larger set of scripts; it's more complex than it has to be for
your case, but I did not want to introduce a bug by hastily paring it down.
You'll do that if you want.

  8<8<8<
#!/bin/bash

set -u

# Tidy up name for logging: '.../slurm_resume.sh' => 'slurm-resume'
_logname=$(basename "$0")
_logname=${_logname%%.*}
_logname=${_logname//_/-}

Log() {
  local level=$1; shift;
  [[ $level == *.* ]] || level=daemon.$level  # So we can use e.g. auth.notice.
  logger -p $level -t $_logname -- "$@"
}

reason=recovery

for n; do
  Log notice "Recovering failed node(s) '$n'"
  scontrol update nodename="$n" reason="$reason" state=DRAIN &&
  scontrol update nodename="$n" reason="$reason" state=POWER_DOWN ||
Log alert "The command 'scontrol update nodename=$n' failed." \
  "Is scontrol on PATH?"
done

exit 0
  8<8<8<

The sequence of DRAIN first, then POWER_DOWN, is magic left over from v18;
see if POWER_DOWN alone does the trick. Or don't, as long as it works :)

Also make sure you have (some of) the following in slurm.conf, assuming EC2
provides DNS name resolution--GCE does.

# Important for cloud: do not assume the nodes will retain their IP
# addresses, and do not cache name-to-IP mapping.
CommunicationParameters=NoAddrCache
SlurmctldParameters=cloud_dns,idle_on_node_suspend
PrivateData=cloud   # Always show cloud nodes.
ReturnToService=2   # When a DOWN node boots, it becomes available.

Hope this might help.

 -kkm

On Thu, Feb 27, 2020 at 4:11 PM Carter, Allan  wrote:

> I’m setting up an EC2 SLURM cluster and when an instance doesn’t resume
> fast enough I get an error like:
>
>
>
> node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and
> power_save
>
>
>
> I keep running into issues where my cloud nodes do not show up in sinfo
> and I can’t display their information with scontrol. This makes it
> difficult to know which of my CLOUD nodes are available for scheduling and
> which are down for some reason and can’t be used. I haven’t figured out
> when slurm will show a cloud node and when it won’t, and this makes it pretty
> hard to manage the cluster.
>
>
>
> Would I be better off just removing the CLOUD attribute on my EC2 nodes?
> What is the advantage of making them CLOUD nodes if it just makes it more
> difficult to manage the cluster?
>
>
>


Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> You may use the (undocumented) format=... option to select only the

A while ago, after meticulous study of the man page, I discovered that
the format option is not actually undocumented, it is just very well
hidden. :) All that "man sacctmgr" says about it is

GLOBAL FORMAT OPTION
   When using the format option for listing various fields you can
   put a %NUMBER afterwards to specify how many characters should be 
printed.

   e.g. format=name%30 will print 30 characters of field name right
   justified.  A -30 will print 30  characters left justified.

(in addition to using it in a couple of examples). :)
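For instance, combining the width specifier with the association listing discussed elsewhere in this thread (the widths are arbitrary):

sacctmgr show assoc format=user%-20,account%-20,qos%40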

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Ole Holm Nielsen

sacctmgr show association

You may use the (undocumented) format=... option to select only the 
columns you want, for example:


sacctmgr show assoc format=user,account,qos

Usage of the format option is only given in the Examples section of the 
sacctmgr page https://slurm.schedmd.com/sacctmgr.html.  See also these 
bugzilla cases:

https://bugs.schedmd.com/show_bug.cgi?id=7668
https://bugs.schedmd.com/show_bug.cgi?id=6790#c3

/Ole

On 2/28/20 9:38 AM, Matthias Krawutschke wrote:

Dear Slurm-User,

I have a simple question about User and Account – Management on SLURM.

How can I find/print out which user is associated with which account?

I can list accounts and users, but not in combination. I have not found this 
in the documentation.




Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Marcus Boden
Hi,

you're looking for 'associations' between users, accounts and their limits.
Try `sacctmgr show assoc [tree]`

Best,
Marcus

On 20-02-28 09:38, Matthias Krawutschke wrote:
> Dear Slurm-User,
> 
>  
> 
> I have a simple question about User and Account – Management on SLURM.
> 
>  
> 
> How can I find/print out which user is associated with which account?
> 
>  
> 
> I can list accounts and users, but not in combination. I have not found this in
> the documentation.
> 
>  
> 
> Best regards….
> 
>  
> 
>  
> 
>  
> 
> Matthias Krawutschke, Dipl. Inf.
> 
>  
> 
> Universität Potsdam
> ZIM - Zentrum für Informationstechnologie und Medienmanagement
> 
> Team Infrastruktur Server und Storage 
> Arbeitsbereich: High-Performance-Computing on Cluster - Environment
> 
>  
> 
> Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
> Tel: +49 331 977-, Fax: +49 331 977-1750
> 
>  
> 
> Internet:  
> https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html
> 
>  
> 
>  
> 

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.:   +49 (0)551 201-2191
E-Mail: mbo...@gwdg.de
---
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL:http://www.gwdg.de
E-Mail: g...@gwdg.de
Tel.:   +49 (0)551 201-1510
Fax:+49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender:
Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
---




[slurm-users] Question about SacctMgr....

2020-02-28 Thread Matthias Krawutschke
Dear Slurm-User,

 

I have a simple question about User and Account – Management on SLURM.

 

How can I find/print out which user is associated with which account?

 

I can list accounts and users, but not in combination. I have not found this in
the documentation.

 

Best regards….

 

 

 

Matthias Krawutschke, Dipl. Inf.

 

Universität Potsdam
ZIM - Zentrum für Informationstechnologie und Medienmanagement

Team Infrastruktur Server und Storage 
Arbeitsbereich: High-Performance-Computing on Cluster - Environment

 

Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
Tel: +49 331 977-, Fax: +49 331 977-1750

 

Internet:  
https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html