Re: [slurm-users] OpenMPI interactive change in behavior?

2021-05-27 Thread Chris Samuel
On Monday, 26 April 2021 2:12:41 PM PDT John DeSantis wrote:

> Furthermore,
> searching the mailing list suggests that the appropriate method is to use
> `salloc` first, despite version 17.11.9 not needing `salloc` for an
> "interactive" sessions.

Before 20.11, using salloc for this meant setting SallocDefaultCommand to an 
srun that pushed the session over onto a compute node, and then setting a 
bunch of extra options to stop that wrapper srun from consuming resources 
that subsequent sruns would need.  That was especially annoying when dealing 
with GPUs, as you then had to "srun" anything that needed to access them 
(when you used cgroups to control access).
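
For reference, that pre-20.11 setup looked roughly like this in slurm.conf 
(the exact flags varied by site and Slurm version, so treat this as a sketch 
only; --gres=gpu:0 is the kind of thing you had to add so the wrapper srun 
didn't grab the GPUs):

  SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL"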

With 20.11 there's a new "use_interactive_step" option that uses similar 
trickery, except Slurm handles not consuming those resources for you and 
handles GPUs correctly.

So for your 20.11 system I would recommend giving salloc and the 
"use_interactive_step" option a go and seeing if it helps.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






[slurm-users] DMTCP or MANA with Slurm?

2021-05-27 Thread Prentice Bisbal
Is anyone currently using DMTCP or MANA with Slurm? I'm trying to set DMTCP 
up right now, and I'm having issues with it. Googling for answers to my 
problems, all I find are other people asking the same questions on the 
dmtcp-forum mailing list without getting any answers. There was a commit 
to the DMTCP GitHub on Mar 1, but there hasn't been an official release 
since August 14, 2019.


And MANA, which is based on DMTCP, doesn't seem to have had any activity 
in over 2 years.


Given the lack of traffic on the mailing list and lack of releases, I'm 
beginning to think that both of these projects are all but abandoned.


I know NERSC gave a talk about using MANA on their systems just a couple 
of weeks ago. I just watched a recording of it on YouTube yesterday.


--
Prentice




Re: [slurm-users] [External] Re: pam_slurm_adopt not working for all users

2021-05-27 Thread Prentice Bisbal

On 5/27/21 2:49 AM, Ole Holm Nielsen wrote:

Hi Loris,

On 5/27/21 8:19 AM, Loris Bennett wrote:

Regarding keys vs. host-based SSH, I see that host-based would be more
elegant, but would involve more configuration.  What exactly are the
simplification gains you see? I just have a single cluster and naively I
would think dropping a script into /etc/profile.d on the login node
would be less work than re-configuring SSH for the login node and
multiple compute node images.


IMHO, it's really simple to set up host-based SSH authentication:
https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication

This is more secure on Linux clusters, and you don't need to configure 
users' SSH keys, so it requires less configuration for the sysadmin in 
the long run.


What makes this more secure?

--
Prentice




Re: [slurm-users] [External] Re: pam_slurm_adopt not working for all users

2021-05-27 Thread Prentice Bisbal

Loris,

Your analogy is incorrect, because Slurm doesn't use SSH to launch jobs; 
it uses its own communication protocol, which uses MUNGE for 
authentication. Some schedulers used to use ssh to launch jobs, but most 
have moved to using their own communications protocol outside of SSH. 
It's possible that Slurm used SSH in the early days, too. I wouldn't 
know. I've only been using Slurm for the past 5 years.


In those cases, you usually needed host-based SSH so that the scheduler 
daemon could launch jobs on the compute nodes. In that situation, you 
would be able to ssh from one node to another without per-user ssh keys, 
since they'd already be set up on a per-host basis. Perhaps that's what 
you are remembering.


Prentice

On 5/25/21 8:09 AM, Loris Bennett wrote:

Hi everyone,

Thanks for all the replies.

I think my main problem is that I expect logging in to a node with a job
to work with pam_slurm_adopt but without any SSH keys.  My assumption
was that MUNGE takes care of the authentication, since users' jobs start
on nodes without the need for keys.

Can someone confirm that this expectation is wrong and, if possible, why
the analogy with jobs is incorrect?

I have a vague memory that this used to work on our old cluster with an
older version of Slurm, but I could be thinking of a time before we set
up pam_slurm_adopt.

Cheers,

Loris
   


Brian Andrus  writes:


Oh, you could also use the ssh-agent to manage the keys, then use 'ssh-add
~/.ssh/id_rsa' to type the passphrase once for your whole session (from that
system).

Brian Andrus


On 5/21/2021 5:53 AM, Loris Bennett wrote:

Hi,

We have set up pam_slurm_adopt using the official Slurm documentation
and Ole's information on the subject.  It works for a user who has SSH
keys set up, albeit the passphrase is needed:

$ salloc --partition=gpu --gres=gpu:1 --qos=hiprio --ntasks=1 
--time=00:30:00 --mem=100
salloc: Granted job allocation 7202461
salloc: Waiting for resource configuration
salloc: Nodes g003 are ready for job

$ ssh g003
Warning: Permanently added 'g003' (ECDSA) to the list of known hosts.
Enter passphrase for key '/home/loris/.ssh/id_rsa':
Last login: Wed May  5 08:50:00 2021 from login.curta.zedat.fu-berlin.de

$ ssh g004
Warning: Permanently added 'g004' (ECDSA) to the list of known hosts.
Enter passphrase for key '/home/loris/.ssh/id_rsa':
Access denied: user loris (uid=182317) has no active jobs on this node.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

If SSH keys are not set up, then the user is asked for a password:

$ squeue --me
  JOBID PARTITION     NAME     USER ST     TIME NODES NODELIST(REASON)
7201647      main test_job nokeylee  R  3:45:24     1 c005
7201646      main test_job nokeylee  R  3:46:09     1 c005
$ ssh c005
Warning: Permanently added 'c005' (ECDSA) to the list of known hosts.
nokeylee@c005's password:

My assumption was that a user should be able to log into a node on which
that person has a running job without any further ado, i.e. without the
necessity to set up anything else or to enter any credentials.

Is this assumption correct?

If so, how can I best debug what I have done wrong?

Cheers,

Loris





Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Lloyd Brown

I mistyped that.  "they CAN'T get into the login nodes using SSH keys"

On 5/27/21 10:08 AM, Lloyd Brown wrote:

they get into the login nodes using SSH keys


--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://marylou.byu.edu




Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Lloyd Brown
While that's absolutely a significant issue, here's how we solved it, 
despite still using user keys. This basically assures that while people 
can SSH around with keys within our cluster, they get into the login 
nodes using SSH keys.  Combine that with the required enrollment in 2FA, 
and I think we're doing decently well.


Network routing rules and switch ACLs prevent users from getting into 
the non-login nodes from outside the cluster.



(excerpt from sshd_config on login nodes only - It's much simpler on 
non-login nodes):




# default behavior - disallow PubkeyAuthentication
PubkeyAuthentication no

# default behavior - force people to the "you must enroll in 2FA"
# message, and then exit
ForceCommand /usr/local/bin/2fa_notice.sh

# All users enrolled in 2FA are part of the twofactusers group
Match Group twofactusers
    ForceCommand none

# Allow PubkeyAuthentication for subnets that are internal to the cluster
Match Address ListOfClusterInternalSubnets
    PubkeyAuthentication yes
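
If it's useful to anyone replicating this, sshd's test mode is a handy way to 
check how those Match blocks resolve for a given connection (the user and 
address below are just placeholders):

  sshd -T -C user=alice,host=node001,addr=10.1.2.3 | grep -iE 'pubkeyauthentication|forcecommand'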


Lloyd


On 5/27/21 9:27 AM, Michael Jennings wrote:


As far as abuse of keys goes:  What's stopping your user from taking
that private key you created for them (which is, as you recall,
*unencrypted*) outside of your cluster to another host somewhere else
on campus.  Maybe something that has tons of untrusted folks with
root.  Then any of those folks can SSH to your cluster as that user.


--
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://marylou.byu.edu




Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Michael Jennings

On Thursday, 27 May 2021, at 08:19:14 (+0200),
Loris Bennett wrote:


Thanks for the detailed explanations.  I was obviously completely
confused about what MUNGE does.  Would it be possible to say, in very
hand-waving terms, that MUNGE performs a similar role for the access of
processes to nodes as SSH does for the access of users to nodes?


If you replace the word "processes" with the word "jobs," you've got
it. :-)

MUNGE is really just intended to be a simple, lightweight solution to
allow for creating a single, global "credential domain" among all the
hosts in an HPC cluster using a single shared secret.  Without going
into too much detail with the crypto stuff, it basically allows a
trusted local entity to cryptographically prove to another that
they're both part of the same trust/cred domain; having established
this, they know they can trust each other to provide and/or validate
credentials between hosts.
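
A concrete way to see that mutual validation in action is the usual MUNGE 
smoke test (the node name here is hypothetical):

  munge -n | unmunge              # generate a credential and validate it locally
  munge -n | ssh node001 unmunge  # it should also validate on any host sharing the key

If the second command succeeds, both hosts are in the same credential domain.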

But I want to emphasize the "single shared secret" part.  That means
there's a single trust domain.  Think "root of trust" with nothing but
the root of trust.  So you can authenticate a single group of hosts to
all the rest of the group such that all are equals, but that's it.
There's no additional facility for authenticating different roles or
anything like that.  Either you have the same shared secret or you
don't; nothing else is possible.


Regarding keys vs. host-based SSH, I see that host-based would be more
elegant, but would involve more configuration.  What exactly are the
simplification gains you see? I just have a single cluster and naively I
would think dropping a script into /etc/profile.d on the login node
would be less work than re-configuring SSH for the login node and
multiple compute node images.


I like to think of it as "one and done."  At least in our case at
LANL, and at LBNL previously, all nodes of the same type/group boot
the same VNFS image.  As long as I don't need to cryptographically
differentiate among, say, compute nodes, I only have to set up a
single set of credentials for all the hosts, and I'm done.

It also saves overall support time in my experience.  By taking the
responsibility for inter-machine trust myself at the system level, I
don't have to worry about (1) modifying a user's SSH config without
their knowledge, (2) running the risk of them messing with their
config and breaking it, or (3) any user support/services calls about
"why can't I do any of the things on the stuff?!"  :-)

It is totally a personal/team choice, but I'll be honest:  Once I
"discovered" host-based authentication and all the headaches it saved
our sysadmin and consulting teams, I was kicking myself for having
done it the other way for so long! :-D


Regarding AuthorizedKeysCommand, I don't think we can use that, because
users don't necessarily have existing SSH keys.  What abuse scenarios
were you thinking of in connection with in-homedir key pairs?


Users don't have to have existing keys for it to work; the command you
specify can easily create a key pair, drop the private key, and output
the public key.  Or even simpler, you can specify a value for
"AuthorizedKeysFile" that points to a directory users can't write to,
and store a key pair for each user in that location.  Lots of ways to
do it.
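
For the simpler of those two, a sketch of what it might look like in 
sshd_config (the path is just an example; %u expands to the username):

  AuthorizedKeysFile /etc/ssh/authorized_keys/%u

with root owning that directory and one public key file per user dropped 
into it.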

But if I'm being frank about it, if I had my druthers, we'd be using
certificates for authentication, not files.  The advantages are, in my
very humble opinion, well worth a little extra setup time!

As far as abuse of keys goes:  What's stopping your user from taking
that private key you created for them (which is, as you recall,
*unencrypted*) outside of your cluster to another host somewhere else
on campus.  Maybe something that has tons of untrusted folks with
root.  Then any of those folks can SSH to your cluster as that user.

Credential theft is a *huge* problem in HPC across the world, so I
always recommend that sysadmins think of it as Public Enemy #1!  The
more direct and permanent control you have over user credentials, the
better. :-)


Would it be correct to say that, if one were daft enough, one could
build some sort of terminal server on top of MUNGE without using SSH,
but which could then replicate basic SSH behaviour?


No; that would only provide a method to authenticate servers at best.
You can't authenticate users for the reasons I noted above.  Single
shared key, single trust domain.


Your explanation is very clear, but it still seems like quite a few
steps with various gotchas, like the fact that, as I understand it,
shosts.equiv has to contain all the possible ways a host might be
addressed (short name, long name, IP).


You are correct, though that's easy to automate with a teensy weensy
shell script.  But yes, there's more up-front configuration.  Again,
though, I truly believe it saves admin time in the long run (not to
mention user support staff time and user pain).  But again, that's a
personal or team choice.
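
As a sketch of the sort of teensy script I mean (the domain name is 
hypothetical, and this assumes the Slurm client tools are on the box that 
generates the file):

  for h in $(sinfo -N -h -o %n | sort -u); do
      echo "$h"                              # short name
      echo "$h.cluster.example.org"          # fully-qualified name
      getent hosts "$h" | awk '{print $1}'   # IP address
  done > /etc/ssh/shosts.equiv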

I'm not sure if I'm clearing things up or just 

Re: [slurm-users] Building SLURM with X11 support

2021-05-27 Thread Marcus Boden

Hi Thekla,

it has been built in by default for some time now. You need to activate it by 
adding

PrologFlags=X11
to your slurm.conf (see here: 
https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags)
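
Once that's in slurm.conf and slurmctld/slurmd have been restarted, X11 is 
requested per step; assuming you reached the login node with ssh -X (or -Y), 
something like

  srun -n1 --x11 xclock

should pop up a window (xclock is just a stand-in for whatever X client you 
actually want to run).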


Best,
Marcus

On 27.05.21 14:07, Thekla Loizou wrote:

Dear all,

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which 
flags and packages are required?


Regards,

Thekla




--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de

Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Sekretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-





Re: [slurm-users] Building SLURM with X11 support

2021-05-27 Thread Ole Holm Nielsen

On 5/27/21 2:07 PM, Thekla Loizou wrote:

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which 
flags and packages are required?


What is your OS?  Do you have X11 installed?

Did you install all Slurm prerequisites?  For CentOS 7 it is:

yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel 
numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel 
rrdtool-devel ncurses-devel gtk2-devel libssh2-devel libibmad libibumad 
perl-Switch perl-ExtUtils-MakeMaker


see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites


I hope this helps.

/Ole



[slurm-users] Building SLURM with X11 support

2021-05-27 Thread Thekla Loizou

Dear all,

I am trying to use X11 forwarding in SLURM with no success.

We are installing SLURM using RPMs that we generate with the command 
"rpmbuild -ta slurm*.tar.bz2" as per the documentation.


I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which 
flags and packages are required?


Regards,

Thekla




Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Loris Bennett
Hi Ole,

Ole Holm Nielsen  writes:

> Hi Loris,
>
> On 5/27/21 8:19 AM, Loris Bennett wrote:
>> Regarding keys vs. host-based SSH, I see that host-based would be more
>> elegant, but would involve more configuration.  What exactly are the
>> simplification gains you see? I just have a single cluster and naively I
>> would think dropping a script into /etc/profile.d on the login node
>> would be less work than re-configuring SSH for the login node and
>> multiple compute node images.
>
> IMHO, it's really simple to set up host-based SSH authentication:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication

Your explanation is very clear, but it still seems like quite a few
steps with various gotchas, like the fact that, as I understand it,
shosts.equiv has to contain all the possible ways a host might be
addressed (short name, long name, IP).

> This is more secure on Linux clusters, and you don't need to configure users'
> SSH keys, so it requires less configuration for the sysadmin in the long run.

It is not clear to me what the security advantage is, and setting up the
keys is just one script in /etc/profile.d.  Regarding the long term, the
keys which were set up on our old cluster were just migrated to the new
cluster and still work, so it is also a one-time thing.

I assume I must be missing something.

Cheers,

Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Loris Bennett
Ward Poelmans  writes:

> On 27/05/2021 08:19, Loris Bennett wrote:
>> Thanks for the detailed explanations.  I was obviously completely
>> confused about what MUNGE does.  Would it be possible to say, in very
>> hand-waving terms, that MUNGE performs a similar role for the access of
>> processes to nodes as SSH does for the access of users to nodes?
>
> A tiny bit yes. Munge allows you to authenticate users between servers
> (like a unix socket does within a single machine):
> https://github.com/dun/munge/wiki/Man-7-munge

OK, thanks for the information.  I had already read the man page for
MUNGE, but to me it doesn't make it explicitly clear that MUNGE doesn't,
out of the box, offer a way to do something like SSH.

Would it be correct to say that, if one were daft enough, one could
build some sort of terminal server on top of MUNGE without using SSH,
but which could then replicate basic SSH behaviour?

Cheers,

Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Ward Poelmans
On 27/05/2021 08:19, Loris Bennett wrote:
> Thanks for the detailed explanations.  I was obviously completely
> confused about what MUNGE does.  Would it be possible to say, in very
> hand-waving terms, that MUNGE performs a similar role for the access of
> processes to nodes as SSH does for the access of users to nodes?

A tiny bit yes. Munge allows you to authenticate users between servers
(like a unix socket does within a single machine):
https://github.com/dun/munge/wiki/Man-7-munge


Ward



Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Ole Holm Nielsen

Hi Loris,

On 5/27/21 8:19 AM, Loris Bennett wrote:

Regarding keys vs. host-based SSH, I see that host-based would be more
elegant, but would involve more configuration.  What exactly are the
simplification gains you see? I just have a single cluster and naively I
would think dropping a script into /etc/profile.d on the login node
would be less work than re-configuring SSH for the login node and
multiple compute node images.


IMHO, it's really simple to set up host-based SSH authentication:
https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication

This is more secure on Linux clusters, and you don't need to configure 
users' SSH keys, so it requires less configuration for the sysadmin in the 
long run.


/Ole



Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-27 Thread Loris Bennett
Hi Michael,

Michael Jennings  writes:

> On Tuesday, 25 May 2021, at 14:09:54 (+0200),
> Loris Bennett wrote:
>
>> I think my main problem is that I expect logging in to a node with a job
>> to work with pam_slurm_adopt but without any SSH keys.  My assumption
>> was that MUNGE takes care of the authentication, since users' jobs start
>> on nodes without the need for keys.
>> 
>> Can someone confirm that this expectation is wrong and, if possible, why
>> the analogy with jobs is incorrect?
>
> Yes, that expectation is incorrect.  When Slurm launches jobs, even
> interactive ones, it is Slurm itself that handles connecting all the
> right sockets to all the right places, and MUNGE handles the
> authentication for that action.
>
> SSHing into a cluster node isn't done through Slurm; thus, sshd handles
> the authentication piece by calling out to your PAM stack (by
> default).  And you should think of pam_slurm_adopt as adding a
> "required but not sufficient" step in your auth process for SSH; that
> is, if it fails, the user can't get in, but if it succeeds, PAM just
> moves on to the next module in the stack.
>
> (Technically speaking, it's PAM, so the above is only the default
> configuration.  It's theoretically possible to set up PAM in a
> different way...but that's very much a not-good idea.)
>
>> I have a vague memory that this used to work on our old cluster with an
>> older version of Slurm, but I could be thinking of a time before we set
>> up pam_slurm_adopt.
>
> Some cluster tools, such as Warewulf and PERCEUS, come with built-in
> scripts to create SSH key pairs (with unencrypted private keys) that
> had special names for any (non-system) user who didn't already have a
> pair.  Maybe the prior cluster was doing something like that?  Or
> could it have been using Host-based Auth?
>
>> I have discovered that the users whose /home directories were migrated
>> from our previous cluster all seem to have a pair of keys which were
>> created along with files like '~/.bash_profile'.  Users who have been
>> set up on the new cluster don't have these files.
>> 
>> Is there some /etc/skel-like mechanism which will create passwordless
>> SSH keys when a user logs into the system for the first time?  It looks
>> increasingly to me that such a mechanism must have existed on our old
>> cluster.
>
> That tends to point toward the "something was doing it for you before
> that is no longer present" theory.
>
> You do NOT want to use /etc/skel for this, though.  That would cause
> all your users to have the same unencrypted private key providing
> access to their user account, which means they'd be able to SSH around
> as each other.  That's...problematic. ;-)
>
>> I was just getting round to the idea that /etc/profile.d might be
>> the way to go, so your script looks like exactly the sort of thing I
>> need.
>
> You can definitely do it that way, and a lot of sites do.  But
> honestly, you're better served by setting up Host-based Auth for SSH.
> It uses the same public/private keypair KEX to authenticate each other
> that is normally used for users, so as long as your hosts are secure,
> you can rely on the security of HostbasedAuthentication.
>
> With unencrypted private keys (that's what "passphraseless" really
> means), you definitely can be opening the door to abuse.  If you want
> to go that route, you'd likely want to set up something that users
> couldn't abuse, e.g. via AuthorizedKeysCommand, rather than the
> traditional in-homedir key pairs.
>
> We use host-based for all of our clusters here at LANL, and it
> simplifies a *lot* for us.  If you want to give it a try, there's a
> good cookbook here:
> https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication
>
> HTH,
> Michael

Thanks for the detailed explanations.  I was obviously completely
confused about what MUNGE does.  Would it be possible to say, in very
hand-waving terms, that MUNGE performs a similar role for the access of
processes to nodes as SSH does for the access of users to nodes?

Regarding keys vs. host-based SSH, I see that host-based would be more
elegant, but would involve more configuration.  What exactly are the
simplification gains you see? I just have a single cluster and naively I
would think dropping a script into /etc/profile.d on the login node
would be less work than re-configuring SSH for the login node and
multiple compute node images.

Regarding AuthorizedKeysCommand, I don't think we can use that, because
users don't necessarily have existing SSH keys.  What abuse scenarios
were you thinking of in connection with in-homedir key pairs?

Cheers,

Loris
-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de