Re: [slurm-users] OpenMPI interactive change in behavior?
On Monday, 26 April 2021 2:12:41 PM PDT John DeSantis wrote:

> Furthermore, searching the mailing list suggests that the appropriate
> method is to use `salloc` first, despite version 17.11.9 not needing
> `salloc` for an "interactive" session.

Before 20.11, with salloc you needed to set a SallocDefaultCommand to use srun to push the session over on to a compute node, and then you needed to set a bunch of things to prevent that srun from consuming resources that the subsequent sruns would need. That was especially annoying when you were dealing with GPUs, as you would need to "srun" anything that needed to access them (when you used cgroups to control access).

With 20.11 there's a new "use_interactive_step" option that uses similar trickery, except Slurm handles not consuming those resources for you and handles GPUs correctly.

So for your 20.11 system I would recommend giving salloc and the "use_interactive_step" option a go and seeing if it helps.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
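For reference, the 20.11 option Chris mentions is set via LaunchParameters in slurm.conf. A minimal sketch (the option name is from the slurm.conf documentation; everything else about your config will differ):

```
# slurm.conf (Slurm 20.11+): let a bare `salloc` drop the user into a
# shell on the first allocated node, without the old SallocDefaultCommand
# trickery or its resource-consumption workarounds.
LaunchParameters=use_interactive_step
```

With this set, `salloc -N1` should land the user on the compute node directly, and the interactive step does not consume the resources that subsequent sruns inside the allocation need.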
[slurm-users] DMTCP or MANA with Slurm?
Is anyone currently using DMTCP or MANA with Slurm? I'm trying to set DMTCP up right now, and I'm having issues with it. Googling for answers to my problems, all I find are other people asking the same questions on the dmtcp-forum mailing list without getting any answers.

There was a commit to the DMTCP GitHub on Mar 1, but there hasn't been an official release since August 14, 2019. And MANA, which is based on DMTCP, doesn't seem to have had any activity in over 2 years. Given the lack of traffic on the mailing list and the lack of releases, I'm beginning to think that both of these projects are all but abandoned.

I know NERSC gave a talk about using MANA on their systems just a couple of weeks ago. I just watched a recording of it on YouTube yesterday.

-- 
Prentice
Re: [slurm-users] [External] Re: pam_slurm_adopt not working for all users
On 5/27/21 2:49 AM, Ole Holm Nielsen wrote:

> Hi Loris,
>
> On 5/27/21 8:19 AM, Loris Bennett wrote:
>> Regarding keys vs. host-based SSH, I see that host-based would be more
>> elegant, but would involve more configuration. What exactly are the
>> simplification gains you see? I just have a single cluster and naively
>> I would think dropping a script into /etc/profile.d on the login node
>> would be less work than re-configuring SSH for the login node and
>> multiple compute node images.
>
> IMHO, it's really simple to set up host-based SSH authentication:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication
>
> This is more secure on Linux clusters, and you don't need to configure
> users' SSH keys, so it requires less configuration for the sysadmin in
> the long run.

What makes this more secure?

-- 
Prentice
Re: [slurm-users] [External] Re: pam_slurm_adopt not working for all users
Loris,

Your analogy is incorrect, because Slurm doesn't use SSH to launch jobs; it uses its own communication protocol, which uses MUNGE for authentication.

Some schedulers used to use SSH to launch jobs, but most have moved to using their own communications protocol outside of SSH. It's possible that Slurm used SSH in the early days, too. I wouldn't know; I've only been using Slurm for the past 5 years. In those cases, you usually needed host-based SSH so that the scheduler daemon could launch jobs on the compute nodes. In that situation, you would be able to ssh from one node to another without per-user SSH keys, since they'd already be set up on a per-host basis. Perhaps that's what you are remembering.

Prentice

On 5/25/21 8:09 AM, Loris Bennett wrote:
> Hi everyone,
>
> Thanks for all the replies. I think my main problem is that I expect
> logging in to a node with a job to work with pam_slurm_adopt but
> without any SSH keys. My assumption was that MUNGE takes care of the
> authentication, since users' jobs start on nodes without the need for
> keys.
>
> Can someone confirm that this expectation is wrong and, if possible,
> why the analogy with jobs is incorrect?
>
> I have a vague memory that this used to work on our old cluster with an
> older version of Slurm, but I could be thinking of a time before we set
> up pam_slurm_adopt.
>
> Cheers,
> Loris

Brian Andrus writes:
> Oh, you could also use the ssh-agent to manage the keys, then use
> 'ssh-add ~/.ssh/id_rsa' to type the passphrase once for your whole
> session (from that system).
>
> Brian Andrus

On 5/21/2021 5:53 AM, Loris Bennett wrote:

Hi,

We have set up pam_slurm_adopt using the official Slurm documentation and Ole's information on the subject.

It works for a user who has SSH keys set up, albeit the passphrase is needed:

$ salloc --partition=gpu --gres=gpu:1 --qos=hiprio --ntasks=1 --time=00:30:00 --mem=100
salloc: Granted job allocation 7202461
salloc: Waiting for resource configuration
salloc: Nodes g003 are ready for job
$ ssh g003
Warning: Permanently added 'g003' (ECDSA) to the list of known hosts.
Enter passphrase for key '/home/loris/.ssh/id_rsa':
Last login: Wed May 5 08:50:00 2021 from login.curta.zedat.fu-berlin.de
$ ssh g004
Warning: Permanently added 'g004' (ECDSA) to the list of known hosts.
Enter passphrase for key '/home/loris/.ssh/id_rsa':
Access denied: user loris (uid=182317) has no active jobs on this node.
Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

If SSH keys are not set up, then the user is asked for a password:

$ squeue --me
  JOBID PARTITION     NAME     USER ST    TIME NODES NODELIST(REASON)
7201647      main test_job nokeylee  R 3:45:24     1 c005
7201646      main test_job nokeylee  R 3:46:09     1 c005
$ ssh c005
Warning: Permanently added 'c005' (ECDSA) to the list of known hosts.
nokeylee@c005's password:

My assumption was that a user should be able to log into a node on which that person has a running job without any further ado, i.e. without the necessity to set up anything else or to enter any credentials. Is this assumption correct? If so, how can I best debug what I have done wrong?

Cheers,
Loris
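For readers hitting the same wall: pam_slurm_adopt is an account-stage PAM module, so it authorizes (and adopts the session into the job's cgroup) but never authenticates. A sketch of the usual placement, loosely following the module's documentation — the pam_access line is the common admin-exception pattern, and real distro stacks differ:

```
# /etc/pam.d/sshd (sketch, not a complete stack). Authentication
# (keys, passwords, host-based) is still handled by sshd/PAM as usual;
# these lines only decide whether an already-authenticated user may
# stay on this node.
account    sufficient   pam_access.so        # let admins in regardless
account    required     pam_slurm_adopt.so   # deny users with no job here
```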
Re: [slurm-users] pam_slurm_adopt not working for all users
I mistyped that. "they CAN'T get into the login nodes using SSH keys"

On 5/27/21 10:08 AM, Lloyd Brown wrote:
> they get into the login nodes using SSH keys

-- 
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://marylou.byu.edu
Re: [slurm-users] pam_slurm_adopt not working for all users
While that's absolutely a significant issue, here's how we solved it, despite still using user keys. This basically assures that while people can SSH around with keys within our cluster, they get into the login nodes using SSH keys. Combine that with the required enrollment in 2FA, and I think we're doing decently well. Network routing rules and switch ACLs prevent users from getting into the non-login nodes from outside the cluster.

(excerpt from sshd_config on login nodes only - it's much simpler on non-login nodes):

# default behavior - disallow PubKeyAuthentication
PubKeyAuthentication no
# default behavior - force people to the "you must enroll in 2FA"
# message, and then exit
ForceCommand /usr/local/bin/2fa_notice.sh
# All users enrolled in 2FA are part of the twofactusers group
Match Group twofactusers
    ForceCommand none
# Allow PubKeyAuthentication for subnets that are internal to the cluster
Match Address ListOfClusterInternalSubnets
    PubKeyAuthentication yes

Lloyd

On 5/27/21 9:27 AM, Michael Jennings wrote:
> As far as abuse of keys goes: What's stopping your user from taking
> that private key you created for them (which is, as you recall,
> *unencrypted*) outside of your cluster to another host somewhere else
> on campus. Maybe something that has tons of untrusted folks with root.
> Then any of those folks can SSH to your cluster as that user.

-- 
Lloyd Brown
HPC Systems Administrator
Office of Research Computing
Brigham Young University
http://marylou.byu.edu
Re: [slurm-users] pam_slurm_adopt not working for all users
On Thursday, 27 May 2021, at 08:19:14 (+0200), Loris Bennett wrote:

> Thanks for the detailed explanations. I was obviously completely
> confused about what MUNGE does. Would it be possible to say, in very
> hand-waving terms, that MUNGE performs a similar role for the access of
> processes to nodes as SSH does for the access of users to nodes?

If you replace the word "processes" with the word "jobs," you've got it. :-)

MUNGE is really just intended to be a simple, lightweight solution to allow for creating a single, global "credential domain" among all the hosts in an HPC cluster using a single shared secret. Without going into too much detail with the crypto stuff, it basically allows a trusted local entity to cryptographically prove to another that they're both part of the same trust/cred domain; having established this, they know they can trust each other to provide and/or validate credentials between hosts.

But I want to emphasize the "single shared secret" part. That means there's a single trust domain. Think "root of trust" with nothing but the root of trust. So you can authenticate a single group of hosts to all the rest of the group such that all are equals, but that's it. There's no additional facility for authenticating different roles or anything like that. Either you have the same shared secret or you don't; nothing else is possible.

> Regarding keys vs. host-based SSH, I see that host-based would be more
> elegant, but would involve more configuration. What exactly are the
> simplification gains you see? I just have a single cluster and naively
> I would think dropping a script into /etc/profile.d on the login node
> would be less work than re-configuring SSH for the login node and
> multiple compute node images.

I like to think of it as "one and done." At least in our case at LANL, and at LBNL previously, all nodes of the same type/group boot the same VNFS image. As long as I don't need to cryptographically differentiate among, say, compute nodes, I only have to set up a single set of credentials for all the hosts, and I'm done.

It also saves overall support time in my experience. By taking the responsibility for inter-machine trust myself at the system level, I don't have to worry about (1) modifying a user's SSH config without their knowledge, (2) running the risk of them messing with their config and breaking it, or (3) any user support/services calls about "why can't I do any of the things on the stuff?!" :-)

It is totally a personal/team choice, but I'll be honest: Once I "discovered" host-based authentication and all the headaches it saved our sysadmin and consulting teams, I was kicking myself for having done it the other way for so long! :-D

> Regarding AuthorizedKeysCommand, I don't think we can use that, because
> users don't necessarily have existing SSH keys. What abuse scenarios
> were you thinking of in connection with in-homedir key pairs?

Users don't have to have existing keys for it to work; the command you specify can easily create a key pair, drop the private key, and output the public key. Or even simpler, you can specify a value for "AuthorizedKeysFile" that points to a directory users can't write to, and store a key pair for each user in that location. Lots of ways to do it. But if I'm being frank about it, if I had my druthers, we'd be using certificates for authentication, not files. The advantages are, in my very humble opinion, well worth a little extra setup time!

As far as abuse of keys goes: What's stopping your user from taking that private key you created for them (which is, as you recall, *unencrypted*) outside of your cluster to another host somewhere else on campus? Maybe something that has tons of untrusted folks with root. Then any of those folks can SSH to your cluster as that user. Credential theft is a *huge* problem in HPC across the world, so I always recommend that sysadmins think of it as Public Enemy #1! The more direct and permanent control you have over user credentials, the better. :-)

> Would it be correct to say that, if one were daft enough, one could
> build some sort of terminal server on top of MUNGE without using SSH,
> but which could then replicate basic SSH behaviour?

No; that would only provide a method to authenticate servers at best. You can't authenticate users for the reasons I noted above. Single shared key, single trust domain.

> Your explanation is very clear, but it still seems like quite a few
> steps with various gotchas, like the fact that, as I understand it,
> shosts.equiv has to contain all the possible ways a host might be
> addressed (short name, long name, IP).

You are correct, though that's easy to automate with a teensy weensy shell script. But yes, there's more up-front configuration. Again, though, I truly believe it saves admin time in the long run (not to mention user support staff time and user pain). But again, that's a personal or team choice.

I'm not sure if I'm clearing things up or just
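The "single shared secret, single trust domain" point can be caricatured in a few lines of shell. This is emphatically not MUNGE's real credential format, wire protocol, or API — just the keyed-MAC idea behind a shared-secret trust domain, with made-up payload fields:

```shell
# NOT MUNGE's real format -- just the shared-secret idea it rests on.
# Every host in the cluster holds the same secret (think of the
# contents of /etc/munge/munge.key).
SECRET="contents-of-the-shared-munge-key"   # illustrative stand-in
PAYLOAD="uid=1000 gid=1000 host=c005"       # made-up credential fields

# "munge": mint a credential by MACing the payload with the shared secret
TAG=$(printf '%s' "$PAYLOAD" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $NF}')

# "unmunge" on any other host holding the same secret: recompute and
# compare. A host without the secret can neither mint nor verify --
# one trust domain, no roles, exactly as described above.
CHECK=$(printf '%s' "$PAYLOAD" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $NF}')
[ "$TAG" = "$CHECK" ] && echo "credential accepted"
```

Tampering with the payload (say, changing uid=1000 to uid=0) changes the MAC, so verification fails everywhere in the domain; but nothing in the scheme can distinguish one trusted host or role from another.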
Re: [slurm-users] Building SLURM with X11 support
Hi Thekla,

it is built in by default since... some time. You need to activate it by adding PrologFlags=X11 to your slurm.conf (see here: https://slurm.schedmd.com/slurm.conf.html#OPT_PrologFlags)

Best,
Marcus

On 27.05.21 14:07, Thekla Loizou wrote:
> Dear all,
> I am trying to use X11 forwarding in SLURM with no success. We are
> installing SLURM using RPMs that we generate with the command
> "rpmbuild -ta slurm*.tar.bz2" as per the documentation. I am currently
> working with SLURM version 20.11.7-1. What am I missing when it comes
> to building SLURM with X11 enabled? Which flags and packages are
> required?
> Regards,
> Thekla

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.: +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de
Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Sekretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de
Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Norbert Lossau
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598
Zertifiziert nach ISO 9001
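Concretely, the built-in forwarding Marcus describes amounts to the following (option name per the linked slurm.conf documentation; the xterm example is illustrative):

```
# slurm.conf -- enable Slurm's built-in X11 forwarding
PrologFlags=X11
```

After restarting slurmctld and the slurmds, users request forwarding per job, e.g. `srun --x11 --pty xterm`, from a login session that itself has a working X11 display.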
Re: [slurm-users] Building SLURM with X11 support
On 5/27/21 2:07 PM, Thekla Loizou wrote:
> I am trying to use X11 forwarding in SLURM with no success. We are
> installing SLURM using RPMs that we generate with the command
> "rpmbuild -ta slurm*.tar.bz2" as per the documentation. I am currently
> working with SLURM version 20.11.7-1. What am I missing when it comes
> to building SLURM with X11 enabled? Which flags and packages are
> required?

What is your OS? Do you have X11 installed? Did you install all Slurm prerequisites? For CentOS 7 it is:

yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel \
    numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel \
    rrdtool-devel ncurses-devel gtk2-devel libssh2-devel libibmad libibumad \
    perl-Switch perl-ExtUtils-MakeMaker

see https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites

I hope this helps.

/Ole
[slurm-users] Building SLURM with X11 support
Dear all,

I am trying to use X11 forwarding in SLURM with no success. We are installing SLURM using RPMs that we generate with the command "rpmbuild -ta slurm*.tar.bz2" as per the documentation. I am currently working with SLURM version 20.11.7-1.

What am I missing when it comes to building SLURM with X11 enabled? Which flags and packages are required?

Regards,
Thekla
Re: [slurm-users] pam_slurm_adopt not working for all users
Hi Ole,

Ole Holm Nielsen writes:

> Hi Loris,
>
> On 5/27/21 8:19 AM, Loris Bennett wrote:
>> Regarding keys vs. host-based SSH, I see that host-based would be more
>> elegant, but would involve more configuration. What exactly are the
>> simplification gains you see? I just have a single cluster and naively
>> I would think dropping a script into /etc/profile.d on the login node
>> would be less work than re-configuring SSH for the login node and
>> multiple compute node images.
>
> IMHO, it's really simple to set up host-based SSH authentication:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication

Your explanation is very clear, but it still seems like quite a few steps with various gotchas, like the fact that, as I understand it, shosts.equiv has to contain all the possible ways a host might be addressed (short name, long name, IP).

> This is more secure on Linux clusters, and you don't need to configure
> users' SSH keys, so it requires less configuration for the sysadmin in
> the long run.

It is not clear to me what the security advantage is, and setting up the keys is just one script in /etc/profile.d. Regarding the long term, the keys which were set up on our old cluster were just migrated to the new cluster and still work, so it is also a one-time thing. I assume I must be missing something.

Cheers,
Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
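The shosts.equiv gotcha raised here (every name a host might present: short name, long name, IP) is exactly the part that is easy to script. A hedged sketch with made-up node names and domain — a real version would also emit each node's IP, e.g. from DNS or `getent hosts`:

```shell
# Emit shosts.equiv entries covering short name and FQDN for each node.
# Node list and domain are illustrative; IP entries are omitted because
# they are site-specific (append them from your own host database).
DOMAIN="cluster.example.org"
NODES="login1 c001 c002 g003"

for host in $NODES; do
    printf '%s\n' "$host" "$host.$DOMAIN"
done > shosts.equiv
```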
Re: [slurm-users] pam_slurm_adopt not working for all users
Ward Poelmans writes:

> On 27/05/2021 08:19, Loris Bennett wrote:
>> Thanks for the detailed explanations. I was obviously completely
>> confused about what MUNGE does. Would it be possible to say, in very
>> hand-waving terms, that MUNGE performs a similar role for the access
>> of processes to nodes as SSH does for the access of users to nodes?
>
> A tiny bit yes. Munge allows you to authenticate users between servers
> (like a unix socket does within a single machine):
> https://github.com/dun/munge/wiki/Man-7-munge

OK, thanks for the information. I had already read the man page for MUNGE, but to me it doesn't make it explicitly clear that MUNGE doesn't, out of the box, include the possibility to do something like SSH. Would it be correct to say that, if one were daft enough, one could build some sort of terminal server on top of MUNGE without using SSH, but which could then replicate basic SSH behaviour?

Cheers,
Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
Re: [slurm-users] pam_slurm_adopt not working for all users
On 27/05/2021 08:19, Loris Bennett wrote:
> Thanks for the detailed explanations. I was obviously completely
> confused about what MUNGE does. Would it be possible to say, in very
> hand-waving terms, that MUNGE performs a similar role for the access of
> processes to nodes as SSH does for the access of users to nodes?

A tiny bit yes. Munge allows you to authenticate users between servers (like a unix socket does within a single machine): https://github.com/dun/munge/wiki/Man-7-munge

Ward
Re: [slurm-users] pam_slurm_adopt not working for all users
Hi Loris,

On 5/27/21 8:19 AM, Loris Bennett wrote:
> Regarding keys vs. host-based SSH, I see that host-based would be more
> elegant, but would involve more configuration. What exactly are the
> simplification gains you see? I just have a single cluster and naively
> I would think dropping a script into /etc/profile.d on the login node
> would be less work than re-configuring SSH for the login node and
> multiple compute node images.

IMHO, it's really simple to set up host-based SSH authentication: https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication

This is more secure on Linux clusters, and you don't need to configure users' SSH keys, so it requires less configuration for the sysadmin in the long run.

/Ole
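For orientation, the recipe behind that link boils down to roughly the following sketch (the niflheim wiki and the OpenSSH host-based cookbook cited elsewhere in this thread have the full, authoritative steps):

```
# On the cluster nodes, /etc/ssh/sshd_config:
HostbasedAuthentication yes

# On the client side, /etc/ssh/ssh_config (cluster-internal hosts):
HostbasedAuthentication yes
EnableSSHKeysign yes

# Plus /etc/ssh/shosts.equiv listing every trusted cluster host, and
# /etc/ssh/ssh_known_hosts containing all of their host public keys.
```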
Re: [slurm-users] pam_slurm_adopt not working for all users
Hi Michael,

Michael Jennings writes:

> On Tuesday, 25 May 2021, at 14:09:54 (+0200),
> Loris Bennett wrote:
>
>> I think my main problem is that I expect logging in to a node with a
>> job to work with pam_slurm_adopt but without any SSH keys. My
>> assumption was that MUNGE takes care of the authentication, since
>> users' jobs start on nodes without the need for keys.
>>
>> Can someone confirm that this expectation is wrong and, if possible,
>> why the analogy with jobs is incorrect?
>
> Yes, that expectation is incorrect. When Slurm launches jobs, even
> interactive ones, it is Slurm itself that handles connecting all the
> right sockets to all the right places, and MUNGE handles the
> authentication for that action.
>
> SSHing into a cluster node isn't done through Slurm; thus, sshd handles
> the authentication piece by calling out to your PAM stack (by
> default). And you should think of pam_slurm_adopt as adding a
> "required but not sufficient" step in your auth process for SSH; that
> is, if it fails, the user can't get in, but if it succeeds, PAM just
> moves on to the next module in the stack.
>
> (Technically speaking, it's PAM, so the above is only the default
> configuration. It's theoretically possible to set up PAM in a
> different way...but that's very much a not-good idea.)
>
>> I have a vague memory that this used to work on our old cluster with
>> an older version of Slurm, but I could be thinking of a time before we
>> set up pam_slurm_adopt.
>
> Some cluster tools, such as Warewulf and PERCEUS, come with built-in
> scripts to create SSH key pairs (with unencrypted private keys) that
> had special names for any (non-system) user who didn't already have a
> pair. Maybe the prior cluster was doing something like that? Or
> could it have been using Host-based Auth?
>
>> I have discovered that the users whose /home directories were migrated
>> from our previous cluster all seem to have a pair of keys which were
>> created along with files like '~/.bash_profile'. Users who have been
>> set up on the new cluster don't have these files.
>>
>> Is there some /etc/skel-like mechanism which will create passwordless
>> SSH keys when a user logs into the system for the first time? It looks
>> increasingly to me that such a mechanism must have existed on our old
>> cluster.
>
> That tends to point toward the "something was doing it for you before
> that is no longer present" theory.
>
> You do NOT want to use /etc/skel for this, though. That would cause
> all your users to have the same unencrypted private key providing
> access to their user account, which means they'd be able to SSH around
> as each other. That's...problematic. ;-)
>
>> I was just getting round to the idea that /etc/profile.d might be
>> the way to go, so your script looks like exactly the sort of thing I
>> need.
>
> You can definitely do it that way, and a lot of sites do. But
> honestly, you're better served by setting up Host-based Auth for SSH.
> It uses the same public/private keypair KEX to authenticate each other
> that is normally used for users, so as long as your hosts are secure,
> you can rely on the security of HostbasedAuthentication.
>
> With unencrypted private keys (that's what "passphraseless" really
> means), you definitely can be opening the door to abuse. If you want
> to go that route, you'd likely want to set up something that users
> couldn't abuse, e.g. via AuthorizedKeysCommand, rather than the
> traditional in-homedir key pairs.
>
> We use host-based for all of our clusters here at LANL, and it
> simplifies a *lot* for us. If you want to give it a try, there's a
> good cookbook here:
> https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication
>
> HTH,
> Michael

Thanks for the detailed explanations.

I was obviously completely confused about what MUNGE does. Would it be possible to say, in very hand-waving terms, that MUNGE performs a similar role for the access of processes to nodes as SSH does for the access of users to nodes?

Regarding keys vs. host-based SSH, I see that host-based would be more elegant, but would involve more configuration. What exactly are the simplification gains you see? I just have a single cluster and naively I would think dropping a script into /etc/profile.d on the login node would be less work than re-configuring SSH for the login node and multiple compute node images.

Regarding AuthorizedKeysCommand, I don't think we can use that, because users don't necessarily have existing SSH keys. What abuse scenarios were you thinking of in connection with in-homedir key pairs?

Cheers,
Loris

-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
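For comparison, the /etc/profile.d approach discussed in this thread usually looks something like the sketch below (file name and key type are illustrative; as noted above, the private key ends up unencrypted, which is exactly the abuse risk being debated):

```shell
# /etc/profile.d/cluster-ssh-key.sh (sketch): on first login, create a
# passphraseless key pair and authorize it for intra-cluster SSH.
# WARNING: the private key is unencrypted; host-based auth avoids this.
key="$HOME/.ssh/id_rsa"      # illustrative key path
if [ ! -f "$key" ]; then
    mkdir -p "$HOME/.ssh"
    chmod 700 "$HOME/.ssh"
    ssh-keygen -q -t rsa -b 4096 -N '' -f "$key"
    cat "$key.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
fi
```

Since $HOME is typically shared across the cluster, the public key is immediately valid on every node; the `-f "$key"` guard keeps the script from regenerating keys on every login.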