Hi,
Ah okay, so your requirements include completely insulating (some) jobs from outside access, including root?
Correct.
I've seen this kind of requirement before, e.g. for working with non-defaced medical data - generally a tough problem imo, because this level of data security seems more or less incompatible with the idea of a multi-user HPC system.

I remember that this year's ZKI-AK Supercomputing spring meeting had Sebastian Krey from GWDG presenting the KISSKI project ("KI-Servicezentrum für Sensible und Kritische Infrastrukturen", https://kisski.gwdg.de/ ), which works in this problem domain. Are you involved in that? The setup with containerization and 'node hardening' sounds very similar to me.
Indeed. We (ZIH TU Dresden) are working together with Hendrik Nolte from GWDG to implement their concept of a "secure Workflow on HPC" on our system. In short, the idea is to have nodes with additional (cryptographic) authentication of jobs. I'm just double-checking alternatives for some details that may make the concept easier to implement.
Re "preventing the scripts from running": I'd say it's about as easy as to otherwise manipulate any job submission that goes through slurmctld (e.g. by editing slurm.conf), so without knowing your exact use case and requirements, I can't think of a simple solution.
The resource manager, i.e. slurmctld, and slurmd run on different machines.
There is a local copy of slurm.conf both for slurmctld and for the node(s), i.e. slurmd, each using only the parts relevant to it. So slurmd doesn't care about the submit plugins and slurmctld doesn't (need to) know about the Prolog, correct? The idea of the workflow is that only the node itself needs to be considered secure, and access to the node is only possible via the slurmd running on it. That slurmd can be configured to always execute the Prolog (a local script) prior to each job and to deny execution on failed authentication. Circumventing this authentication then requires modifying the slurm.conf on that node, which has to be considered impossible, as an attacker with that capability could just as well modify anything else (e.g. the Prolog itself, to remove the checks).
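For illustration, on such a node this would roughly correspond to the following lines in the node-local slurm.conf (the script path is just a placeholder):

    Prolog=/etc/slurm/prolog-auth.sh   # local script doing the job authentication check
    PrologFlags=Alloc                  # run the Prolog at job allocation, before the job starts

If that Prolog exits non-zero, slurmd drains the node and the job is not started there, so a job that fails the authentication check never runs on the node.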

But the possibility of slurmd handling a `--no-allocate` job introduces a new way to circumvent the authentication. Restricting this in the slurm.conf of the slurmctld effectively only prevents requests to slurmd to skip the Prolog (i.e. the -Z flag) from being issued, but if the slurmd somehow received such a request it would still handle it. So the security now additionally relies on the security of the resource manager. It would be more secure if slurmd on those node(s) could be configured to never skip the Prolog, even if the user appears to be privileged. Since the node could be rebooted from a read-only image prior to each job, the security of each job could then be ensured without any influence on the rest of the cluster.
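To make that concrete: a request like the following, issued by a privileged user (root or SlurmUser, as I understand it) from a host that can reach the slurmd, bypasses the normal allocation step entirely (the node name is just an example):

    # only permitted for root/SlurmUser; no allocation is created in slurmctld
    srun --no-allocate -w secure-node01 /bin/true

Whether the Prolog runs for such a request is then entirely up to the slurmd on the node, which is why we would like to pin that down on the node side.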

So in summary: We don't want to trust the slurmctld (running somewhere else) but only the slurmd (running on the node) to always execute the Prolog.

I hope that explains it well enough.
Kind regards,
Alex
