Some details on our setup in light of the approach Jarrod outlined …

On Wed, 29 Mar 2023 17:37:30 +0000, Jarrod Johnson <jjohns...@lenovo.com> wrote:

> On confluent diskless, there is an interesting benefit that becomes a
> challenge for bittorrent: a typical diskless node never downloads the
> whole diskless image.  This means less ram sucked up by the diskless
> image, and also that the diskless image can be large without pruning.

I guess this is mitigated by our OS image being rather minimal to begin
with. It only has the basic system software and drivers, up to a
working C/C++ compiler setup that is able to bootstrap further software.

Such further software is provided in a versioned tree over NFS and
managed via environment modules. So an approach that optimizes the use
of a large OS image by keeping only the needed parts in memory would
not benefit us much. The squashfs is below 1G, which is no big deal
for our compute nodes with 64G of RAM. A full image of 10G would be
annoying.

Rather, the split for us is between this sub-1G system image and the
software tree on NFS at 421G, grown over about 8 years of system
lifetime. Add to that an uncounted number of anaconda, spack, or
whatever trees that users installed into their own storage shares.
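
As an illustration of how that tree is consumed on the nodes (module
names and paths here are invented, not our actual ones):

  $ module load gcc/12.2.0 openmpi/4.1.5
  $ which mpicc
  /sw/env/gcc-12.2.0/openmpi/4.1.5/bin/mpicc   # lives in the NFS tree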

Getting whole images out to the cluster nodes quickly is very much
relevant in this scenario, also for the next system we will set up. Of
course one could imagine a full-on NFS root, but there are reasons why
that has gone out of fashion; with a minimal main system image, the
image-in-RAM approach can be seen as a mode of aggressive client-side
caching.
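
For contrast, a full-on NFS root would mean booting with kernel
parameters along these lines (server name and export path are just
placeholders), so that every file access goes over the network,
subject to client-side caching, instead of one bulk image transfer at
boot:

  root=/dev/nfs nfsroot=imgserver:/export/node-rootfs,ro ip=dhcp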

It might not matter much with a 10G network or IB on the image server,
but any avoidable bottleneck sucks, even if it does not hurt in
practice right now.

> trick were done to only torrent the parts as needed locally

That does sound like a complexity nightmare … but it might still
provide some benefit, assuming that the nodes mostly need the same
parts. You'd have to do a lot of work to integrate those layers,
though. Not worth it, I guess.

> the diskless images are now encrypted […] by node TPM

Hm. Use of TPMs on cluster nodes. I haven't thought about that much yet.
Another point: I'd love vendors to finally implement safeguards
ensuring that root on a server cannot manipulate any firmware from
userspace (be it the network card or a hard disk), and especially
cannot access the BMC, which should only answer to external IPMI
requests. Can Secure Boot really ensure nothing has been messed with
through a root exploit? I'd love a simple switch that allows certain
platform changes only in the pre-boot environment (BIOS, UEFI … and
IPMI from the outside) and locks things down once the kernel boots.

I still don't see how you can really trust a machine once someone has
had root on it, if you're really paranoid.

The whole machinery of crypto checking (Secure Boot) is a rather
elaborate mess that could be avoided if there were a clear hardware
barrier allowing certain modifications (also to PCIe and SATA devices,
at least the onboard ones) only outside the booted Linux context. If
there's no way for rogue users or attackers to modify things, then you
know the system is clean on a fresh boot from the network, maybe after
replacing any SATA or USB devices that just cannot be protected that way.

Is any compute-node vendor offering this kind of manipulation
protection?

I'd love to have that kind of security to start with: not having to
consider the hardware as possibly trashed once someone may have pulled
off a root exploit. Then we can talk about encrypting images and
securing userspace …

> if [ "untethered" = "$(getarg confluent_imagemethod)" ]; then
>     mount -t tmpfs untethered /mnt/remoteimg
>     curl 
> https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs 
> -o /mnt/remoteimg/rootimg.sfs
> else
>     confluent_urls="$confluent_urls 
> https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs";
>     /opt/confluent/bin/urlmount $confluent_urls /mnt/remoteimg
> fi

Looks easy enough.
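
Just thinking out loud: a torrent-based variant of the "untethered"
branch might look roughly like the sketch below. That assumes a
torrent client such as aria2c is available in the initramfs and that a
rootimg.sfs.torrent file is published next to the image on the web
host, neither of which confluent provides today; this is purely an
illustration.

  if [ "untethered" = "$(getarg confluent_imagemethod)" ]; then
      mount -t tmpfs untethered /mnt/remoteimg
      # fetch only the small metadata file over HTTP ...
      curl https://$confluent_whost/confluent-public/os/$confluent_profile/rootimg.sfs.torrent \
          -o /tmp/rootimg.sfs.torrent
      # ... and let the swarm move the bulk data; keep seeding for a
      # few minutes so peers booting at the same time can pull from
      # each other instead of from the image server
      aria2c --seed-time=5 --dir=/mnt/remoteimg /tmp/rootimg.sfs.torrent
  fi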

> Is the logic for getting the image.  One thing to note is that a
> typical diskless image boot in confluent, the booted system does not
> see rootimg.sfs, so the torrent execution would have to stay in the
> 'initramfs' world (which does persist after boot, as a separate mount
> namespace)

I think that is why I hooked the rootimg.sfs up to /dev/loop0 back in
the day and hacked ctorrent to allow a block device as the data
source. The loop device stays accessible.
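
For reference, the rough shape of that setup (mount points simplified
and the writable overlay omitted, so take this as a sketch rather than
the actual xCAT code):

  # attach the downloaded image to a loop device that survives the
  # switch to the real root
  losetup /dev/loop0 /mnt/remoteimg/rootimg.sfs
  mount -t squashfs -o ro /dev/loop0 /sysroot
  # after boot, the patched ctorrent seeds from /dev/loop0 instead of
  # needing the rootimg.sfs file to be visible in the booted system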

Anyone from xCAT with thoughts on this? Should I work on a patch for
current xCAT (not sure where I'd find time to test that, though)?

I don't know what kind of cluster management our next system will
have. It could be that my path of least resistance is a quick hack on
that, like the one I did with xCAT back in 2015 …


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg

