Hey all,

We have a very heterogeneous cluster with a mix of Intel and AMD machines, GPUs, and FPGAs. We currently use Slurm to allocate time on the CPUs, but do not yet manage the GPUs or FPGAs with Slurm.
Some of our work involves hardware/SoC design, and we have been using Firesim (https://fires.im/) for a good amount of it. Firesim is a collection of Python that wraps the underlying programs to provide "seamless" functionality and orchestration. However, Firesim was designed for AWS, where a single user controls all of the resources (FPGAs in our case) used for the simulation. We have gone through the effort of pulling the important bits out of Firesim into a shell script that we can execute locally to achieve the same thing as Firesim's orchestration, without the need for an orchestrator.

We have two FPGAs in two separate but identical machines, and multiple people who want to use them for very long periods of time (days or longer). We want to let people move freely between the two FPGAs. All of the other machines in the cluster use Slurm, so we are looking to have Slurm multiplex these FPGA jobs. NFS is already set up, so shared disk is fine.

The important requirement for our Slurm setup is that the same FPGA is never allocated to more than one job at a time (we cannot have multiple FPGA jobs running the same design in parallel, partly because of bitstreams, partly because of disk images).

Slurm does not track FPGAs by default. I have looked into how to teach Slurm about the FPGAs and know that I need to declare them as a GRES through gres.conf. I just want to confirm that I have the correct steps below:

1. Mark the 2 FPGAs as GRES with gres.conf. What should they be marked as? Do I need anything special?
2. Enable tracking on the FPGA GRES to make it a TRES in slurm.conf. What should this be labeled as? I see other groups using GresTypes=gpu in their configuration. Would that apply here? Since FPGA usage is already exclusive, I could see the GPU GRES type applying.
3. Use the TRES in a QoS to enforce limits. Is that right?
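To make steps 1 to 3 concrete, this is roughly the configuration I have in mind. It is an untested sketch, not a known-good setup: the node names (fpga-node[1-2]), the device path (/dev/fpga0), the QoS name (normal), and the script name (run_firesim.sh) are all placeholders, and I am assuming a custom GRES name of "fpga" is acceptable in place of "gpu".

```
# gres.conf on each FPGA node
# /dev/fpga0 is a placeholder for whatever device file the FPGA driver exposes
NodeName=fpga-node[1-2] Name=fpga File=/dev/fpga0

# slurm.conf
GresTypes=fpga
AccountingStorageTRES=gres/fpga
# add Gres=fpga:1 to the existing node definitions (CPU/memory fields omitted here)
NodeName=fpga-node[1-2] Gres=fpga:1

# QoS limit and job submission, run from a shell:
#   sacctmgr modify qos normal set MaxTRESPerUser=gres/fpga=1
#   sbatch --gres=fpga:1 run_firesim.sh
```

If this is right, each job would request one FPGA with --gres=fpga:1 and Slurm would not hand the same device to two jobs at once.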
The other important question: will I need to write a Slurm plugin (GRES or otherwise) to manage and detect the FPGAs? I saw the File option for GRES in gres.conf, but want to be sure. Writing a plugin is not a big deal for us, but it is future maintenance that I would want to document.
