Hey all,

We have a very heterogeneous cluster with a mix of Intel and AMD machines, GPUs,
and FPGAs. Right now we use Slurm to allocate time on the CPUs, but we do not
yet manage the GPUs or FPGAs with Slurm.

Some of our work involves hardware/SoC design, and we have been using FireSim
(https://fires.im/) for a good amount of it. FireSim is mostly a layer of Python
that wraps the underlying programs to provide "seamless" functionality and
orchestration. However, FireSim was designed for AWS, where a single user
controls all of the resources (FPGAs in our case) used for the simulation. We
have gone through the effort of pulling the important bits out of FireSim and
writing a shell script that we can execute locally to achieve the same thing as
FireSim's orchestration, without the need for an orchestrator.

We have two FPGAs in two separate but identical machines, and multiple people
who want to use them for very long periods of time (days or longer). We also
want to let people move freely between the two FPGAs. All of our other machines
in the cluster use Slurm, so we are looking to have Slurm multiplex these FPGA
jobs. We already have NFS set up, so shared disk is fine. The important thing
in our desired Slurm setup is that no two jobs are ever allocated the same FPGA
at the same time (you cannot have multiple FPGA jobs running the same design in
parallel, partly because of bitstreams, partly because of disk images).

Slurm does not track FPGAs by default. From what I have read, I need to describe
the FPGAs to Slurm as a GRES through gres.conf. I just want to confirm that I
have the correct steps below:
  1. Mark the 2 FPGAs as GRES with gres.conf.
      What should they be marked as? Do I need anything special?
  2. Then enable tracking on the FPGA GRES to make a TRES in slurm.conf.
      What should this be labeled as? I see other groups using GresTypes=gpu in
      their configuration. Would that apply here? Since FPGA usage is exclusive
      already, I could see the GPU GRES type applying.
  3. Then the TRES can be used by a QoS to enforce limits?
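
To make the three steps concrete, here is roughly what I have in mind. This is
an untested sketch based on my reading of the docs, not a working
configuration. The node names (fpga-node1, fpga-node2), the device path
(/dev/fpga0), the GRES name "fpga", the QoS name "normal", and the script name
are all placeholders I made up for illustration.

```
# --- gres.conf (step 1) ---
# One FPGA per node; File= points at the device node (placeholder path).
NodeName=fpga-node[1-2] Name=fpga File=/dev/fpga0

# --- slurm.conf (step 2) ---
# Declare the GRES name and attach one FPGA to each node.
# CPU/memory fields of the real NodeName line are omitted here.
GresTypes=fpga
NodeName=fpga-node[1-2] Gres=fpga:1
# Track the GRES as a TRES in accounting so limits can refer to it.
AccountingStorageTRES=gres/fpga

# --- QoS limit (step 3), run once by an admin ---
#   sacctmgr modify qos normal set MaxTRESPerUser=gres/fpga=1

# --- job submission ---
#   sbatch --gres=fpga:1 run_sim.sh
```

With Gres=fpga:1 on each node and every job requesting --gres=fpga:1, my
expectation is that Slurm would never place two jobs on the same FPGA at once,
but I would appreciate confirmation.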

The other important question: will I need to write a Slurm plugin (GRES or
otherwise) to manage and detect the FPGAs? I saw the File option for GRES in
gres.conf, but want to be sure. Writing a plugin is not a big deal for us, but
it is future maintenance that I would want to document.
