Hi Sergio,

What you describe would require major, complex changes and several months of work. Changes would be required to the data structures in src/common/gres.h and the code in src/common/gres.c. Major changes would also be required to the select plugin of your choice, basically adding yet another dimension to its resource selection logic.

If you decide to proceed, I would suggest spending several weeks studying the code and developing a design, then posting the design to this mailing list for comment.
Quoting Sergio Iserte Agut <sise...@uji.es>:

> I have a cluster with 3 nodes (one of them has 2 GPUs, while the others have 1 each).
>
> I know that if I run:
> *# srun --gres=gpu:3 hostname*
> I would get the error:
> *srun: error: Unable to allocate resources: Requested node configuration is not available*
> because Slurm is not able to allocate gres across nodes.
> My goal is for Slurm to be able to do so, using a global GPU counter. I have a
> tool that distributes the work among the GPUs of the cluster, which is why I
> would like Slurm to schedule and select these GPUs.
>
> I saw some clues in */var/log/slurmctld.log*:
> *_pick_best_nodes: job 110 never runnable*
> *_slurm_rpc_allocate_resources: Requested node configuration is not available*
>
> I have spent several days trying to understand where the error is generated,
> so that I can start on my implementation.
> I have found this flow among the modules:
> *node_scheduler.c -> node_select.c -> select_plugin.c -> gres.c*
> However, I don't know where to start, because I would rather not modify
> the Slurm core; I would prefer to do it with plug-ins.
>
> I hope this is well explained.
>
> Regards,
> Sergio Iserte.
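
To make the gap concrete, here is a minimal sketch in Python (not Slurm code) contrasting the two semantics. It assumes the 3-node cluster described in the quoted message. The node names, the function names, and the greedy "largest node first" policy are all hypothetical and for illustration only. The real work would live in gres.c and the select plugin, as noted in the reply.

```python
# Hypothetical inventory for the cluster in the quoted message:
# one node with 2 GPUs, two nodes with 1 GPU each. Node names are made up.
NODE_GPUS = {"node1": 2, "node2": 1, "node3": 1}


def runnable_per_node(gpus_per_node, nodes):
    """Per-node semantics: --gres=gpu:N asks for N GPUs on each allocated node,
    so the request is only runnable if some node has at least N GPUs."""
    return any(count >= gpus_per_node for count in nodes.values())


def pick_nodes_global(total_gpus, nodes):
    """Hypothetical global-counter semantics: treat the request as a
    cluster-wide total and greedily take GPUs from the largest nodes first.
    Returns {node: gpus_taken}, or None if the cluster has too few GPUs."""
    picked = {}
    remaining = total_gpus
    # Sort by GPU count (descending), then by name for a stable order.
    for name, count in sorted(nodes.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining <= 0:
            break
        take = min(count, remaining)
        if take > 0:
            picked[name] = take
            remaining -= take
    return picked if remaining == 0 else None


if __name__ == "__main__":
    print(runnable_per_node(3, NODE_GPUS))   # no single node has 3 GPUs
    print(pick_nodes_global(3, NODE_GPUS))   # spread across nodes instead
```

Under the per-node reading no single node can supply 3 GPUs, which matches the "never runnable" message in the log. The global variant has to decide how many GPUs to take from each node, which is the extra dimension in the selection logic mentioned in the reply. A real implementation would also have to account for CPUs, memory, and GPUs already in use.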