Dear Community,

We are trying to activate sharding. Our compute nodes are configured with 64 cores, 4 physical MI250X GPUs (8 logical) and 4 NUMA domains: 1 physical GPU (2 logical GPUs) per NUMA domain, and 1 logical GPU per L3 cache domain.
gres.conf:

AutoDetect=rsmi
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD128 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD129 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD130 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD131 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD132 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD133 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD134 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD135 Count=4

If I ask for 2 cores with block:cyclic, I get the expected result:

srun -N1 -n2 -c1 --cpu-bind=cores -m block:cyclic --pty bash

cpuset cgroup is 1,17

But if I add 2 shards to the request, I get a result I did not expect:

srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash

cpuset cgroup is 1-2
ROCR_VISIBLE_DEVICES=0

Is it possible to request 2 shards in round-robin fashion, in order to run a multi-GPU job on different GPUs?

srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash

Practically, I would like to have this result:

cpuset cgroup is 1,17
ROCR_VISIBLE_DEVICES=0,1

Thank you in advance,
Alessandro

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
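For comparing the bindings above, a small helper (our own sketch, not part of Slurm) can expand the cpuset list strings that the cgroup reports ("1,17", "1-2") into explicit CPU sets, e.g. to verify from inside a job step that the two tasks really landed on cores in different NUMA domains:

```python
def expand_cpuset(spec: str) -> set[int]:
    """Expand a Linux cpuset list string like "1,17" or "0-3,8" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

if __name__ == "__main__":
    # Inside a job step, the actual binding can be read from
    # /proc/self/status (Cpus_allowed_list) and fed to this helper.
    print(expand_cpuset("1,17"))  # the block:cyclic result above
    print(expand_cpuset("1-2"))   # the result observed with --gres=shard:2
```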
