Hi,

We have allowed some courses to use our Slurm cluster for teaching purposes, 
which of course leads to all kinds of exciting experiments - not always the 
most clever programming, but it certainly teaches me where we need to tighten 
up our configuration.

The default instinct for many students just starting out is to grab as much 
CPU as possible - they don't yet understand cluster computing or batch 
scheduling. One example I see often: a student uses the R parallel package and 
calls detectCores(), which of course returns every core Linux reports on the 
node. They also don't specify --ntasks, so Slurm allocates 1 - but nothing 
checks the ballooning number of R processes spawned from that detectCores() 
value, and now we have overloaded nodes.
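
For concreteness, the pattern looks roughly like this (a simplified sketch; 
sqrt is just a stand-in for whatever they are actually computing):

    library(parallel)
    n <- detectCores()                # every core Linux sees on the node,
                                      # not the Slurm allocation
    results <- mclapply(1:1000, sqrt, # forks up to n workers regardless of
                        mc.cores = n) # what Slurm actually granted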

I see that availableCores() is suggested as a friendlier method on shared 
resources like this, since it returns the number of cores actually assigned by 
Slurm; a student using the parallel package would then explicitly request 
cores in their submit file. This would be nice IF students voluntarily used 
availableCores() instead of detectCores(), but we know that's not really 
enforceable.
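
If they did cooperate, the fix would look something like this (a sketch, 
assuming --cpus-per-task is set in the submit file and the parallelly package 
is available; future re-exports availableCores() as well):

    ## submit file:  #SBATCH --cpus-per-task=4
    library(parallelly)
    n <- availableCores()   # honors SLURM_CPUS_PER_TASK; falls back to
                            # detectCores() outside a job
    results <- parallel::mclapply(1:1000, sqrt, mc.cores = n)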

I thought cgroups (which we are using) would prevent some of this behavior on 
the nodes (we are constraining CPU and RAM) - I'd also like to avoid I/O wait 
if possible. What I want is for either Linux or Slurm to keep a job from 
grabbing more cores than it was assigned at submit time. Is there something 
else I should be configuring to safeguard against this behavior? If Slurm 
assigns 1 CPU to the task, then no matter what craziness is in the code, 1 CPU 
is all they get. Possible?
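
For reference, here is roughly the relevant part of what we have now, as I 
understand it (a sketch, not the complete files):

    # slurm.conf
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

My understanding is that ConstrainCores=yes should pin a job's processes to 
the CPUs it was allocated, so the extra R workers would just time-share the 
allocated core rather than spill onto the rest of the node - but I'd 
appreciate confirmation that this is the right set of knobs, or a pointer to 
whatever I'm missing.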

Thanks for any insight!

--mike
