Re: [slurm-users] salloc problem
On 27/10/22 4:18 am, Gizo Nanava wrote:

> we run into another issue when using salloc interactively on a cluster where Slurm power saving is enabled. The problem seems to be caused by the job_container plugin and occurs when the job starts on a node which boots from a powered-down state. If I resubmit a job immediately after the failure to the same node, it always works. I can't find any way to reproduce the issue other than booting a reserved node from a powered-down state.

Looking at this:

> slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory

I'm wondering if /scratch is a separate filesystem and, if so, whether it could be getting mounted _after_ slurmd has started on the node? If that's the case it would explain the error, and why the job works immediately afterwards.

On our systems we always try to ensure that slurmd is the very last thing to start on a node, and that it only starts if everything has succeeded up to that point.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
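[If /scratch does turn out to be a separate filesystem, one way to enforce that ordering on a systemd-based node is a drop-in for slurmd.service. This is a minimal sketch only; the drop-in filename is arbitrary, and it assumes /scratch is mounted via fstab or a systemd mount unit on the compute node:

    # /etc/systemd/system/slurmd.service.d/10-wait-for-scratch.conf
    # Make slurmd wait for the /scratch mount, and refuse to start
    # if the mount never appears.
    [Unit]
    RequiresMountsFor=/scratch
    After=remote-fs.target

After "systemctl daemon-reload", systemd will not start slurmd until it considers /scratch mounted, which matches the "slurmd starts last, and only if everything succeeded" approach described above.]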
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmctld node
On 27/10/22 11:30 pm, Richard Chang wrote:

> Yes, the system is a HPE Cray EX, and I am trying to use switch/hpe_slingshot.

Which version of Slurm are you using, Richard?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
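[For reference, the setting under discussion is the switch plugin selection in slurm.conf. slurmctld has to be able to load the same plugin as the compute nodes, which is why a library missing from the controller node can break it:

    # slurm.conf (read by slurmctld as well as slurmd)
    SwitchType=switch/hpe_slingshot
]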
Re: [slurm-users] Prolog and job_submit
On 30/10/22 12:27 pm, Davide DelVento wrote:

> But if I understand correctly your Prolog vs TaskProlog distinction, the latter would have the environment variable and run as user, whereas the former runs as root and doesn't get the environment,

That's correct. My personal view is that injecting arbitrary input from a user (such as these environment variables) into a root-privileged process such as a prolog would make life hazardous from a security point of view.

> not even from the job_submit script.

That is also correct; all the job_submit script will do is inject the environment variable into the job's environment, just as if the user had done so.

> The problem with a TaskProlog approach is that what I want to do (making a non-accessible file available) would work best as root. As a workaround I could make that file merely obscure but still user-accessible. Not ideal, but better than nothing, which is what I have now. Alternatively, I could use another way to let the job_submit lua script communicate with the Prolog, not sure exactly what (a temp directory on the shared filesystem, writeable only by root??)

My only other thought is that you might be able to use node features and job constraints to communicate this without the user realising. For instance, you could declare the nodes where the software is installed to have "Feature=mysoftware", and then your job_submit could spot users requesting the license and add the constraint "mysoftware" to their job. The (root-privileged) Prolog can see that via the SLURM_JOB_CONSTRAINTS environment variable and so could react to it (a sketch follows below). Then, when 23.02 comes out, you could use the new SLURM_JOB_LICENSES environment variable in addition, and retire the old method once jobs using it have completed.

> Thanks for pointing to that commit. A bit too far down the road, but good to know.

No worries, best of luck!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
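[A rough illustration of the feature/constraint approach described above. This is a sketch only: the feature name "mysoftware" comes from the discussion, the node names are hypothetical, and the action inside the Prolog is a placeholder for whatever makes the restricted file available:

    # slurm.conf: tag the nodes where the software is installed
    #   NodeName=node[01-16] Feature=mysoftware ...

    #!/bin/bash
    # Prolog (runs as root on the compute node before the job starts).
    # job_submit has added the "mysoftware" constraint to jobs that
    # requested the license, and it shows up here:
    if [[ "${SLURM_JOB_CONSTRAINTS}" == *mysoftware* ]]; then
        # placeholder: make the otherwise inaccessible file available
        :
    fi
    exit 0
]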
[slurm-users] What happens if slurmdbd loses connection to mysql
Hi,

I have two dedicated nodes for Slurm, node1 and node2, set up as follows:

    Role      SlurmCTLD   SlurmDBD   MariaDB server (accounting storage)
    Primary   Node1       Node2      Node2
    Backup    Node2       Node1      -

StateSaveLocation is on shared NFS storage from an NFS server.

I want to know what happens if Node2 goes down. I have read in the documentation that if slurmdbd goes down, slurmctld can still hold back the accounting info, and when slurmdbd is back up it will be passed on and written to the backend database (not the exact words, but in that vein).

So what happens when Node2 goes down and the backup slurmdbd on Node1 takes over? Will it fail instantly, or keep logging the data in its memory and write it back to the database when it is back up?

Hope I could explain what I mean.

Thanks & regards,
Richard.
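[For context, that primary/backup pairing is expressed with the following parameters. A sketch only, assuming the node names above; see the slurm.conf(5) and slurmdbd.conf(5) man pages for the full details:

    # slurm.conf: where slurmctld sends accounting records
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=node2
    AccountingStorageBackupHost=node1

    # slurmdbd.conf: where slurmdbd runs, its backup, and its database
    DbdHost=node2
    DbdBackupHost=node1
    StorageType=accounting_storage/mysql
    StorageHost=node2
]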
Re: [slurm-users] Prolog and job_submit
Hi Chris,

> Unfortunately it looks like the license request information doesn't get
> propagated into any prologs from what I see from a scan of the
> documentation. :-(

Thanks. If I am reading you right, I had noticed the same thing, and in fact that's why I wrote that job_submit lua script which gets the license information and sets an environment variable, in the hope that such a variable would be inherited by the prolog script.

But if I understand correctly your Prolog vs TaskProlog distinction, the latter would have the environment variable and run as the user, whereas the former runs as root and doesn't get the environment, not even from the job_submit script.

The problem with a TaskProlog approach is that what I want to do (making a non-accessible file available) would work best as root. As a workaround I could make that file merely obscure but still user-accessible. Not ideal, but better than nothing, which is what I have now.

Alternatively, I could use another way to let the job_submit lua script communicate with the Prolog, though I'm not sure exactly what (a temp directory on the shared filesystem, writeable only by root??)

Thanks for pointing to that commit. A bit too far down the road, but good to know.

Cheers,
Davide
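[One other avenue worth noting: although the license request isn't exported into the Prolog's environment, the root-run Prolog can ask the controller for it with scontrol. A sketch, assuming a license named "mysoftware" is what's being looked for; note that calling scontrol from a Prolog adds load on slurmctld at scale:

    #!/bin/bash
    # Prolog: the license request is not an environment variable here,
    # but "scontrol show job" reports it in its Licenses= field.
    licenses=$(scontrol show job "${SLURM_JOB_ID}" | tr ' ' '\n' | grep '^Licenses=')
    if [[ "${licenses}" == *mysoftware* ]]; then
        # placeholder: make the restricted file available, as root
        :
    fi
    exit 0
]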
Re: [slurm-users] Prolog and job_submit
On 30/10/22 10:23 am, Chris Samuel wrote:

> Unfortunately it looks like the license request information doesn't get propagated into any prologs from what I see from a scan of the documentation. 🙁

This _may_ be fixed in the next major Slurm release (February), if I'm reading this right:

https://github.com/SchedMD/slurm/commit/3c6c4c08d8deb89aa2c992a65964f53663097d26

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Prolog and job_submit
On 29/10/22 7:37 am, Davide DelVento wrote:

> So either I misinterpreted that "same environment as the user tasks" or there is something else that I am doing wrong.

Slurm has a number of different prologs that can run, which can cause confusion, and I suspect that's what's happening here. The "Prolog" in your configuration runs as root, but it's the "TaskProlog" that runs as the user and so has access to the job's environment (including the environment variable you are setting).

Unfortunately, it looks like the license request information doesn't get propagated into any prologs, from what I see from a scan of the documentation. :-(

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
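[To illustrate the distinction: a TaskProlog runs as the submitting user with the job's environment, and lines it prints in the form "export NAME=value" are injected into the task's environment, while "print ..." lines go to the task's stdout. A minimal sketch; MY_LICENSE_VAR is a placeholder for whatever variable the job_submit script sets:

    #!/bin/bash
    # TaskProlog: runs as the user, so it can see the job's environment.
    if [[ -n "${MY_LICENSE_VAR}" ]]; then
        echo "print Job requested license: ${MY_LICENSE_VAR}"
        echo "export MY_SOFTWARE_READY=1"
    fi
]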