[slurm-users] Re: Recover Batch Script Error
Are you absolutely certain you’ve done it before for completed jobs? I would not expect that to work for completed jobs, with the possible exception of very recently completed jobs (or am I thinking of Torque?).

Other replies mention the relatively new feature (21.08?) to store the job script in the database. Be mindful of the database implications here (I believe I have had conversations about this recently with some experienced sites on this mailing list).

--
#BlackLivesMatter
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of NJ

On Feb 16, 2024, at 14:41, Jason Simms via slurm-users wrote:

> I've used the "scontrol write batch_script" command to output the job
> submission script from completed jobs in the past, but for some reason, no
> matter which job I specify, it tells me it is invalid.
> [...]

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Recover Batch Script Error
Yes, that is what we are also doing and it works well. Note that when requesting the batch script of another user's job, one sees nothing (rather than an error message saying that one does not have permission).

On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Are you using the job_script storage option? If so then you should be able
> to get at it by doing:
>
> sacct -B -j JOBID
>
> https://slurm.schedmd.com/sacct.html#OPT_batch-script
>
> -Paul Edmon-
[slurm-users] Re: Recover Batch Script Error
Are you using the job_script storage option? If so then you should be able to get at it by doing:

    sacct -B -j JOBID

https://slurm.schedmd.com/sacct.html#OPT_batch-script

-Paul Edmon-

On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote:

> [...]
> Alternatively, is there another way - even if a manual database query - to
> recover the job script, assuming it exists in the database?
> [...]
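A minimal sketch of the approach described in the reply above. It assumes Slurm 21.08 or later and that AccountingStoreFlags=job_script was already set in slurm.conf when the job was submitted; scripts of jobs submitted before the flag was enabled are not in the database. The helper name recover_script is hypothetical, and the job ID in the usage note is the one from the original post.

```shell
#!/bin/sh
# Assumed slurm.conf setting (must be in place before the job is submitted):
#   AccountingStoreFlags=job_script

# recover_script JOBID
# Hypothetical helper: print the batch script that the accounting database
# holds for JOBID. sacct requires an explicit job ID together with
# --batch-script (-B).
recover_script() {
    jobid=$1
    if [ -z "$jobid" ]; then
        echo "usage: recover_script JOBID" >&2
        return 2
    fi
    sacct --batch-script --jobs="$jobid"
}
```

For example, `recover_script 38960` runs `sacct --batch-script --jobs=38960`, the long form of `sacct -B -j 38960`. Unlike `scontrol write batch_script`, which asks slurmctld, this reads from the accounting database and so does not depend on the job still being known to the controller.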
[slurm-users] Recover Batch Script Error
Hello all,

I've used the "scontrol write batch_script" command to output the job submission script from completed jobs in the past, but for some reason, no matter which job I specify, it tells me it is invalid. Any way to troubleshoot this? Alternatively, is there another way - even if a manual database query - to recover the job script, assuming it exists in the database?

    sacct --jobs=38960
    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    38960        amr_run_v+ tsmith2lab tsmith2lab         72  COMPLETED      0:0
    38960.batch       batch            tsmith2lab         40  COMPLETED      0:0
    38960.extern     extern            tsmith2lab         72  COMPLETED      0:0
    38960.0      hydra_pmi+            tsmith2lab         72  COMPLETED      0:0

    scontrol write batch_script 38960
    job script retrieval failed: Invalid job id specified

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms
[slurm-users] Re: Need help managing licence
The simple answer is to just add a line such as

    Licenses=whatever:20

to slurm.conf and then ask your users to use the -L option as described at https://slurm.schedmd.com/licenses.html

This works very well; however, Slurm does not enforce it the way it enforces other resources: a job that never requests the license can still run the software. You will find posts on this list from me trying to achieve such enforcement with a prolog, but I ended up banging my head on the keyboard too much and eventually gave up. User education was easier for me. Depending on your user community, banging your head on the keyboard might be easier than educating your users -- if so, please share how you solve the issue.

On Fri, Feb 16, 2024 at 7:48 AM Sylvain MARET via slurm-users <slurm-users@lists.schedmd.com> wrote:

> [...]
> I don't know how to manage and ensure slurm doesn't launch a job if 20
> people are already running code with this license.
>
> How do you guys manage paid licenses on your cluster ? Any advice would
> be appreciated !
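A rough sketch of the setup described in the reply above, under stated assumptions: the license name cplex is made up for illustration (the reply only says "whatever"), the count of 20 comes from the original question, and both function names are hypothetical. As the reply notes, this only counts jobs that actually ask for the license.

```shell
#!/bin/sh
# Assumed slurm.conf line on the controller ("cplex" is an illustrative name):
#   Licenses=cplex:20

# submit_with_cplex SCRIPT
# Hypothetical wrapper: submit SCRIPT while requesting one cplex license.
# Slurm keeps the job pending while all 20 licenses are in use by other jobs.
submit_with_cplex() {
    script=$1
    sbatch --licenses=cplex:1 "$script"
}

# show_license_usage
# Hypothetical wrapper: list configured licenses with total, used and free counts.
show_license_usage() {
    scontrol show licenses
}
```

Users can equally put `#SBATCH --licenses=cplex:1` (or `-L cplex:1`) in the job script itself, next to the `module load cplex/2111` line. Because a job submitted without the option is neither counted nor blocked, the scheme depends on every CPLEX user requesting the license.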
[slurm-users] Need help managing licence
Hello everyone!

Recently our users bought a CPLEX dynamic license and want to use it on our Slurm cluster. I've installed the paid version of CPLEX as a module, so authorized users can load it with a simple "module load cplex/2111" command, but I don't know how to make sure Slurm doesn't launch a job if 20 people are already running code with this license.

How do you guys manage paid licenses on your cluster? Any advice would be appreciated!

Regards,
Sylvain Maret
[slurm-users] MPI/PMIx Issues after 23.11 Update
Hi all,

We’re running a small Slurm dev cluster on Ubuntu and are facing issues with MPI/PMIx after upgrading Slurm from 23.02.5 to 23.11.3. The first job step to use MPI within a job fails roughly 80% of the time, but subsequent attempts to use MPI within the same job work fine. For the failing job step we see this error after hitting the MPI timeout:

    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:741: 0x55a03f8d7a90: collective timeout seq=0
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: hpc-d-msh-01a02 [1]: pmixp_coll.c:286: Dumping collective state
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:759: 0x55a03f8d7a90: COLL_FENCE_RING state seq=0
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:762: my peerid: 1:hpc-d-msh-01a02
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:769: neighbor id: next 0:hpc-d-msh-01a01, prev 0:hpc-d-msh-01a01
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b08, #0, in-use=0
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b40, #1, in-use=0
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b78, #2, in-use=1
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:787: seq=0 contribs: loc=1/prev=1/fwd=0
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:791: neighbor contribs [2]:
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:824: done contrib: hpc-d-msh-01a01
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:826: wait contrib: -
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:828: status=PMIXP_COLL_RING_FINILIZE
    slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:831: buf (offset/size): 0/33362
    [hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
    [hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 31
    [hpc-d-msh-01a01:47652] *** An error occurred in MPI_Send
    [hpc-d-msh-01a01:47652] *** reported by process [683360612,0]
    [hpc-d-msh-01a01:47652] *** on communicator MPI_COMM_WORLD
    [hpc-d-msh-01a01:47652] *** MPI_ERR_OTHER: known error not in list
    [hpc-d-msh-01a01:47652] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [hpc-d-msh-01a01:47652] ***    and potentially your MPI job)
    slurmstepd: error: *** STEP 1890.0 ON hpc-d-msh-01a01 CANCELLED AT 2024-02-14T16:14:52 ***

OpenMPI/PMIx versions have not changed, and downgrading Slurm to 23.02.5 seems to resolve the issue. We’d appreciate any pointers anyone might have.

Thanks
Oli