[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Ryan Novosielski via slurm-users
Are you absolutely certain you’ve done it before for completed jobs? I would 
not expect that to work for completed jobs, with the possible exception of very 
recently completed jobs (or am I thinking of Torque?).

Other replies mention the relatively new feature (21.08?) to store the job 
script in the database. Be mindful of the database implications here (I believe 
I have had conversations about this recently with some experienced sites on 
this mailing list).
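
For anyone who wants to check whether script storage is switched on at their site, here is a minimal sketch. It assumes the feature is the job_script flag of AccountingStoreFlags in slurm.conf and that the file lives at /etc/slurm/slurm.conf; the helper name is made up, and included config files are not followed.

```shell
#!/bin/sh
# Assumed slurm.conf line that turns on script storage in the accounting DB:
#   AccountingStoreFlags=job_script
# Hypothetical helper: succeed if the given slurm.conf sets that flag.
stores_job_scripts() {
    conf=${1:-/etc/slurm/slurm.conf}   # assumed default location
    # Match an uncommented AccountingStoreFlags line that lists job_script.
    grep -Eiq '^[[:space:]]*AccountingStoreFlags[[:space:]]*=.*job_script' "$conf"
}
```

Something like `stores_job_scripts && echo "scripts are stored"` would then report the setting; stored scripts add to the size of the accounting database, which is the implication to keep in mind.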

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com




[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Davide DelVento via slurm-users
Yes, that is what we are also doing and it works well.
Note that when requesting a batch script for another user's job, one sees
nothing (rather than an error message saying that one lacks permission).

On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Are you using the job_script storage option? If so then you should be able
> to get at it by doing:
>
> sacct -B -j JOBID
>
> https://slurm.schedmd.com/sacct.html#OPT_batch-script
>
> -Paul Edmon-


[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Paul Edmon via slurm-users
Are you using the job_script storage option? If so then you should be 
able to get at it by doing:


sacct -B -j JOBID

https://slurm.schedmd.com/sacct.html#OPT_batch-script
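
A small sketch of how the retrieval could be wrapped so the script ends up in a file. The function name and default file name are hypothetical, and it only accepts plain numeric job IDs to keep things simple; --batch-script and --jobs are the long forms of -B and -j.

```shell
#!/bin/sh
# Hypothetical helper: save the stored batch script of a job to a file.
# Usage: save_job_script JOBID [OUTFILE]
save_job_script() {
    jobid=$1
    outfile=${2:-job_${jobid}.sh}
    # Simplification: accept plain numeric job IDs only.
    case $jobid in
        ''|*[!0-9]*)
            echo "usage: save_job_script JOBID [OUTFILE]" >&2
            return 2
            ;;
    esac
    # Ask the accounting database for the stored script.
    if ! sacct --batch-script --jobs="$jobid" > "$outfile"; then
        echo "sacct failed for job $jobid" >&2
        rm -f "$outfile"
        return 1
    fi
    # Empty output can mean that no script was stored, or that the job
    # belongs to another user.
    if [ ! -s "$outfile" ]; then
        echo "no script returned for job $jobid" >&2
        rm -f "$outfile"
        return 1
    fi
}
```

For example, `save_job_script 38960 amr_run.sh` would write whatever sacct returns for job 38960 to amr_run.sh (the output file name is a placeholder).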

-Paul Edmon-



[slurm-users] Recover Batch Script Error

2024-02-16 Thread Jason Simms via slurm-users
Hello all,

I've used the "scontrol write batch_script" command to output the job
submission script from completed jobs in the past, but for some reason, no
matter which job I specify, it tells me it is invalid. Any way to
troubleshoot this? Alternatively, is there another way - even if a manual
database query - to recover the job script, assuming it exists in the
database?

sacct --jobs=38960
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
38960        amr_run_v+ tsmith2lab tsmith2lab         72  COMPLETED      0:0
38960.batch       batch            tsmith2lab         40  COMPLETED      0:0
38960.extern     extern            tsmith2lab         72  COMPLETED      0:0
38960.0      hydra_pmi+            tsmith2lab         72  COMPLETED      0:0

scontrol write batch_script 38960
job script retrieval failed: Invalid job id specified
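
One way to narrow this down is to check where the job record still lives. The sketch below assumes that scontrol only knows jobs still held in memory by slurmctld (completed jobs are dropped after MinJobAge), while sacct reads the accounting database, and that "scontrol show job" exits non-zero for a job it no longer knows. The function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: report where a job record can still be found.
# Prints "controller", "accounting" or "unknown".
where_is_job() {
    jobid=$1
    # Assumed: exits non-zero once slurmctld has purged the job.
    if scontrol show job "$jobid" >/dev/null 2>&1; then
        echo controller
    # Any output here means the accounting database has a record.
    elif [ -n "$(sacct --jobs="$jobid" --noheader --parsable2 --format=JobID 2>/dev/null)" ]; then
        echo accounting
    else
        echo unknown
    fi
}
```

If `where_is_job 38960` prints "accounting", the "Invalid job id specified" message from scontrol would be expected, and the script could only come from the database (if it was stored there).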

Warmest regards,
Jason

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms



[slurm-users] Re: Need help managing licence

2024-02-16 Thread Davide DelVento via slurm-users
The simple answer is to just add a line to slurm.conf such as
Licenses=whatever:20

and then ask your users to use the -L option as described at

https://slurm.schedmd.com/licenses.html
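
A sketch of what that could look like for the cplex case from the question. The license name cplex, the count of 20 and the wrapper name are assumptions for illustration; --licenses is the long form of -L.

```shell
#!/bin/sh
# slurm.conf (assumed example, one pool of 20 tokens):
#   Licenses=cplex:20
# Hypothetical wrapper so that users cannot forget the license request.
# Usage: submit_with_license NAME COUNT [sbatch args...]
submit_with_license() {
    lic=$1
    count=$2
    shift 2
    # Everything after NAME and COUNT is passed through to sbatch unchanged.
    sbatch --licenses="${lic}:${count}" "$@"
}
```

With this, `submit_with_license cplex 1 job.sh` would run `sbatch --licenses=cplex:1 job.sh`, and Slurm would hold the job while all 20 tokens are in use by other jobs that requested them.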

This works very well; however, it does not do enforcement the way Slurm does
with other resources. You will find posts on this list from me trying to
achieve such enforcement with a prolog, but I ended up banging my head on
the keyboard too much and eventually gave up. User education was easier
for me. Depending on your user community, banging your head on the keyboard
might be easier than educating your users -- if so, please share how you
solve the issue.
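
To see what Slurm's own bookkeeping says, something along these lines might help. It is only a sketch: the function name is hypothetical, and it assumes that "scontrol show licenses" prints a Free=<n> field. It reflects only jobs that actually requested the license, so it does not close the enforcement gap described above.

```shell
#!/bin/sh
# Hypothetical helper: print how many tokens of a license Slurm considers free.
# Assumes the scontrol output contains a "Free=<n>" field.
free_licenses() {
    scontrol --oneliner show licenses "$1" |
        tr ' ' '\n' |            # one key=value pair per line
        sed -n 's/^Free=//p' |   # keep only the value of Free=
        head -n 1
}
```

For example, `free_licenses cplex` would print the number of cplex tokens Slurm believes are still available.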



[slurm-users] Need help managing licence

2024-02-16 Thread Sylvain MARET via slurm-users

Hello everyone!

Recently our users bought a CPLEX dynamic license and want to use it on 
our Slurm cluster.
I've installed the paid version of CPLEX as a module, so authorized 
users can load it with a simple "module load cplex/2111" command, but I 
don't know how to make sure Slurm doesn't launch a job if 20 
people are already running code with this license.


How do you guys manage paid licenses on your cluster? Any advice would 
be appreciated!


Regards,
Sylvain Maret




[slurm-users] MPI/PMIx Issues after 23.11 Update

2024-02-16 Thread Oliver Smith via slurm-users
Hi all,

We’re running a small Slurm dev cluster on Ubuntu and are facing issues with 
MPI/PMIx after upgrading Slurm from 23.02.5 to 23.11.3.

The first job step to use MPI within a job fails roughly 80% of the time, but 
subsequent attempts to use MPI within the same job work fine. For the failing 
job step we see this error after hitting the MPI timeout:

slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_reset_if_to: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:741: 0x55a03f8d7a90: collective timeout seq=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_log: hpc-d-msh-01a02 [1]: pmixp_coll.c:286: Dumping collective state
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:759: 0x55a03f8d7a90: COLL_FENCE_RING state seq=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:762: my peerid: 1:hpc-d-msh-01a02
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:769: neighbor id: next 0:hpc-d-msh-01a01, prev 0:hpc-d-msh-01a01
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b08, #0, in-use=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b40, #1, in-use=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b78, #2, in-use=1
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:787:  seq=0 contribs: loc=1/prev=1/fwd=0
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:791:  neighbor contribs [2]:
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:824:  done contrib: hpc-d-msh-01a01
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:826:  wait contrib: -
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:828:  status=PMIXP_COLL_RING_FINILIZE
slurmstepd: error:  mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:831:  buf (offset/size): 0/33362
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:477  Error: Failed to resolve UCX endpoint for rank 31
[hpc-d-msh-01a01:47652] *** An error occurred in MPI_Send
[hpc-d-msh-01a01:47652] *** reported by process [683360612,0]
[hpc-d-msh-01a01:47652] *** on communicator MPI_COMM_WORLD
[hpc-d-msh-01a01:47652] *** MPI_ERR_OTHER: known error not in list
[hpc-d-msh-01a01:47652] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[hpc-d-msh-01a01:47652] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 1890.0 ON hpc-d-msh-01a01 CANCELLED AT 2024-02-14T16:14:52 ***


OpenMPI/PMIx versions have not changed and downgrading slurm to 23.02.5 seems 
to resolve the issue. We’d appreciate any pointers anyone might have.
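
To put a number on the "roughly 80%" when comparing versions or settings, a small harness along these lines could be used. It is only a sketch and the function name is hypothetical; it simply runs a command repeatedly and counts non-zero exits.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and report how often it fails.
# Usage: count_failures N command [args...]
count_failures() {
    n=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$n" ]; do
        # Any non-zero exit status counts as one failed trial.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$n failed"
}
```

For example, `count_failures 10 salloc -N 2 srun --mpi=pmix ./mpi_hello` would start a fresh allocation for every trial, so each run exercises the first MPI step of a job (./mpi_hello is a placeholder binary).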

Thanks

Oli


