Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-19 Thread mohammed shambakey
Hi I'm not an expert, but is it possible that the currently running job is consuming the whole node because it has been allocated all of the node's memory (so the other 2 jobs have to wait until it finishes)? Maybe you could try restricting the memory requested by each job? Regards On Thu, Jan 18,

[slurm-users] how start_time is calculated for slurm multi-cluster

2023-11-05 Thread mohammed shambakey
Hi I'm having a hard time figuring out the distribution of jobs between 2 clusters in a Slurm multi-cluster environment. The documentation says that each job is submitted to the cluster that provides the earliest start time, and once the task is submitted to a cluster, it can't be re-distributed

[slurm-users] (no subject)

2023-11-05 Thread mohammed shambakey
I'm having a hard time figuring out the distribution of jobs between 2 clusters in a Slurm multi-cluster environment. The documentation says that each job is submitted to the cluster that provides the earliest start time, and once the task is submitted to a cluster, it can't be re-distributed to

[slurm-users] Question about Failed to unpack DBD_NODE_STATE message

2023-10-28 Thread mohammed shambakey
Hi slurmdbd produces the following error in the log file: error: CONN:X Failed to unpack DBD_NODE_STATE message I tried to restart it many times, but the error keeps coming back. I restarted the machine, but it's still there. Regards -- Mohammed

Re: [slurm-users] Question about gdb sbatch

2023-10-21 Thread mohammed shambakey
point gdb at the location of the source code, and then follow > any of the gazillion tutorials around about gdb. If you are not familiar > with gdb already, I strongly recommend that you start with some simpler > program before attempting something as big as slurm. > > Have a grea

[slurm-users] problem with slurm configuration and pmix

2023-10-21 Thread mohammed shambakey
Hi I cloned the slurm repository from GitHub (version 23.11), and tried to configure it as follows: configure --config-cache --prefix=/usr/slurm_vm_23.11 --sysconfdir=/etc/slurm_vm_23 --with-http-parser=/usr/ --with-yaml=/usr/ --with-jwt=/usr/ --with-mysql_config=/usr/bin --enable-debug

[slurm-users] Question about gdb sbatch

2023-10-20 Thread mohammed shambakey
Hi Is it possible to debug "sbatch" itself when submitting a script? For example, I want to debug the following command: sbatch -Mall some_script.sh I don't want to debug the "some_script.sh". I want to debug the "sbatch" itself when submitting the "some_script.sh". I tried to use "gdb" but I'm

[slurm-users] unpredictable behavior of slurm multi-cluster

2023-10-08 Thread mohammed shambakey
Hi I have 2 slurm clusters: cluster A with 3 compute nodes, each with 32 CPUs; cluster B with 4 compute nodes, each with 8 CPUs. I'm using slurm multi-cluster on clusters A and B. I tried to run the NAS Parallel Benchmarks (sp.A.x) on them. Initially, I tried to benchmark the execution time

Re: [slurm-users] Submitting jobs from machines outside the cluster

2023-08-27 Thread mohammed shambakey
Hi Maybe the slurm REST API would be useful (https://slurm.schedmd.com/rest.html)? But I think you will need to generate a token to be able to communicate with the cluster. Regards On Sun, Aug 27, 2023, 8:20 AM Steven Swanson wrote: > Can I submit jobs with a computer/docker container that is not part

[slurm-users] Fwd: question about time statistics for sbatch

2023-07-09 Thread mohammed shambakey
Hi I'm doing a simple benchmark to record the time for issuing a sbatch command. The contents of the script are: #!/bin/bash IFS='= ' read _ local_clusterid <<< $(scontrol show config |grep -i clustername) # Extract local cluster name echo "Local cluster: "$local_clusterid # Check input

[slurm-users] question about time statistics for sbatch

2023-07-09 Thread mohammed shambakey
Hi I'm doing a simple benchmark to record the time for issuing a sbatch command. The contents of the script are: #!/bin/bash IFS='= ' read _ local_clusterid <<< $(scontrol show config |grep -i clustername) # Extract local cluster name echo "Local cluster: "$local_clusterid # Check input

[slurm-users] Problem with Cuda program in multi-cluster

2023-07-04 Thread mohammed shambakey
Hi I work on 3 clusters: A, B, C. Clusters A and C each have 3 compute nodes and a head node. In each of clusters A and C, one of the 3 compute nodes has an old GPU. All nodes, on all clusters, have Ubuntu 22.04 except for the 2 nodes with GPU (both of them have Ubuntu 18.04 to suit the old GPU

Re: [slurm-users] Slurm Rest API error

2023-07-01 Thread mohammed shambakey
Hi I'm also trying to use slurm rest api. I wonder if the error about slurmdbd has anything to do with it. Does slurmctld connect correctly to slurmdbd? Regards On Wed, Jun 28, 2023, 9:03 PM Brian Andrus wrote: > Vlad, > > Actually, it looks like it is working. You are using v0.39 for the

Re: [slurm-users] federation vs multi-cluster

2023-06-26 Thread mohammed shambakey
e, but separate > clusters is not one of them. > > Brian Andrus > On 6/26/2023 6:11 AM, mohammed shambakey wrote: > > Hi > > Just out of interest, I wonder what the exact difference between slurm > multi-cluster and federation (apart from unique job id, and federation

[slurm-users] federation vs multi-cluster

2023-06-26 Thread mohammed shambakey
Hi Just out of interest, I wonder what the exact difference between slurm multi-cluster and federation (apart from unique job id, and federation limitations) is. Usually, I use the "-Mall" option with multi-cluster. Initially, I thought the federation would send tasks to more than one cluster at

[slurm-users] slurm restapi and multi-cluster

2023-06-08 Thread mohammed shambakey
Hi Is it possible to connect slurm restapi queries to a multi-cluster/federation? I guess each request uses one (and only one) JWT, so it is not possible to do it, right? Regards -- Mohammed

Re: [slurm-users] sview not installed

2023-04-23 Thread mohammed shambakey
the sview). After that, sview is installed in the correct location. Regards On Sun, Apr 23, 2023 at 10:50 AM Ole Holm Nielsen < ole.h.niel...@fysik.dtu.dk> wrote: > On 23-04-2023 02:43, mohammed shambakey wrote: > > I installed slurm 23.11.0-0rc1, and sview is not installed, despite it

[slurm-users] sview not installed

2023-04-22 Thread mohammed shambakey
Hi I installed slurm 23.11.0-0rc1, and sview is not installed, even though it exists in /src/sview/sview. I can execute it from that path but not from /bin (because it does not exist there). I tried just copying it to /bin, but it complained about being just a wrapper. I wonder if I'm missing something?

Re: [slurm-users] How to install a newer slurm version on Ubuntu 18.04

2023-04-14 Thread mohammed shambakey
ay be an NFS share or something like that. > I would also strongly recommend not mixing Ubuntu releases in the same > cluster. > > Reed > > On Apr 14, 2023, at 11:12 AM, mohammed shambakey > wrote: > > Hi > > I'm new to slurm, and sorry if this is a repeated email. I ha

[slurm-users] How to install a newer slurm version on Ubuntu 18.04

2023-04-14 Thread mohammed shambakey
Hi I'm new to slurm, and sorry if this is a repeated email. I have a cluster at my work consisting of one head node, and 3 compute nodes. Ubuntu 22.04 is installed on the head node, and 2 compute nodes, whereas the third has Ubuntu 18.04 (it is needed because it hosts an old M10 GPU). I