Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Marcus Wagner
I have filed a bug: https://bugs.schedmd.com/show_bug.cgi?id=6522 Let's see what SchedMD has to tell us ;) Best Marcus On 2/15/19 6:25 AM, Marcus Wagner wrote: NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=48,mem=182400M,node=1,billing=48 -- Marcus Wagner, Di

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Marcus Wagner
Hi Chris, that can't be right, or there is some bug elsewhere: We have configured CR_ONE_TASK_PER_CORE, so two tasks won't share a core and its hyperthread. According to your theory, I configured 48 threads. But then using just --ntasks=48 would give me two nodes, right? But Slurm schedules t
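A minimal sketch of the kind of submission being discussed, assuming a 4-socket node with 48 cores / 96 hyperthreads and SelectTypeParameters including CR_ONE_TASK_PER_CORE (job name and the srun payload are placeholders):

    #!/bin/bash
    #SBATCH --job-name=cr-one-task-test   # placeholder name
    #SBATCH --ntasks=48                   # one task per physical core expected
    #SBATCH --ntasks-per-node=48          # keep all tasks on a single node
    srun hostname                         # should print the same node name 48 times

The question in the thread is whether plain --ntasks=48 alone is enough, or whether Slurm treats the 96 hardware threads as CPUs and spreads the tasks over two nodes.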

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Christopher Samuel
On 2/14/19 12:22 AM, Marcus Wagner wrote: CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905 That's different to what you put in your config in the original email though. There you had: CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 This config
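For comparison, the two node definitions quoted above would look roughly like this in slurm.conf (the NodeName is a placeholder; only the topology fields come from the thread):

    # Original email: CPUs counts cores only, sockets given directly
    NodeName=nodeXX CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
    # Later email: CPUs includes hyperthreads, sockets given per board
    NodeName=nodeXX CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905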

Re: [slurm-users] Analyzing a stuck job

2019-02-14 Thread Christopher Samuel
On 2/14/19 8:02 AM, Mahmood Naderan wrote: One job is in RH state, which means JobHoldMaxRequeue. The output file, specified by --output, shows nothing suspicious. Is there any way to analyze the stuck job? This happens when a job fails to start MAX_BATCH_REQUEUE times (which is 5 at the mo
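A few commands commonly used to dig into a requeue-held job like this, assuming a placeholder job ID of 12345:

    scontrol show job 12345                                        # full job record, including Reason and Requeue count
    sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,NodeList  # history of the failed start attempts
    grep 12345 /var/log/slurmctld.log                              # log path is an assumption; see SlurmctldLogFile in slurm.conf
    scontrol release 12345                                         # clear the hold once the cause is fixed

The slurmd log on the node(s) where the job tried to start is usually where the actual launch failure shows up.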

[slurm-users] Analyzing a stuck job

2019-02-14 Thread Mahmood Naderan
Hi, One job is in RH state, which means JobHoldMaxRequeue. The output file, specified by --output, shows nothing suspicious. Is there any way to analyze the stuck job? Regards, Mahmood

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Marcus Wagner
Hi Andreas, it might be that this is one of the bugs in Slurm 18. I think I will open a bug report and see what they say. Thank you very much, nonetheless. Best Marcus On 2/14/19 2:36 PM, Andreas Henkel wrote: Hi Marcus, for us slurmd -C as well as numactl -H looked fine, too. But we're

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Andreas Henkel
Hi Marcus, for us slurmd -C as well as numactl -H looked fine, too. But we're using task/cgroup only and every job starting on a skylake node gave us error("task/cgroup: task[%u] infinite loop broken while trying " "to provision compute elements using %s (bitmap:%s)", from src/plugins/task/cg
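The error comes from the task/cgroup plugin; "using task/cgroup only" refers to configuration along these lines (a sketch, not the poster's actual files):

    # slurm.conf
    TaskPlugin=task/cgroup          # cgroup-only binding; the common alternative is task/affinity,task/cgroup

    # cgroup.conf
    ConstrainCores=yes              # confine tasks to their allocated cores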

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Marcus Wagner
Hi Andreas, as slurmd -C shows, it detects 4 NUMA nodes and takes these as sockets. This was also the way we configured Slurm. numactl -H clearly shows the four domains and which belongs to which socket:
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  2
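Both commands can be run directly on the compute node to compare the detected topology against slurm.conf:

    slurmd -C     # prints a NodeName=... line with the hardware layout Slurm detects
    numactl -H    # prints the NUMA nodes, their CPUs and the distance matrix quoted above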

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Henkel, Andreas
Hi Marcus, We have skylake too and it didn't work for us. We used cgroups only and process binding went completely haywire with subnuma enabled. While searching for solutions I found that hwloc does support subnuma only with version > 2 (when looking for skylake in hwloc you will get hits in versi
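A quick way to check which hwloc version is in play, assuming hwloc's command-line tools are installed on the node:

    lstopo --version                   # reports the installed hwloc version (1.11.x vs 2.x)
    ldd $(which slurmd) | grep hwloc   # which hwloc library slurmd is actually linked against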

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Marcus Wagner
Hi Andreas, On 2/14/19 8:56 AM, Henkel, Andreas wrote: Hi Marcus, More ideas: CPUs doesn't always count as a core but may take the meaning of one thread, hence the difference. Maybe the behavior of CR_ONE_TASK is still not solid nor properly documented, and ntasks and ntasks-per-node are honor