I have filed a bug:
https://bugs.schedmd.com/show_bug.cgi?id=6522
Let's see what SchedMD has to tell us ;)
Best
Marcus
On 2/15/19 6:25 AM, Marcus Wagner wrote:
NumNodes=1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,mem=182400M,node=1,billing=48
--
Marcus Wagner, Di
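Allocation details like the NumCPUs/NumTasks/TRES lines quoted above come from the job record and can be pulled with scontrol; a minimal sketch, with a placeholder job id:

    scontrol show job <jobid>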
Hi Chris,
that can't be right, or there is a bug somewhere else:
We have configured CR_ONE_TASK_PER_CORE, so two tasks should never share a
core and its hyperthread.
According to your theory, I configured 48 threads. But then using just
--ntasks=48 would give me two nodes, right?
But Slurm schedules t
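For reference, a minimal sketch of the kind of configuration under discussion, assuming select/cons_res with memory-aware scheduling (the SelectTypeParameters combination and the node name are assumptions; the node line follows the values quoted below):

    # slurm.conf (sketch)
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
    NodeName=nodeXX CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905

With that, a plain --ntasks=48 request on a 48-core node should still fit on a single node, since every task is given a whole core rather than a single hyperthread.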
On 2/14/19 12:22 AM, Marcus Wagner wrote:
CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2
RealMemory=191905
That's different from what you put in your config in the original email
though. There you had:
CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2
This config
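The difference comes down to whether CPUs counts hardware threads or physical cores; the two node lines from the thread side by side (NodeName is a placeholder):

    # every hardware thread counted (4 sockets x 12 cores x 2 threads = 96):
    NodeName=nodeXX CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905
    # cores counted instead, so a Slurm "CPU" maps to one physical core:
    NodeName=nodeXX CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2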
On 2/14/19 8:02 AM, Mahmood Naderan wrote:
One job is in RH state, which means JobHoldMaxRequeue.
The output file, specified by --output, shows nothing suspicious.
Is there any way to analyze the stuck job?
This happens when a job fails to start MAX_BATCH_REQUEUE times
(which is 5 at the mo
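A few places to look when a job is stuck like that; a sketch, with a placeholder job id (the log path is an assumption, see SlurmctldLogFile in slurm.conf):

    scontrol show job <jobid>                    # check Reason=, Restarts= and the assigned node
    sacct -j <jobid> --format=JobID,State,ExitCode,NodeList
    grep <jobid> /var/log/slurmctld.log          # why the batch launch kept failing
    scontrol release <jobid>                     # clear the hold once the cause is fixed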
Hi,
One job is in RH state, which means JobHoldMaxRequeue.
The output file, specified by --output, shows nothing suspicious.
Is there any way to analyze the stuck job?
Regards,
Mahmood
Hi Andreas,
it might be that this is one of the bugs in Slurm 18.
I think I will open a bug report and see what they say.
Thank you very much, nonetheless.
Best
Marcus
On 2/14/19 2:36 PM, Andreas Henkel wrote:
Hi Marcus,
for us slurmd -C as well as numactl -H looked fine, too. But we're
Hi Marcus,
for us slurmd -C as well as numactl -H looked fine, too. But we're using
task/cgroup only and every job starting on a skylake node gave us
|error("task/cgroup: task[%u] infinite loop broken while trying " "to
provision compute elements using %s (bitmap:%s)", |
from src/plugins/task/cg
Hi Andreas,
as slurmd -C shows, it detects 4 NUMA nodes and takes them as sockets.
That is also how we configured Slurm.
numactl -H clearly shows the four domains and which one belongs to which socket:
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  2
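To cross-check what the scheduler sees against the kernel's view of the topology, a short sketch:

    slurmd -C              # hardware exactly as slurmd detects it
    numactl -H             # NUMA domains and distances as reported by the kernel
    lscpu | grep -i numa   # quick summary of NUMA node count and CPU mapping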
Hi Marcus,
We have Skylake too, and it didn't work for us. We used cgroups only, and process
binding went completely haywire with sub-NUMA enabled.
While searching for solutions I found that hwloc supports sub-NUMA only with
version > 2 (when looking for Skylake in hwloc you will get hits in versi
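If the hwloc version is the suspect, it is worth checking both the tools installed on the node and the library slurmd is actually linked against; a sketch (paths are assumptions):

    lstopo --version                      # hwloc version of the command-line tools
    ldd $(which slurmd) | grep hwloc      # hwloc library slurmd is linked against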
Hi Andreas,
On 2/14/19 8:56 AM, Henkel, Andreas wrote:
Hi Marcus,
More ideas:
CPUs doesn't always count cores but may mean one hardware thread, which
makes a difference.
Maybe the behavior of CR_ONE_TASK is still not solid nor properly documente
and ntasks and ntasks-per-node are honor
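For completeness, a sketch of the kind of batch request this thread revolves around, matching the job record quoted earlier (48 tasks, one CPU per task, one node); the application name is a placeholder:

    #!/bin/bash
    #SBATCH --nodes=1           # expect one node if each task gets a full core
    #SBATCH --ntasks=48
    #SBATCH --cpus-per-task=1
    srun ./my_app               # placeholder application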