Re: [slurm-users] About x11 support

2018-11-26 Thread Goetz, Patrick G
I'm a little confused about how this would work. For example, where does slurmctld run? And if on each submit host, why aren't the control daemons stepping all over each other? On 11/22/18 6:38 AM, Stu Midgley wrote: > indeed. > > All our workstations are submit hosts and in the queue, so
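For context, slurmctld itself normally runs only on the host named in slurm.conf; submit hosts just need the client commands and the same configuration file, so no extra control daemons are involved. A minimal sketch with placeholder hostnames:

    # slurm.conf shared by the controller, compute nodes and workstations
    SlurmctldHost=ctl01            # only this host runs slurmctld (ControlMachine= in older releases)
    # Workstations acting purely as submit hosts run no daemon at all; they
    # only need the client commands (sbatch, srun, squeue) plus this file.
    # Workstations that also accept jobs additionally run slurmd and are
    # listed as compute nodes:
    NodeName=ws[01-20] CPUs=8 State=UNKNOWN
    PartitionName=workstations Nodes=ws[01-20] Default=NO State=UP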

Re: [slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-26 Thread Michael Robbert
I believe that fragmentation only happens on routers when passing traffic from one subnet to another. Since this traffic was all on a single subnet, there was no router involved to fragment the packets. Mike On 11/26/18 1:49 PM, Kenneth Roberts wrote: D’oh! The compute nodes had different MTU
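A quick way to confirm this sort of mismatch is to compare the interface MTU on the master and the compute nodes and align them; a sketch assuming pdsh is available and eth0 is the relevant interface:

    # report the MTU on the master and each compute node
    pdsh -w master,c[1-4] "ip link show eth0 | grep -o 'mtu [0-9]*'"
    # set a node back to the standard 1500-byte MTU until the change can be
    # made persistent in the interface configuration
    ip link set dev eth0 mtu 1500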

Re: [slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-26 Thread Kenneth Roberts
D'oh! The compute nodes had a different MTU on their network interfaces than the master. Once they were all set to 1500, it works! So ... any ideas why that was a problem? Maybe the interfaces had fragmentation disabled and there were dropped packets? Thanks for listening. Ken From: slurm-users

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-26 Thread Peter Kjellström
On Thu, 22 Nov 2018 01:51:59 +0800 (GMT+08:00) 宋亚磊 wrote: > Hello everyone, > > How do I check the percent CPU of a job in Slurm? I tried sacct, sstat, > and squeue, but I can't find how to do it. Can someone help me? I've written a small tool, jobload, that takes a jobid and outputs current
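Without a site-specific tool, one rough way to derive a percentage from the accounting data is to compare consumed CPU time with allocated CPU time; a sketch with a placeholder job ID, assuming accounting is enabled:

    # completed (or running) job: consumed vs. allocated CPU time
    sacct -j 12345 --format=JobID,AllocCPUS,Elapsed,TotalCPU,CPUTime
    # roughly: %CPU ~ TotalCPU / (Elapsed * AllocCPUS) * 100

    # live per-step figures for a running job (requires a JobAcctGather plugin)
    sstat -j 12345 --format=JobID,AveCPU,MaxRSS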

Re: [slurm-users] About x11 support

2018-11-26 Thread Brendan Moloney
I posted about the local display issue a while back ("Built in X11 forwarding in 17.11 won't work on local displays"). I agree that having some locally managed workstations that can also act as submit nodes is not so uncommon. However, we also ran into this on our official "login nodes" because we

Re: [slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-26 Thread Kenneth Roberts
I wasn't looking closely enough at the times in the log file. c2: [2018-11-26T10:09:40.963] debug3: in the service_connection c2: [2018-11-26T10:10:00.983] debug: slurm_recv_timeout at 0 of 9589, timeout c2: [2018-11-26T10:10:00.983] error: slurm_receive_msg_and_forward: Socket timed out on

Re: [slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-26 Thread Kenneth Roberts
Here is the debug log on a node (c2) when the job fails c2: [2018-11-26T07:35:56.261] debug3: in the service_connection c2: [2018-11-26T07:36:16.281] debug: slurm_recv_timeout at 0 of 9680, timeout c2: [2018-11-26T07:36:16.282] error: slurm_receive_msg_and_forward: Socket timed out on

[slurm-users] Socket exclusive

2018-11-26 Thread Daniel Barker
Hi, All, I have a heterogeneous cluster in which some users need to submit socket-exclusive jobs. All of the nodes have enough cores on a single socket for the jobs to run. Is there a way to submit a job which is socket-exclusive without specifying the core count? Something like this, but with
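One possible approach, not from this thread: configure the consumable-resource plugin to allocate whole sockets, so every allocation is rounded up to full sockets without the submitter counting cores; a hedged sketch with placeholder values:

    # slurm.conf: allocate in units of whole sockets cluster-wide
    SelectType=select/cons_res
    SelectTypeParameters=CR_Socket

    # per job: ask for a single socket on one node and let the plugin fill in the cores
    sbatch --nodes=1 --sockets-per-node=1 job.sh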

[slurm-users] Configuring partition limit MaxCPUsPerNode

2018-11-26 Thread Michael Gutteridge
I'm either misunderstanding how to configure the limit "MaxCPUsPerNode" or how it behaves. My desired end-state is that if a user submits a job to a partition requesting more resources (CPUs) than are available on any node in that partition, the job will be immediately rejected, rather than
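For reference, the two knobs involved are the partition-level limit itself and the cluster-wide EnforcePartLimits option, which controls whether jobs violating partition limits are rejected at submission instead of being left pending; a hedged slurm.conf sketch with placeholder names and counts:

    # slurm.conf (illustrative values only)
    EnforcePartLimits=ALL        # check partition limits at submit time
    PartitionName=bigmem Nodes=bm[01-04] MaxCPUsPerNode=24 State=UP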

Re: [slurm-users] how to find out why a job won't run?

2018-11-26 Thread R. Paul Wiegand
Steve, This doesn't really address your question, and I am guessing you are aware of this; however, since you did not mention it: "scontrol show job <jobid>" will give you a lot of detail about a job (a lot more than squeue). Its "Reason" is the same as in sinfo and squeue, though. So no help there.
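For completeness, the commands being compared, with a placeholder job ID:

    # full job record, including the same Reason field that squeue reports
    scontrol show job 12345
    # compact view of pending jobs with their reason codes
    squeue --states=PENDING -o "%.10i %.9P %.8u %.2t %.10M %r"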

Re: [slurm-users] About x11 support

2018-11-26 Thread Marcus Wagner
Hi Chris, I really think it is not that uncommon, but in a different way than Tina explained. We HAVE dedicated login nodes for the cluster; no institute can submit from their workstations, they have to log in to our login nodes. BUT they can do that not only by logging in via ssh, but also via FastX,

Re: [slurm-users] SlurmctlDebug=

2018-11-26 Thread Bjørn-Helge Mevik
The numerical values were used first, then the symbolic values were added. Perhaps you could just look in the slurmctld.log output to see what the highest log level reported there is? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
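A short sketch of the symbolic style, plus the runtime equivalent (values are illustrative; older configurations used plain numbers instead of the names):

    # slurm.conf: symbolic log level for the controller
    SlurmctldDebug=info
    # raise the running daemon's level without a restart
    scontrol setdebug debug2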

Re: [slurm-users] how to find out why a job won't run?

2018-11-26 Thread Daan van Rossum
I'm also interested in this. Another example: "Reason=(ReqNodeNotAvail)" is all that a user sees when their job's walltime runs into a system maintenance reservation. * on Friday, 2018-11-23 09:55 -0500, Steven Dick wrote: > I'm looking for a tool that will tell me why a
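One way for a user to check whether an upcoming reservation is the culprit, with a placeholder job ID:

    # list current and upcoming reservations (maintenance windows appear here)
    scontrol show reservation
    # compare the job's time limit and expected start time against that window
    squeue -j 12345 -o "%.10i %.10l %.20S %r"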