[slurm-users] Priority wait

2017-11-14 Thread Zohar Roe MLM
Hello, trying again, with the slurm.conf this time. I have a cluster named Autobot. In this cluster I have servers Optimus[1-10] and Megatron[1-10]. I sent 3000 jobs with feature Optimus; some are running while others are pending, which is OK. But I have sent 1000 jobs to Megatron and they are a

[slurm-users] Jobs take more time then what they need

2017-11-14 Thread Zohar Roe MLM
Hello, I am having another strange problem with Slurm 17.02.6. I have a cluster with 250 CPUs. I am sending a test job that only sleeps for 60 seconds. A lot of the jobs are taking more than 7 or 8 minutes until they finish running (I can see them in RUNNING mode for more than 7 minutes). Is there a re

[slurm-users] Jobs in pending state

2018-04-29 Thread Zohar Roe MLM
Hello. I have 2 clusters in my slurm.conf: CLUS_WORK1 server1 server2 server3 CLUS_WORK2 pc1 pc2 pc3 When I send 10,000 jobs to CLUS_WORK1 they are fine and start running, while a few are in pending state (which is OK). But if I send new jobs to CLUS_WORK2, which is idle, I see that the j

Re: [slurm-users] Jobs in pending state

2018-04-30 Thread Zohar Roe MLM
econd partition is getting primarily scheduled by the backfill scheduler.  I would try the partition_job_depth option as otherwise the main loop only looks at priority order and not by partition. -Paul Edmon- On 4/29/2018 5:32 AM, Zohar Roe MLM wrote: > Hello. > I am having 2 cluster in my

[slurm-users] Question about sacct

2018-05-15 Thread Zohar Roe MLM
Hello, I am trying to understand some problems with the "sacct" command. I sent 10 jobs to Slurm and I can see them all running with the squeue command. Now, when I run "sacct -j 398000" to check one of the jobs, I see two problems: 1) It takes the sacct command about 3 minutes to return r

Re: [slurm-users] Question about sacct

2018-05-16 Thread Zohar Roe MLM
] Question about sacct Is accounting set up to use a slurmdbd/database backend or file (AccountingStorageType)? 3 minutes could make sense if data are being stored in a (large) flat file. From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Zohar Roe MLM Sent: 16 May 2018 07

[slurm-users] Can't find an address

2018-10-24 Thread Zohar Roe MLM
Hello, I have a node that for some reason changes state to "Down" every few minutes. When I change it with scontrol to "resume" it's OK until it goes Down again. In the Slurm server log I can see the error: "agent/is_node_resp: node:myName1 RPC:REQUEST_PING : Can't find an address, check slurm.conf" Now, The

Re: [slurm-users] Can't find an address

2018-10-25 Thread Zohar Roe MLM
the server can't find it (and it happens every two minutes, always). Thanks for your ideas, Roy. From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Lachlan Musicman Sent: Thursday, October 25, 2018 1:59 AM To: Slurm User Community List Subject: Re: [slurm-use

Re: [slurm-users] Can't find an address

2018-10-27 Thread Zohar Roe MLM
hat the "hostname" command returns the same name that Slurm > expects on your compute nodes. > > ____________ > From: Zohar Roe Mlm > Sent: Thursday, October 25, 2018 3:02AM > To: 'Slurm User Community List' > Cc: > Subject: Re: [slu

[slurm-users] Give priority to specific server

2019-07-14 Thread Zohar Roe MLM
Hello, I have two servers in my slurm.conf: NodeName=serv1 NodeAddr=131.100.100.1 CPUs=4 RealMemory=256000 Features=test,workserv NodeName=serv2 NodeAddr=131.100.100.2 CPUs=4 RealMemory=256000 Features=test,workserv When I send a job with feature "test", the server "serv1" always ge

Re: [slurm-users] Give priority to specific server

2019-07-14 Thread Zohar Roe MLM
d HPC Team Lead | Research Platform Services Research Computing | CoEPP | School of Physics University of Melbourne On Sun, 14 Jul 2019 at 18:41, Zohar Roe MLM <rzoh...@iai.co.il> wrote: Hello, I am having two servers in my slurm.conf: NodeName=serv1 NodeAddr=131.100.100.1 CPUs=4 R

[slurm-users] Servers in pending state

2020-03-11 Thread Zohar Roe MLM
Hello, I have a queue with 6 servers. When 4 of the servers are under heavy load and I send new jobs to the other 2 servers, which are free and under a different partition and features, the jobs still stay pending (it can take them 20 minutes to start running). If I change their priority with "s