Hi all,
I found a way to avoid oversubscribing: I had to comment out this configuration:
PreemptMode=Suspend,Gang
PreemptType=preempt/partition_prio
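After the change takes effect, the active preemption settings can be confirmed with a standard query (requires a running slurmctld, so this is just the command sketch):

```
scontrol show config | grep -i preempt
```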
In my current configuration, all the partitions are at the same priority. At
times, I increase the priority of a partition and jobs in other partit
Great. Thank you very much. It passed the problematic point.
On Tue, Apr 17, 2018, 19:24 Ole Holm Nielsen
wrote:
> On 04/17/2018 04:38 PM, Mahmood Naderan wrote:
> > That parameter is used in slurm.conf. Should I modify that only on the
> > head node? Or all nodes? Then should I restart slurm
I had a job running last night, with a 30 minute timeout. (It's a
well-tested script that runs multiple times daily.)
On one run, in the middle of a set of runs for this job, I got this on the
console after about 8 minutes:
srun: forcing job termination
srun: Job step aborted: Waiting up to 32
FYI, I just discovered that doc/man/man1/Makefile does not respect
configure's --htmldir flag:
[root@centosdev slurm]# fgrep '$ ./configure' work/slurm-17.11.5/config.log
$ ./configure --bindir=/usr/pkg/bin
--htmldir=/usr/pkg/share/doc/slurm-wlm/html --with-munge=/usr/pkg
--with-hwloc=/usr
Hi,
I have given four users the operator role, and they are all coordinators of the
relevant accounts. However, when I su to one of these users, they get a
permission-denied error when trying to cancel a job.
What am I missing?
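For reference, the grants were made roughly like this (account and user names are placeholders; both forms are standard sacctmgr usage, but I may be missing a required option):

```
# Give the user the operator admin level:
sacctmgr modify user someuser set adminlevel=operator

# Add the user as a coordinator of an account:
sacctmgr add coordinator account=someaccount names=someuser
```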
Ronan
On 04/17/2018 04:38 PM, Mahmood Naderan wrote:
That parameter is used in slurm.conf. Should I modify that only on the
head node? Or all nodes? Then should I restart slurm processes?
Yes, definitely! I collected the detailed instructions here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configurat
That parameter is used in slurm.conf. Should I modify that only on the
head node? Or all nodes? Then should I restart slurm processes?
Regards,
Mahmood
On Tue, Apr 17, 2018 at 4:18 PM, Chris Samuel wrote:
> On Tuesday, 17 April 2018 7:23:40 PM AEST Mahmood Naderan wrote:
>
>> [hamid@rocks7 ca
Hello,
I have a recurring error in the log of slurmctld:
[2018-04-10T19:32:40.145] error: _unpack_ret_list: message type 24949,
record 0 of 56214
[2018-04-10T19:32:40.145] error: invalid type trying to be freed 24949
[2018-04-10T19:32:40.145] error: unpacking header
[2018-04-10T19:32:40.145] erro
Hi Chris,
> You might want to double check the config is acting as expected with:
>
> scontrol show part | fgrep OverSubscribe
PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
Priorit
Hi Gareth,
Your assessment is also what I would have thought: MaxRSS should be the
maximum of the sum of all RSS values in a sample. Swap and shared memory do
complicate things, but I think most people expect jobs to be killed only if
their RSS exceeds their memory request.
That being said as far as I un
Hmmm... the man page says of "reduce_completing_frag,"
"By default if a job is found completing then no jobs are scheduled. If
this parameter is used the node in a completing job are taken out of
consideration."
This feels like it's missing a word or two. The first sentence says
that, by def
I think the situation is likely to be a little different. Let’s consider a
fortran program that statically or dynamically defines large arrays. This
defines a virtual memory size – like declaring that this is the maximum amount
of memory you might use if you fill the arrays. That amount of real
Hi Loris,
Thanks for your explanation!
I would have interpreted it as max(sum()).
Is there a way to get max(sum()), or at least some form of sum()? The
assumption that all processes are peaking at the same value is not a valid
one unless all threads have essentially the same workload...
Thanks again!
E
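To make the distinction concrete, here is a small sketch with invented per-process RSS samples (not real sacct data): the peak of any single process can sit far below the job's true peak when memory is spread across processes, while max-over-time of the per-sample sum captures the job-wide peak.

```python
# Each inner list is one sampling interval: the RSS in MB of each process
# in the job at that instant.  All numbers are invented for illustration.
samples = [
    [500, 500, 500, 500],   # t0: memory spread evenly across processes
    [900, 800, 850, 950],   # t1: the job-wide peak
    [100, 100, 100, 100],   # t2: winding down
]

# Peak RSS of any single process (what a per-task maximum reports):
max_single = max(rss for sample in samples for rss in sample)

# max(sum()): peak of the job's *total* RSS across sampling intervals:
max_total = max(sum(sample) for sample in samples)

print(max_single)  # 950
print(max_total)   # 3500
```

With workloads this uneven, a per-process maximum of 950 MB says little about the 3500 MB the job actually held at its peak.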
On Tuesday, 17 April 2018 7:41:18 PM AEST De Giorgi Jean-Claude wrote:
> Thanks a lot for your help.
> Yes, I misunderstood the "format" part.
>
> Thank you for your example.
My pleasure, glad it was useful! We have a newer version of Slurm which has
different (& more) format options to you so
On Tuesday, 17 April 2018 7:23:40 PM AEST Mahmood Naderan wrote:
> [hamid@rocks7 case1_source2]$ scontrol show config | fgrep VSizeFactor
> VSizeFactor = 110 percent
Great, I think that's the cause of the limit you are seeing.
VSizeFactor
Memory specifications
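As a quick arithmetic sketch of what that setting implies (the job numbers are invented): the enforced virtual memory cap is the job's real-memory allocation scaled by the VSizeFactor percentage, so a 4000 MB request under VSizeFactor=110 yields a 4400 MB cap.

```python
def vsize_limit_mb(mem_request_mb, vsize_factor_percent):
    """Virtual memory cap implied by Slurm's VSizeFactor: the job's
    real-memory allocation scaled by the configured percentage."""
    return mem_request_mb * vsize_factor_percent // 100

# A job requesting 4000 MB under VSizeFactor=110 gets a 4400 MB cap:
print(vsize_limit_mb(4000, 110))  # 4400
```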
Hi Eli,
"E.S. Rosenberg" writes:
> Hi fellow slurm users,
> We have been struggling for a while with understanding how MaxRSS is reported.
>
> This is because jobs often die with MaxRSS not even approaching 10% of
> the requested memory.
>
> I just found the following document:
> https:/
Hello,
Am 16.04.2018 um 18:50 schrieb Michael Di Domenico:
> On Mon, Apr 16, 2018 at 6:35 AM, wrote:
> perhaps i missed something in the email, but it sounds like you have
> 56 cores, you have two running jobs that consume 52 cores, leaving you
> four free.
No. From the original mail:
<--- %
Hi fellow slurm users,
We have been struggling for a while with understanding how MaxRSS is
reported.
This is because jobs often die with MaxRSS not even approaching 10% of the
requested memory.
I just found the following document:
https://research.csc.fi/-/a
It says:
"*maxrss *= maximum
Loris, Ole, thank you so much. That is the Python script I was thinking of.
On 17 April 2018 at 11:15, Ole Holm Nielsen
wrote:
> On 04/17/2018 10:56 AM, John Hearns wrote:
>
>> Please can some kind soul remind me what the Python code for mangling
>> Slurm and PBS machinefiles is called please?
Hi Chris,
Thanks a lot for your help.
Yes, I misunderstood the "format" part.
Thank you for your example.
Regards,
Jean-Claude
On 17.04.18, 05:43, "slurm-users on behalf of Chris Samuel"
wrote:
On Tuesday, 17 April 2018 12:52:04 AM AEST De Giorgi Jean-Claude wrote:
> Accordin
Hello Daniel,
Thank you for your information.
That’s very helpful.
Regards,
Jean-Claude
From: slurm-users on behalf of Daniel
Grimwood
Reply-To: Slurm User Community List
Date: Tuesday, 17 April 2018 at 05:10
To: 'Slurm User Community List'
Subject: Re: [slurm-users] SLURM's reservations
See
[hamid@rocks7 case1_source2]$ scontrol show config | fgrep VSizeFactor
VSizeFactor = 110 percent
Regards,
Mahmood
On Tue, Apr 17, 2018 at 12:51 PM, Chris Samuel wrote:
> On Tuesday, 17 April 2018 5:08:09 PM AEST Mahmood Naderan wrote:
>
>> So, UsePAM has not been set. So, slu
On 04/17/2018 10:56 AM, John Hearns wrote:
Please can some kind soul remind me what the Python code for mangling
Slurm and PBS machinefiles is called please? We discussed it here about
a year ago, in the context of running Ansys.
I have a Cunning Plan (TM) to recode it in Julia, for no real re
Loris Bennett writes:
> Hi John,
>
> John Hearns writes:
>
>> Please can some kind soul remind me what the Python code for mangling
>> Slurm and PBS machinefiles is called please? We discussed it here
>> about a year ago, in the context of running Ansys.
>>
>> I have a Cunning Plan (TM) to recod
Hi John,
John Hearns writes:
> Please can some kind soul remind me what the Python code for mangling
> Slurm and PBS machinefiles is called please? We discussed it here
> about a year ago, in the context of running Ansys.
>
> I have a Cunning Plan (TM) to recode it in Julia, for no real reason
>
On 04/17/2018 09:14 AM, David Rodríguez wrote:
Thanks Chris!
Thanks Ole!
In fact, I followed your wiki, but I had many doubts about whether to use
version 17.11 or 17.02, because I don't know the differences between them.
Finally, I installed the latest one.
Always install the latest and greatest ver
Please can some kind soul remind me what the Python code for mangling Slurm
and PBS machinefiles is called please? We discussed it here about a year
ago, in the context of running Ansys.
I have a Cunning Plan (TM) to recode it in Julia, for no real reason other
than curiosity.
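Not the script being asked about, but a minimal sketch of the kind of mangling involved: expanding a simple Slurm bracket expression (only the single `prefix[NN-MM]` case, with zero-padding preserved) into the one-host-per-line form a PBS machinefile uses. The nodelist and task counts here are invented; real Slurm nodelists can be far more complex.

```python
import re

def expand_nodelist(nodelist):
    """Expand a simple Slurm nodelist like 'node[01-03]' into
    ['node01', 'node02', 'node03'].  Handles only the single
    bracketed-range case; a plain hostname passes through unchanged."""
    m = re.fullmatch(r"([^\[]+)\[(\d+)-(\d+)\]", nodelist)
    if not m:
        return [nodelist]
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # preserve zero-padding of the range start
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

def machinefile_lines(nodelist, tasks_per_node):
    """PBS-style machinefile: each host repeated once per task slot."""
    return [h for h in expand_nodelist(nodelist)
            for _ in range(tasks_per_node)]

if __name__ == "__main__":
    print("\n".join(machinefile_lines("node[01-03]", 2)))
```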
On Tuesday, 17 April 2018 5:26:26 AM AEST Stéphane Larose wrote:
> So some jobs are now sharing the same cores but I don’t understand why since
> OverSubscribe is set to no.
You might want to double check the config is acting as expected with:
scontrol show part | fgrep OverSubscribe
Also what
On Monday, 16 April 2018 11:00:59 PM AEST Sysadmin CAOS wrote:
> However, this script is not working as I desire because logged value for
> job_desc.max_cpus return a value too big. I get "4294967294" for
> job_desc.max_cpus and "1" for job_desc.max_nodes...
That large number means the limit has
On Tuesday, 17 April 2018 5:08:09 PM AEST Mahmood Naderan wrote:
> So, UsePAM has not been set. So, slurm shouldn't limit anything. Is
> that correct? however, I see that slurm limits the virtual memory size
What does this say?
scontrol show config | fgrep VSizeFactor
--
Chris Samuel : htt
On Tuesday, 17 April 2018 5:14:39 PM AEST David Rodríguez wrote:
> There are two steps that change a little from the current version. For
> example, in slurm-17.11.5-1 "slurm-plugins-$VER*rpm" does not appear.
Yes, that's a change documented in the RELEASE_NOTES for Slurm 17.11.x.
NOTE: The slurm.spec
Thanks Chris!
Thanks Ole!
In fact, I followed your wiki, but I had many doubts about whether to use
version 17.11 or 17.02, because I don't know the differences between them.
Finally, I installed the latest one.
There are two steps that change a little from the current version. For example,
in slurm-17.11.5-1
Hi Bill,
Sorry for the late reply. When I grep for pam_limits.so, I see:
[root@rocks7 ~]# grep -r pam_limits.so /etc/pam.d/
/etc/pam.d/sudo:session    required    pam_limits.so
/etc/pam.d/runuser:session    required    pam_limits.so
/etc/pam.d/sudo-i:session    required    pam_limit
Hello all,
I am submitting a job to a SLURM scheduler; the job contains an array of
small jobs.
For example, here's a script that simply prints out the date and hostname
of the compute node from within a heredoc:
---
#!/bin/bash
...(variables)...
sbatch --parsable --partition=${job
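An untested sketch of that pattern (the partition name and array range are invented, and this needs a running cluster to execute): submit the heredoc as an array job whose tasks print the date and hostname, capturing the job ID via --parsable.

```
#!/bin/bash
# Hypothetical sketch -- placeholder partition, requires a Slurm cluster.
jobid=$(sbatch --parsable --partition=debug --array=1-4 <<'EOF'
#!/bin/bash
#SBATCH --time=00:05:00
date
hostname
EOF
)
echo "submitted array job ${jobid}"
```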