Dear Yogendra,
It seems you are missing the PriorityMaxAge parameter.
Set it and the Age factor should start working.
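For example, a minimal slurm.conf sketch (the weights and the 7-day
saturation age below are illustrative values, not a recommendation):
  PriorityType=priority/multifactor
  PriorityWeightAge=1000
  PriorityWeightJobSize=1000
  PriorityMaxAge=7-0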
Regards,
Carles Fenoy
On Mon, Nov 11, 2013 at 11:04 AM, wrote:
> Hi Team
>
>
>
>
>
> We have enabled multifactor priority (PriorityWeightAge & PriorityJobSize)
> but Pririt
Dear Ulf,
You can try to define the parameter DefMemPerCPU=400 in your slurm.conf
file. This will set a default memory limit for each job to 400MB per core
requested.
This should be enough to fulfil your requirement.
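That is, in slurm.conf:
  DefMemPerCPU=400
Users can still override the default for a given job with --mem-per-cpu.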
Regards,
Carles Fenoy
Barcelona Supercomputing Center
On Thu, Dec 12, 2013 at 1
Hi Lucas,
It seems that your slurm is not compiled with munge support. Have you
compiled it yourself? Check the output of the configure command to verify
that munge was detected correctly, or pass the munge installation path with
the --with-munge parameter.
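For example (the munge prefix below is an assumption; adjust it to wherever
munge is installed on your system):
  ./configure --with-munge=/usr/local/munge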
Regards,
Carles Fenoy
Barcelona Supercomputing Center
Hi Lucas,
It seems that your nodes cannot reach your slurm controller. Do you have
any firewall configured on the compute nodes? Try with telnet to see whether
you can reach the controller port from a compute node.
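For example (replace ctl-host with your controller's hostname; 6817 is the
default SlurmctldPort):
  telnet ctl-host 6817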
Regards,
Carles Fenoy
Barcelona Supercomputing Center
On Fri, Apr 25, 2014 at 11:47 PM, L
Hi Michal,
What do you mean by "It's possible using LSF without MPI"? As far as I
know, a process cannot use resources on different compute nodes. Only
distributed programming models would allow the usage of different compute
nodes, but there must be at least one task per node, unless you use somethi
Hi,
We had some users complaining about the same behaviour, but for resident
memory. What we did was modify the accounting plugin to consider the
proportional set size (PSS) instead of RSS. This way the shared memory is
accounted only once, but proportionally for each process, so if 2 processes
share
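For reference, the PSS figures come from /proc; a quick way to inspect PSS
by hand for a single process (the PID is a placeholder):
  awk '/^Pss:/ {sum += $2} END {print sum " kB"}' /proc/1234/smaps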
Hi Chris,
You can create a cgroup restricting memory to the front-end memory amount
minus the amount needed by the system, and attach the slurm processes to
that cgroup. This way, if the OOM killer is invoked in the cgroup, it will
only kill tasks belonging to the cgroup.
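A rough sketch with the cgroup v1 memory controller (the cgroup name and the
48G limit are made-up examples; use the front-end memory minus the system
reserve, and the real slurmd PID instead of the placeholder):
  mkdir /sys/fs/cgroup/memory/slurm_frontend
  echo 48G > /sys/fs/cgroup/memory/slurm_frontend/memory.limit_in_bytes
  echo <slurmd_pid> > /sys/fs/cgroup/memory/slurm_frontend/tasks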
Regards,
Carles Fenoy
On 22
There is no way, but if the reason is AssociationJobLimit it has nothing to
do with the QOS. Check the user's association limits.
Carles Fenoy
On Tue, Sep 23, 2014 at 11:16 PM, Marcin Stolarek wrote:
>
>
> 2014-09-23 20:23 GMT+02:00 Eva Hocks :
>
>>
>>
>>
>> How can I get a job started after it was p
Hi Uwe,
Looks like you are missing an "s" at the end of the name. We have:
SchedulerParameters=max_switch_wait=864000,...
Regards,
Carles Fenoy
Barcelona Supercomputing Center
On Mon, Oct 20, 2014 at 10:27 AM, Uwe Sauter
wrote:
>
> Hi all,
>
> I'm trying to configure the scheduling parameter
Do you have MPI installed on all the cluster nodes?
On Wed, May 6, 2015 at 3:34 PM, Uwe Sauter wrote:
>
> Check the file permissions for libmpichcxx.so.1.2 as well as the
> permissions on the parent directories. Might be that you are not
> allowed to access the folder structure as the user you'r
des is
> not found. In the nodes openmpi-1.5.4-2 is installed.
>
>
>
> *From:* Carlos Fenoy [mailto:mini...@gmail.com]
> *Sent:* Wednesday, 6 May 2015 10:11
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Error while loading shared libraries
>
>
>
> Do yo
how it's made?
>
>
> -Original Message-
> From: Carlos Fenoy
> To: "slurm-dev"
> Date: Wed, 06 May 2015 07:27:31 -0700
> Subject: [slurm-dev] Re: Error while loading shared libraries
>
> Then you cannot execute a binary compiled with MPICH in the comp
Hi all,
I'm trying to account for the filesystem usage of each job. How accurate is
the data reported by sacct (fields MaxDiskRead and MaxDiskWrite)? Will this
data also be accurate for parallel filesystems such as Lustre or GPFS?
Regards,
Carles
--
Carles Fenoy
As already mentioned in several threads, you have to specify DefMemPerCPU
or DefMemPerNode. The default behavior of slurm is to allocate all the
memory.
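For example, one or the other in slurm.conf (the values are placeholders):
  DefMemPerCPU=2000
or
  DefMemPerNode=16000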
On Wed, Jun 17, 2015 at 10:19 AM, Saerda Halifu
wrote:
>
> Hi,
>
> Thanks for your answer, SelectType is set to select/cons_res.
>
> scontrol
Hi Jordan,
Check
http://stackoverflow.com/questions/25170763/how-to-change-how-frequently-slurm-updates-the-output-file-stdout/25189364#25189364
this question and answer on stackoverflow.
Regards,
Carlos
On Tue, Jul 14, 2015 at 3:15 PM, Aaron Knister
wrote:
>
> Hi Jordan,
>
> The answer is, we
works
Regards,
Carlos Fenoy
On Wed, Nov 4, 2015 at 2:40 PM, charlie hemlock
wrote:
> Hi Trevor/Triveni,
> Thank you for your responses.
>
> I have been able to use the sbatch --array to run all the jobs, *but only
> a single job/task executes per node.*
> In my example cluster
mutually exclusive.
>
> *NOTE: Enforcement of memory limits currently requires enabling of
> accounting, which samples memory use on a periodic basis (data need not be
> stored, just collected).*
> Perhaps CR_Core would be a better option
> and/or "enabling of accounting" requ
Trevor,
If using cons_res there is no need to specify Shared=YES unless you want to
share the same resources among different jobs.
From the slurm.conf man page:
YES Makes all resources in the partition available for sharing upon
request by the job. Resources will only be over-subscribed when
> # Scheduling
>
> #
>
> SelectType=select/cons_res
>
> SelectTypeParameters=*CR_CPU*
>
>
>
> #
>
> # COMPUTE NODES
>
> #
>
> NodeName=clunode[#] Procs = 12 ...
>
> On Wed, Nov 4, 2015 at 10:21 AM, Carlos Fenoy wrote:
>
>> Trevor,
>>
There seems to be a wrong character in the double dash "--": the "–exclude"
below has an en dash instead of two ASCII hyphens.
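It should be, e.g. (node names and script are placeholders):
  sbatch --exclude=node[01-02] job.sh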
On Tue, 24 Nov 2015, 22:04 Zentz, Scott C. wrote:
> Hello Everyone!
>
>
>
> I have a user who is trying to exclude some hosts from their job
> submission and was using –exclude to accomplish this. He claims that he was
> able to do t
Barbara, I don't think that is the issue here. The killer is the OOM killer,
not Slurm, so Slurm is not accounting the amount of memory incorrectly; it
seems that the cached memory is also counted in the cgroup, and that is what
is causing the OOM killer to kill gzip.
Regards,
Carlos
On Fri, Dec 18, 2015 at
You should not start the slurmctld on all the nodes, only on the head node
of the cluster; on the compute nodes start the slurmd with "service
slurm start".
On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz wrote:
> I had to turn off my cluster due to electricity problems, and now slurm is
> not workin
> to post the content of your slurm.conf file.
>
>
>
> Phil Eckert
>
> LLNL
>
>
>
> *From: *Fany Pagés Díaz
> *Reply-To: *slurm-dev
> *Date: *Monday, December 21, 2015 at 12:39 PM
> *To: *slurm-dev
> *Subject: *[slurm-dev] Re: slum in the nodes not w
Check the date and time on all the nodes; munge requires the clocks to be
synchronized.
On Mon, 4 Jan 2016, 20:12 Fany Pagés Díaz wrote:
> When I try to start slurm I have the following error in my node logs
>
>
>
> error: slurm_receive_msg: Zero Bytes were transmitted or received
>
> error: slurm_receive_msg: Zero Bytes were transmitted or r
Hi,
Have you tried the ElasticSearch job completion plugin? With Kibana you can
have nice charts for reporting and easily query the database to retrieve
the information you need. It does not support importing old jobs, but as
soon as you set it up you can start getting nice dashboards with Kibana.
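The relevant slurm.conf settings are roughly as follows (the URL is a
placeholder for your ElasticSearch server):
  JobCompType=jobcomp/elasticsearch
  JobCompLoc=http://elastic-host:9200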
Lehto
wrote:
>
> - Original Message -
> > From: "Christopher Samuel"
> > To: "slurm-dev"
> > Sent: Wednesday, 6 January, 2016 01:21:48
> > Subject: [slurm-dev] Re: Accounting, SlurmDBD and XDMoD
>
> > On 06/01/16 09:51, Carlos
Hi Loris,
Can you check when the job actually started and ended? It may be that
the job spans 2 days, and that is the reason sreport is reporting
less time.
Regards,
Carlos
On Fri, Jan 22, 2016 at 9:31 AM, Loris Bennett
wrote:
>
> Hi,
>
> Using version 15.08.4 I am looking at the va
Hi Glen,
I think your issue is with the MAINT flag in the reservation. Try removing
that flag and try again.
Regards,
Carlos
On Fri, Apr 15, 2016 at 4:09 PM, Glen MacLachlan wrote:
> Dear all,
>
> Wrapping up a maintenance period and I want to run some test jobs before I
> release the reservat
Hi,
You can see the script by running scontrol show job -dd JOBID
On Fri, 13 May 2016, 05:59 Husen R, wrote:
> Dear all,
>
> Does slurm provide a feature to get the command being executed/to be
> executed by running/pending jobs?
>
> The "squeue -O command" just gives the full path to sbatch
Hi,
You cannot have multiple default partitions; only the last one is set as
the default. If you check with scontrol show job you will see that only one
partition is "requested" by your job. You can submit to multiple
partitions, or use a job_submit plugin to assign multiple partitions if
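For what it's worth, submitting to several partitions is just a
comma-separated list (partition names and script are placeholders):
  sbatch -p partition1,partition2 job.sh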
On Tue, May 17, 2016 at 10:02 AM, Loris Bennett
wrote:
>
> Benjamin Redling
> writes:
>
> > On 2016-05-13 05:58, Husen R wrote:
> >> Does slurm provide a feature to get the command being executed/to be
> >> executed by running/pending jobs?
> >
> > scontrol show --detail job
> > or
> > scontr
If you do not specify the number of nodes, does it work as expected?
On Tue, 12 Jul 2016, 09:25 Loris Bennett,
wrote:
>
> Husen R writes:
>
> > Re: [slurm-dev] Re: number of processes in slurm job
> >
> > Hi,
> >
> > Thanks for your reply !
> >
> > I use this sbatch script
> >
> > #!/bin/bash
>
Hi,
I've seen the same with version 15.08.4. Running array jobs with just the
"hostname" command makes some jobs stay in the running state for several
minutes.
Regards,
Carlos
On Wed, Jul 13, 2016 at 3:38 PM, Marcin Stolarek
wrote:
> Hi guys,
>
> I have a cluster with a few nodes. Use
Is the slurmd process running in the bootcpuset?
On Wed, Jul 20, 2016 at 9:29 AM, Christopher Samuel
wrote:
>
> On 20/07/16 17:13, A. Podstawka wrote:
>
> > no direct error message, but the jobs get started in the bootcpuset
>
> Do the processes show up in the tasks file for that cgroup?
>
> Is
Try starting the slurmd in another cgroup, maybe one dedicated to slurm
On Wed, Jul 20, 2016 at 11:26 AM, A. Podstawka
wrote:
>
> Hi,
>
>
>
> Am 20.07.2016 um 09:28 schrieb Christopher Samuel:
>
>> On 20/07/16 17:13, A. Podstawka wrote:
>>
>> no direct error message, but the jobs get started in
Are all the jobs running on different nodes? What version of slurm are you
using? If you try to submit a job requesting 9 nodes, do you get an error?
On Tue, Jul 26, 2016 at 12:24 PM, Luque, N.B. wrote:
> Thanks a lot Kent Engström for your help.
> I guess that what I wanted was GrpNodes, I set t
Hi David,
Check the ReturnToService parameter of the slurm.conf file.
ReturnToService
    Controls when a DOWN node will be returned to service. The
    default value is 0. Supported values include:
    0   A node will remain in the DOWN state until a system
        administrator explic
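For instance, to let a node that went DOWN for being non-responsive return
to service automatically once slurmd registers again:
  ReturnToService=1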
Have you added the cluster to the database?
something like: "sacctmgr add cluster CLUSTER_NAME"
On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel
wrote:
>
> Hi,
>
> Thanks for your quick answer.
>
> In fact NodeName=DEFAULT is not the server's hostname, but matches all
> subsequent nodes defined
Marco,
It seems that your application is reading the slurm.conf; if it is an OmpSs
application, you may need to recompile the application or the runtime with
your modifications to slurm in order for it to understand the new parameter.
Regards,
Carlos
On Fri, Sep 16, 2016 at 2:05 PM, Marco D'Ami
Hi All,
I'm working on a plugin that stores the performance information of every
task of every job in InfluxDB. This can be visualized easily with Grafana,
and provides information on CPU and memory usage as well as reads and
writes from filesystems. This plugin is using the profile capability of sl
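A configuration sketch, assuming the parameter names of the acct_gather
framework (the host and database values are placeholders):
  # slurm.conf
  AcctGatherProfileType=acct_gather_profile/influxdb
  # acct_gather.conf
  ProfileInfluxDBHost=influx-host:8086
  ProfileInfluxDBDatabase=slurm
  ProfileInfluxDBDefault=ALL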
ults in real time as the job is running or only once it is
> finished?
> Thank you,
> Igor
>
>
>> On Mon, Sep 19, 2016 at 11:07 AM, Carlos Fenoy wrote:
>> Hi All,
>>
>> I'm working on a plugin that stores performance information of every task of
>
Hi,
You have set a MaxMemPerCPU lower than what you are asking for. Try
changing that and check if that solves the issue.
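You can check the value currently in effect with:
  scontrol show config | grep MaxMemPerCPU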
Regards,
Carlos
On Tue, Nov 1, 2016 at 10:27 PM, Chad Cropper
wrote:
> SBATCH submissions are not utilizing the --mem-per-cpu option for
> scheduling purposes. Also the Allo
Hi,
Is the user defined on all the compute nodes? Does it have the same UID on
all the hosts?
Regards,
Carlos
On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor
wrote:
> Paddy,
> Nope, it is exactly 8 characters: clwalton
>
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate
Hi,
You should take a look at the job_submit plugin. That is the best place to
decide whether a job should be queued or rejected.
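To enable it, set in slurm.conf (lua is just one of the available
implementations):
  JobSubmitPlugins=lua
and put the checks in a job_submit.lua next to slurm.conf (e.g.
/etc/slurm/job_submit.lua).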
Regards,
Carlos
On Thu, Jan 26, 2017 at 1:03 PM, Dmitry Chirikov wrote:
> Hi all,
>
> Playing with SPANK I ran into an issue.
> Seems I can't get
Are you specifying a memory limit for your jobs? You haven't set a default
limit per CPU, and slurm will allocate all the memory of a node if nothing
else is specified.
Regards,
Carlos Fenoy
On Sun, 12 Feb 2017, 22:54 Travis DePrato, wrote:
> Yep! Doing everything I can think of,
Hi Daniel,
If you don't need to run interactive jobs (srun, salloc) there should not
be any issue. You only need the client packages and the config files on the
submit hosts. The submit hosts must be able to reach the slurmctld host,
but they do not need to see the internal cluster.
Regards,
Hi,
Any idea why the output of your job is not complete? There is nothing after
"Copying files...". Does the /work/tants directory exist on all the nodes?
The variable $SLURM_JOB_NAME is interpreted by bash, so srun only sees "srun
-N2 -n2 rm -rf /work/tants/mpicopytest"
Regards,
Carlos
On Mon,
Hi Pablo,
This issue may be related to yours:
https://bugs.schedmd.com/show_bug.cgi?id=3260
Apparently there is a way to disable some internal dbd caching by compiling
with the flag --enable-memory-leak-debug.
Regards,
Carlos
On Fri, Sep 22, 2017 at 2:00 PM, Pablo Escobar
wrote:
> Hi,
>
> We ha