[slurm-dev] Re: Slurm configuration problem --Age factor not working @ all

2013-11-11 Thread Carlos Fenoy
Dear Yogendra, It seems you are missing the PriorityMaxAge parameter. Set it and the Age factor should start working. Regards, Carles Fenoy On Mon, Nov 11, 2013 at 11:04 AM, wrote: > Hi Team > > We have enabled multifactor priority (PriorityWeightAge & PriorityJobSize) > but Priorit
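
A minimal slurm.conf sketch of the fix, assuming the multifactor plugin is already enabled (the weight and age values are illustrative):

    PriorityType=priority/multifactor
    PriorityWeightAge=1000
    PriorityMaxAge=7-0    # age factor reaches its maximum after 7 days in the queue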

[slurm-dev] Re: Scheduling for remaining memory

2013-12-12 Thread Carlos Fenoy
Dear Ulf, You can try to define the parameter DefMemPerCPU=400 in your slurm.conf file. This will set a default memory limit for each job of 400MB per core requested, which should be enough to fulfil your requirement. Regards, Carles Fenoy Barcelona Supercomputing Center On Thu, Dec 12, 2013 at 1
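
A one-line slurm.conf sketch of the suggestion (the 400 MB value is taken from the reply above):

    DefMemPerCPU=400    # jobs that request no memory get 400 MB per allocated core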

[slurm-dev] Re: slurmdbd is not initializing

2014-04-25 Thread Carlos Fenoy
Hi Lucas, It seems that your Slurm is not compiled with munge support. Have you compiled it yourself? Check the output of the configure command to verify that munge was detected correctly, or add the path to munge with the --with-munge parameter. Regards, Carles Fenoy Barcelona Supercomputing Cent
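
A hedged rebuild sketch; the munge prefix is an assumption for your system:

    ./configure --with-munge=/usr
    grep -i munge config.log    # confirm that configure detected munge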

[slurm-dev] Re: Node state is changing from idle to down

2014-04-26 Thread Carlos Fenoy
Hi Lucas, It seems that your nodes cannot reach your slurm controller. Do you have any firewall configured on the compute nodes? Try with telnet whether you can reach the controller port from a compute node. Regards, Carles Fenoy Barcelona Supercomputing Center On Fri, Apr 25, 2014 at 11:47 PM, L
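
A quick reachability test from a compute node, assuming the default slurmctld port 6817 (check SlurmctldPort in your slurm.conf):

    telnet controller-host 6817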

[slurm-dev] Re: "Requested node configuration is not available" when using -c

2014-09-10 Thread Carlos Fenoy
Hi Michal, What do you mean with "It's possible using LSF without MPI"? As far as I know, a process cannot use resources on different compute nodes. Only distributed programming models would allow the usage of different compute nodes, but at least there must be a task per node, unless you use somethi

[slurm-dev] Re: overcounting of SysV shared memory segments?

2014-09-23 Thread Carlos Fenoy
Hi, We had some users complaining about the same behaviour, but for resident memory. What we did was modify the accounting plugin to consider the proportional set size (PSS) instead of RSS. This way the shared memory is accounted only once, but proportionally for each process, so if 2 processes share
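
For illustration, one way to read a process's PSS on Linux; the kernel already divides each shared mapping by its number of sharers in /proc/PID/smaps:

    awk '/^Pss:/ {kb += $2} END {print kb " kB"}' /proc/$PID/smaps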

[slurm-dev] Re: Using control groups to restrict resource usage on BG/Q launch node?

2014-09-23 Thread Carlos Fenoy
Hi Chris, You can create a cgroup restricting only memory to the front-end memory amount minus the amount needed by the system, and attach the slurm processes to that cgroup. This way, if the OOM killer is invoked in the cgroup, it will only kill tasks belonging to the cgroup. Regards, Carles Fenoy On 22
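
A minimal cgroup-v1 sketch of the idea; the group name and the 8G limit are illustrative:

    mkdir /sys/fs/cgroup/memory/slurm_fe
    echo 8G > /sys/fs/cgroup/memory/slurm_fe/memory.limit_in_bytes
    echo <slurmd_pid> > /sys/fs/cgroup/memory/slurm_fe/tasks    # attach the slurm processes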

[slurm-dev] Re: job pending, not starting

2014-09-24 Thread Carlos Fenoy
There is no way, but if the reason is AssociationJobLimit it has nothing to do with QOS. Check the user's association limits. Carles Fenoy On Tue, Sep 23, 2014 at 11:16 PM, Marcin Stolarek wrote: > > > 2014-09-23 20:23 GMT+02:00 Eva Hocks : > >> How can I get a job started after it was p
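
A hedged example of inspecting those limits (the format field list is illustrative):

    sacctmgr show assoc where user=<username> format=User,Account,MaxJobs,GrpJobs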

[slurm-dev] Re: Documentation mismatch: man pages / html

2014-10-20 Thread Carlos Fenoy
Hi Uwe, Looks like you are missing an "s" at the end of the name. We have: SchedulerParameters=max_switch_wait=864000,... Regards, Carles Fenoy Barcelona Supercomputing Center On Mon, Oct 20, 2014 at 10:27 AM, Uwe Sauter wrote: > > Hi all, > > I'm trying to configure the scheduling parameter

[slurm-dev] Re: Error while loading shared libraries

2015-05-06 Thread Carlos Fenoy
Do you have MPI installed on all the cluster nodes? On Wed, May 6, 2015 at 3:34 PM, Uwe Sauter wrote: > > Check the file permissions for libmpichcxx.so.1.2 as well as the > permissions on the parent directories. Might be that you are not > allowed to access the folder structure as the user you'r
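
A quick check, assuming the path to the failing binary, to see which shared library does not resolve:

    ldd /path/to/mpi_program | grep 'not found'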

[slurm-dev] Re: Error while loading shared libraries

2015-05-06 Thread Carlos Fenoy
des is > not found. On the nodes openmpi-1.5.4-2 is installed. > > > > *From:* Carlos Fenoy [mailto:mini...@gmail.com] > *Sent:* Wednesday, 06 May 2015 10:11 > *To:* slurm-dev > *Subject:* [slurm-dev] Re: Error while loading shared libraries > > > > Do yo

[slurm-dev] Re: Error while loading shared libraries

2015-05-06 Thread Carlos Fenoy
how it's made? > > > -Original Message- > From: Carlos Fenoy > To: "slurm-dev" > Date: Wed, 06 May 2015 07:27:31 -0700 > Subject: [slurm-dev] Re: Error while loading shared libraries > > Then you can not execute a binary compiled with MPICH in the comp

[slurm-dev] Storage usage accounting

2015-05-12 Thread Carlos Fenoy
Hi all, I'm trying to account for the filesystem usage per job. How accurate is the data reported by sacct (fields MaxDiskRead and MaxDiskWrite)? Will this data also be accurate for parallel filesystems such as Lustre or GPFS? Regards, Carles -- -- Carles Fenoy
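
For reference, a query for those fields might look like this:

    sacct -j <jobid> --format=JobID,MaxDiskRead,MaxDiskWrite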

[slurm-dev] Re: Compute nodes not taking more then 1 jobs

2015-06-17 Thread Carlos Fenoy
As already mentioned in several threads, you have to specify DefMemPerCPU or DefMemPerNode. The default behavior of Slurm is to allocate all the memory. On Wed, Jun 17, 2015 at 10:19 AM, Saerda Halifu wrote: > > Hi, > > Thanks for your answer, SelectType is set to select/cons_res. > > scontrol

[slurm-dev] Re: Where does standard out go before its copied over to the control node

2015-07-14 Thread Carlos Fenoy
Hi Jordan, Check this question and answer on Stack Overflow: http://stackoverflow.com/questions/25170763/how-to-change-how-frequently-slurm-updates-the-output-file-stdout/25189364#25189364 Regards, Carlos On Tue, Jul 14, 2015 at 3:15 PM, Aaron Knister wrote: > > Hi Jordan, > > The answer is, we

[slurm-dev] Re: Need guidance to run multiple tasks per node with sbatch job array

2015-11-04 Thread Carlos Fenoy
works Regards, Carlos Fenoy On Wed, Nov 4, 2015 at 2:40 PM, charlie hemlock wrote: > Hi Trevor/Triveni, > Thank you for your responses. > > I have been able to use the sbatch --array to run all the jobs, *but only > a single job/task executes per node.* > In my example cluster

[slurm-dev] Re: Need guidance to run multiple tasks per node with sbatch job array

2015-11-04 Thread Carlos Fenoy
mutually exclusive. > > *NOTE: Enforcement of memory limits currently requires enabling of > accounting, which samples memory use on a periodic basis (data need not be > stored, just collected).* > Perhaps CR_Core would be a better option > and/or "enabling of accounting" requ

[slurm-dev] Re: Need guidance to run multiple tasks per node with sbatch job array

2015-11-04 Thread Carlos Fenoy
Trevor, If using cons_res there is no need to specify Shared=YES unless you want to share the same resources among different jobs. From the slurm.conf man page: YES Makes all resources in the partition available for sharing upon request by the job. Resources will only be over-subscribed when
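
A hedged job-array sketch; with cons_res and per-CPU requests like these, several array tasks can be packed onto the same node (script name and sizes are illustrative):

    #!/bin/bash
    #SBATCH --array=1-24
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=1024
    srun ./task.sh $SLURM_ARRAY_TASK_ID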

[slurm-dev] Re: Need guidance to run multiple tasks per node with sbatch job array

2015-11-04 Thread Carlos Fenoy
> # Scheduling > > # > > SelectType=select/cons_res > > SelectTypeParameters=*CR_CPU* > > > > # > > # COMPUTE NODES > > # > > NodeName=clunode[#] Procs = 12 ... > > On Wed, Nov 4, 2015 at 10:21 AM, Carlos Fenoy wrote: > >> Trevor, >>

[slurm-dev] Re: Cannot exclude hosts with --exclude

2015-11-24 Thread Carlos Fenoy
There seems to be a wrong character in the double dashes "--": the quoted command shows an en dash "–" instead of two ASCII hyphens. On Tue, 24 Nov 2015, 22:04 Zentz, Scott C. wrote: > Hello Everyone! > > I have a user who is trying to exclude some hosts from their job > submission and was using –exclude to accomplish this. He claims that he was > able to do t
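
For comparison, the correct form with two ASCII hyphens (host names illustrative):

    sbatch --exclude=node01,node02 job.sh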

[slurm-dev] Re: cgroups and memory accounting

2015-12-18 Thread Carlos Fenoy
Barbara, I don't think that is the issue here. The killer is the OOM killer, not Slurm, so Slurm is not accounting the amount of memory incorrectly; rather, it seems that cached memory is also accounted in the cgroup, and that is what is causing the OOM killer to kill gzip. Regards, Carlos On Fri, Dec 18, 2015 at
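
A hedged way to see how much page cache is charged to a job's cgroup; the hierarchy path is illustrative and depends on your cgroup setup:

    grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/slurm/uid_1000/job_1234/memory.stat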

[slurm-dev] Re: slum in the nodes not working

2015-12-21 Thread Carlos Fenoy
You should not start slurmctld on all the nodes, only on the head node of the cluster; on the compute nodes start slurmd, with "service slurm start". On Mon, Dec 21, 2015 at 6:27 PM, Fany Pagés Díaz wrote: > I had to turn off my cluster because of electricity problems, and now slurm is not > workin

[slurm-dev] Re: slum in the nodes not working

2015-12-22 Thread Carlos Fenoy
> to post the content of your slurm.conf file. > > > > Phil Eckert > > LLNL > > > > *From: *Fany Pagés Díaz > *Reply-To: *slurm-dev > *Date: *Monday, December 21, 2015 at 12:39 PM > *To: *slurm-dev > *Subject: *[slurm-dev] Re: slum in the nodes not w

[slurm-dev] Re: problem with start slurm

2016-01-04 Thread Carlos Fenoy
Check the date and time in all the nodes On Mon, 4 Jan 2016, 20:12 Fany Pagés Díaz wrote: > When I try to start slurm I have the next error in my logs of nodes > > > > error: slurm_receive_msg: Zero Bytes were transmitted or received > > error: slurm_receive_msg: Zero Bytes were transmitted or r
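
A hedged cross-node check (pdsh and the node list are assumptions; munge rejects messages when clocks drift too far apart):

    pdsh -w node[01-10] date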

[slurm-dev] Re: Accounting, SlurmDBD and XDMoD

2016-01-05 Thread Carlos Fenoy
Hi, Have you tried the ElasticSearch job completion plugin? With Kibana you can have nice charts for reporting and easily query the database to retrieve the required information. It does not support importing old jobs, but as soon as you set it up, you can start getting nice dashboards with Kibana
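
A minimal slurm.conf sketch of the plugin setup (the server URL is illustrative):

    JobCompType=jobcomp/elasticsearch
    JobCompLoc=http://elastic-host:9200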

[slurm-dev] Re: Accounting, SlurmDBD and XDMoD

2016-01-11 Thread Carlos Fenoy
Lehto wrote: > > - Original Message - > > From: "Christopher Samuel" > > To: "slurm-dev" > > Sent: Wednesday, 6 January, 2016 01:21:48 > > Subject: [slurm-dev] Re: Accounting, SlurmDBD and XDMoD > > > On 06/01/16 09:51, Carlos

[slurm-dev] Re: sreport/sacct: discrepancy between utilization and CPUTime

2016-01-22 Thread Carlos Fenoy
Hi Loris, Can you check when the job actually started and ended? It may be that the job spans two days, and that is the reason sreport is reporting less time. Regards, Carlos On Fri, Jan 22, 2016 at 9:31 AM, Loris Bennett wrote: > > Hi, > > Using version 15.08.4 I am looking at the va
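
A hedged query for the relevant timestamps:

    sacct -j <jobid> --format=JobID,Start,End,Elapsed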

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Carlos Fenoy
Hi Glen, I think your issue is with the MAINT flag in the reservation. Try removing that flag and try again. Regards, Carlos On Fri, Apr 15, 2016 at 4:09 PM, Glen MacLachlan wrote: > Dear all, > > Wrapping up a maintenance period and I want to run some test jobs before I > release the reservat
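
A hedged sketch of clearing the flag; the reservation name is illustrative and the Flags-= syntax may depend on your Slurm version:

    scontrol update ReservationName=maint_2016 Flags-=MAINT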

[slurm-dev] Re: How to get command of a running/pending job

2016-05-12 Thread Carlos Fenoy
Hi, You can see the script by running "scontrol show job JOBID". On Fri, 13 May 2016, 05:59 Husen R, wrote: > Dear all, > > Does slurm provide a feature to get the command that is being executed/will be > executed by running/pending jobs? > > The "squeue -O command" just gives the full path to sbatch

[slurm-dev] RE: Jobs are waiting for resources with some partitions

2016-05-13 Thread Carlos Fenoy
Hi, You cannot have multiple default partitions; only the last one is set as the default. If you check with scontrol show job you will see that only one partition is "requested" by your job. You can submit to multiple partitions, or use a job submit plugin to assign multiple partitions, if
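
For reference, a submission to several partitions at once might look like this (partition names illustrative):

    sbatch --partition=short,long job.sh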

[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Carlos Fenoy
On Tue, May 17, 2016 at 10:02 AM, Loris Bennett wrote: > > Benjamin Redling > writes: > > > On 2016-05-13 05:58, Husen R wrote: > >> Does slurm provide feature to get command that being executed/will be > >> executed by running/pending jobs ? > > > > scontrol show --detail job > > or > > scontr

[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Carlos Fenoy
If you do not specify the number of nodes, does it work as expected? On Tue, 12 Jul 2016, 09:25 Loris Bennett, wrote: > > Husen R writes: > > > Re: [slurm-dev] Re: number of processes in slurm job > > > > Hi, > > > > Thanks for your reply! > > > > I use this sbatch script > > > > #!/bin/bash >

[slurm-dev] Re: Process finished but jobs still "R" in squeue

2016-07-13 Thread Carlos Fenoy
Hi, I've seen the same with version 15.08.4: running array jobs with just the "hostname" command makes some jobs stay in the running state for several minutes. Regards, Carlos On Wed, Jul 13, 2016 at 3:38 PM, Marcin Stolarek wrote: > Hi guys, > > I have a cluster with a few nodes. Use

[slurm-dev] RE: SGI UV2000 with SLURM

2016-07-20 Thread Carlos Fenoy
Is the slurmd process running in the bootcpuset? On Wed, Jul 20, 2016 at 9:29 AM, Christopher Samuel wrote: > > On 20/07/16 17:13, A. Podstawka wrote: > > > no direct error message, but the jobs get started in the bootcpuset > > Do the processes show up in the tasks file for that cgroup? > > Is

[slurm-dev] RE: SGI UV2000 with SLURM

2016-07-20 Thread Carlos Fenoy
Try starting the slurmd in another cgroup, maybe one dedicated to slurm On Wed, Jul 20, 2016 at 11:26 AM, A. Podstawka wrote: > > Hi, > > > > Am 20.07.2016 um 09:28 schrieb Christopher Samuel: > >> On 20/07/16 17:13, A. Podstawka wrote: >> >> no direct error message, but the jobs get started in
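
A hedged sketch using the cpuset tooling common on these systems; the set name and CPU range are illustrative:

    cset set --set=/slurm --cpu=4-15                    # create a cpuset dedicated to slurm
    cset proc --move --pid=<slurmd_pid> --toset=/slurm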

[slurm-dev] Re: MaxNodes

2016-07-26 Thread Carlos Fenoy
Are all the jobs running on different nodes? What version of Slurm are you using? If you try to submit a job requesting 9 nodes, do you get an error? On Tue, Jul 26, 2016 at 12:24 PM, Luque, N.B. wrote: > Thanks a lot Kent Engström for your help. > I guess that what I wanted was GrpNodes, I set t

[slurm-dev] Re: Put node "idle" when node restart

2016-07-29 Thread Carlos Fenoy
Hi David, Check the ReturnToService parameter of the slurm.conf file. From the man page:

    ReturnToService
        Controls when a DOWN node will be returned to service. The default value is 0. Supported values include:
        0   A node will remain in the DOWN state until a system administrator explic
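
For comparison, the commonly used alternative value (from the same man page entry):

    ReturnToService=1    # a DOWN node becomes available again once it registers with a valid configuration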

[slurm-dev] Re: setup Slurm on Ubuntu 16.04 server

2016-08-24 Thread Carlos Fenoy
Have you added the cluster to the database? something like: "sacctmgr add cluster CLUSTER_NAME" On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel wrote: > > Hi, > > Thanks for your quick answer. > > In fact NodeName=DEFAULT is not the server's hostname, but matches all > subsequent nodes defined

[slurm-dev] Re: slurm-dev problem with new parameter in slurm.conf

2016-09-16 Thread Carlos Fenoy
Marco, It seems that your application is reading slurm.conf. If it is an OmpSs application, you may need to recompile the application or the runtime with your modifications to Slurm in order for it to understand the new parameter. Regards, Carlos On Fri, Sep 16, 2016 at 2:05 PM, Marco D'Ami

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Carlos Fenoy
Hi All, I'm working on a plugin that stores performance information for every task of every job in InfluxDB. This can be visualized easily with Grafana and provides information on CPU and memory usage as well as reads and writes to filesystems. This plugin uses the profiling capability of sl
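
A hedged slurm.conf sketch of enabling that kind of profiling (the plugin name and sampling interval are assumptions):

    AcctGatherProfileType=acct_gather_profile/influxdb
    JobAcctGatherFrequency=task=30    # sample task usage every 30 seconds
    # jobs then request sampling with, e.g., sbatch --profile=task ...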

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Carlos Fenoy
ults in real time as the job is running or only once it is > finished? > Thank you, > Igor > > >> On Mon, Sep 19, 2016 at 11:07 AM, Carlos Fenoy wrote: >> Hi All, >> >> I'm working on a plugin that stores performance information of every task of >

[slurm-dev] Re: mem-per-cpu is being ignored

2016-11-01 Thread Carlos Fenoy
Hi, You have set a MaxMemPerCPU lower than what you are asking for. Try changing that and check whether it solves the issue. Regards, Carlos On Tue, Nov 1, 2016 at 10:27 PM, Chad Cropper wrote: > SBATCH submissions are not utilizing the –mem-per-cpu option for > scheduling purposes. Also the Allo
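
A quick way to inspect the currently configured limit:

    scontrol show config | grep MaxMemPerCPU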

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Carlos Fenoy
Hi, Is the user defined on all the compute nodes? Does it have the same UID on all the hosts? Regards, Carlos On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor wrote: > Paddy, > Nope, it is exactly 8 characters: clwalton > > > Brian Andrus > ITACS/Research Computing > Naval Postgraduate
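
A hedged per-node check (the username is taken from the quoted thread):

    getent passwd clwalton    # compare the UID field across all nodes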

[slurm-dev] Re: SPANK and job variables/options

2017-01-31 Thread Carlos Fenoy
Hi, You should take a look at the job_submit plugin. That is the best place to check whether a job should be queued; it can be rejected otherwise. Regards, Carlos On Thu, Jan 26, 2017 at 1:03 PM, Dmitry Chirikov wrote: > Hi all, > > Playing with SPANK I faced an issue: > Seems I can't get
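
For reference, a hedged slurm.conf excerpt enabling such a plugin (the Lua variant; the accept/reject logic then lives in job_submit.lua):

    JobSubmitPlugins=job_submit/lua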

[slurm-dev] Re: Can't Specify Memory Constraint or Run Multiple Jobs per Node

2017-02-12 Thread Carlos Fenoy
Are you specifying a memory limit for your jobs? You haven't set a default limit per CPU, and Slurm will allocate all the memory of a node if nothing else is specified. Regards, Carlos Fenoy On Sun, 12 Feb 2017, 22:54 Travis DePrato, wrote: > Yep! Doing everything I can think of,
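
A hedged per-job request, in the absence of a DefMemPerCPU default in slurm.conf (the value is illustrative):

    sbatch --mem-per-cpu=2G job.sh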

[slurm-dev] Re: SLURM between two different networks

2017-06-14 Thread Carlos Fenoy
Hello Daniel, If you don't need to run interactive jobs (srun, salloc) there should not be any issue. You only need the client packages and the config files on the submit hosts. The submit hosts must be able to reach the slurmctld host, but they do not need to see the internal cluster. Regards,

[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Carlos Fenoy
Hi, any idea why the output of your job is not complete? There is nothing after "Copying files...". Does the /work/tants directory exist on all the nodes? The variable $SLURM_JOB_NAME is interpreted by bash, so srun only sees "srun -N2 -n2 rm -rf /work/tants/mpicopytest". Regards, Carlos On Mon,

[slurm-dev] Re: slurmdbd using too much memory - OOM killer finishes it

2017-09-22 Thread Carlos Fenoy
Hi Pablo, This issue may be related to yours: https://bugs.schedmd.com/show_bug.cgi?id=3260 Apparently there is a way to disable some internal slurmdbd caching by compiling with the flag --enable-memory-leak-debug. Regards, Carlos On Fri, Sep 22, 2017 at 2:00 PM, Pablo Escobar wrote: > Hi, > > We ha
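
A hedged rebuild sketch with the flag mentioned in the bug report:

    ./configure --enable-memory-leak-debug
    make && make install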