On 5 November 2017 at 11:08, ايمان <435204...@student.ksu.edu.sa> wrote:
> Good morning;
>
>
> I want to run parallel Java code on more than one node, but it
> executes on only one node.
>
> How can I run it on more than one node?
>
Good morning Eman,
Without more details, it's hard to know which guide you are referring to - do
you need a guide for SLURM or for pj2? Both can be found quickly with Google.
Cheers
L.
>
> --
> *From:* Lachlan Musicman
> *Sent:* 10/Safar/1439 02:45 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Slurm an
On 30 October 2017 at 10:04, ايمان <435204...@student.ksu.edu.sa> wrote:
> I am writing a parallel program that uses the parallel Java library pj2.
>
> I want to run it using Slurm, but I do not know whether Slurm supports this
> library.
>
> And what are the correct commands to run Java on a cluster?
>
> T
On 26 October 2017 at 13:27, Alex Chekholko wrote:
> Why can't you just do
>
> for fasta_file in `ls /path/to/fasta_files`; do sbatch
> --output=$fasta_file.out --error=$fasta_file.err myscript.sbatch
> $fasta_file; done
>
Because it was staring me in the face and I ignored it. Thank you.
cheers
Hi All,
I've now been asked twice in two days if there is any way to intelligently
name slurm output files.
Sometimes our users will do something like
for fasta_file in `ls /path/to/fasta_files`; do sbatch myscript.sbatch
$fasta_file; done
They would like their output and error files to be like
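A minimal sketch of one way to get per-input names, assuming sbatch's filename
patterns (%j expands to the job ID) and an illustrative input path:

for fasta_file in /path/to/fasta_files/*; do
    name=$(basename "$fasta_file")
    # name the job after the input file and build output/error names from it
    sbatch --job-name="$name" \
           --output="${name}.%j.out" \
           --error="${name}.%j.err" \
           myscript.sbatch "$fasta_file"
done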
On 19 October 2017 at 20:37, Chris Samuel wrote:
>
> On Thursday, 19 October 2017 7:41:37 PM AEDT Nadav Toledo wrote:
>
> > running : id -u domain_name\\username , does return its uid
>
> So your system is not finding users as just "username", but instead only as
> domain_name\\username which is
On 9 October 2017 at 22:06, cyberseawolf . wrote:
> Hello everybody,
> I'm a young system administrator that is moving from Torque/MAUI to Slurm.
> I set up a pretty peculiar resource management in the previous queue system
> and I would like to port it in the new one.
>
> - I have the following
9857
> On Thu, Oct 5, 2017 at 3:34 PM, Lachlan Musicman
> wrote:
>
>> On 6 October 2017 at 07:35, Doug Meyer wrote:
>>
>>> Within the cluster we have partitions that are shared and some that are
>>> dedicated to specific groups. Is there a way to config
On 6 October 2017 at 07:29, Jacob Chappell wrote:
> Is there a way (via scontrol for example) to disable accounting/qos policy
> enforcement for a single job? We'd like to be able to allow a job to go
> ahead and run, even though it may violate policy (MaxTRES) on a
> case-by-case basis.
>
>
Yes
On 6 October 2017 at 07:35, Doug Meyer wrote:
> Within the cluster we have partitions that are shared and some that are
> dedicated to specific groups. Is there a way to configure slurm so the
> private use partitions do not impact the priority system nor are they
> counted against the account c
On 24 September 2017 at 16:20, Daniel Letai wrote:
> Hello,
>
> B. We have active directory(AD) in our faculty, and We prefer manage
> users/groups from there , is it possible? any guide available somewhere?
>
>
> Search this mailing list, this question pops up every now and again, there
> is no
On 21 September 2017 at 17:55, Fabrice Nininahazwe
wrote:
>
> Dear developer,
>
> I have encountered some of the nodes that are down, I can ping to node
> n003 and not node n001, I have run scontrol update to change the state with
> no success below is the result after running scontrol show nodes
On 18 September 2017 at 11:13, Christopher Samuel
wrote:
>
> On 14/09/17 16:04, Lachlan Musicman wrote:
>
> > It's worth noting that before this change cgroups couldn't get down to
> > the thread level. We would only consume at the core level - ie, all jobs
>
On 15 September 2017 at 17:09, Dr. Thomas Orgis wrote:
> Hi Zhang,
>
> the default behaviour of slurm is to try to keep the environment
> variables from the submit node. I do not like that and in our
> installation, we urge users to always specify
>
> #SBATCH --export=NONE
>
> to avoid that (or o
> /home/user1/.bashrc has already defined
> many variables; I think these are default variables. Currently I also need
> to source them every time before using them, which is not reasonable in my view.
>
> Is there a way to configure slurm to use the running node's env, not the
> submit node's?
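A minimal submit-script sketch along those lines, assuming the job should pick
up the compute node's own environment rather than the submit node's (the
source line is illustrative and site-specific):

#!/bin/bash
#SBATCH --export=NONE        # do not propagate the submit node's environment
#SBATCH --job-name=env-check
source /etc/profile          # load the execution node's own defaults
printenv | sort > "env_on_$(hostname).txt"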
On 14 September 2017 at 19:41, Chaofeng Zhang wrote:
> On node A, I submit a job file using the sbatch command; the job runs on
> node B. You will find that the output is not the env of node B, it is
> the env of node A.
>
>
>
> #!/bin/bash
>
> #SBATCH --job-name=mnist10
>
> #SBATCH --pa
On 14 September 2017 at 11:06, Lachlan Musicman wrote:
>
> I've just implemented the change from
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 State=UNKNOWN
>
> to
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1
> CoresPe
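For reference, a hedged guess at what the completed line would look like for
the hardware described elsewhere in this thread (1 socket, 4 cores per socket,
2 threads per core = 8 CPUs):

NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN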
On 14 September 2017 at 11:10, Christopher Samuel
wrote:
>
> On 14/09/17 11:07, Lachlan Musicman wrote:
>
> > Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw)
> > SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
>
> Hmm,
On 13 September 2017 at 07:21, Patrick Goetz wrote:
>
> On 09/12/2017 04:21 AM, Gennaro Oliva wrote:
>
>>
>> On Mon, Sep 11, 2017 at 04:51:04PM -0600, Lachlan Musicman wrote:
>>
>>> "Note also if you are running with more than 1 thread per core and
On 13 September 2017 at 10:36, Christopher Samuel
wrote:
>
> On 13/09/17 07:22, Patrick Goetz wrote:
>
> > All I have to say to this is: um, what?
>
> My take has always been that ThreadsPerCore is really for HPC workloads
> where you've decided not to disable HT full stop but want to allocate
>
On 11 September 2017 at 20:11, Gennaro Oliva wrote:
>
> Hi Patrick,
>
> On Fri, Sep 08, 2017 at 01:17:33PM -0600, Patrick Goetz wrote:
> > After some
> > discussion on this list, someone convinced me that setting
> > "ThreadsPerCore=2" informs Slurm that each CPU actually has 8 cores, so I
> > se
Hola,
I was under the impression that environments travelled with slurm when
sbatch was executed - so any node could execute any code as if it was the
env I executed from or built within my sbatch scripts.
We use Environment Modules and this has all worked just great. Very pleased.
Recently I le
On 16 August 2017 at 00:14, Will French wrote:
> > On Aug 15, 2017, at 5:29 AM, Chris Samuel wrote:
> >
> >
> > On Tuesday, 15 August 2017 4:34:55 PM AEST John Hearns wrote:
> >
> >> For the /proc/self you need to start an interactive job under Slurm.
> >
> > You can actually use srun to join an
On 15 August 2017 at 11:38, Christopher Samuel
wrote:
> On 15/08/17 09:41, Lachlan Musicman wrote:
>
> > I guess I'm not 100% sure what I'm looking for, but I do see that there
> > is a
> >
> > 1:name=systemd:/user.slice/user-0.slice/session-373.scope
On 15 August 2017 at 07:41, Robbert Eggermont
wrote:
>
> On 14-08-17 07:50, Lachlan Musicman wrote:
>
>> We have TaskPlugin=task/cgroup and when testing I noticed that the # of
>> threads/cpus being allocated was rounded up to the nearest even. I presume
>> this was du
On 14 August 2017 at 16:22, John Hearns wrote:
> Lachlan, forgive me if I am teaching granny to suck eggs...
> I have recently been workign with cgroups.
> If you run an interactive job what do you see when cat /proc/self/cgroups
> Also have you explored in /sys/fs/cgroups and checked what res
Hola,
Slurm is complicated software, and sometimes the docs can be dense - I'm
looking for some clarification please.
We have a system set up with Threads as CPUs. 1 socket, 4 cores, 2 threads
= 8 cpus
I would like to implement CGroups because some of our users are quite happy
to utilise all thr
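A minimal sketch of the settings involved, assuming slurm.conf and cgroup.conf
roughly as below (values illustrative; check the cgroup.conf man page for your
version):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes        # confine tasks to the cores they were allocated
ConstrainRAMSpace=yes     # confine tasks to the memory they requested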
Hola,
Two things: in the documentation for slurm.conf the reference to ProcTrack
= proctrack/cgroup tells people to see `man cgroup.conf` for more details.
That man page holds no details re proctrack.
https://slurm.schedmd.com/slurm.conf.html
The details in question are on https://slurm.schedmd.
Yep, thanks Chris. I went with regular reboot and have now successfully used
scontrol reboot ASAP
Very handy!
L.
--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared co
In the Name of the Wee Man, so 'reboot' is now a 'legacy tool'
> https://access.redhat.com/solutions/1580343
>
> Jeez... Look HPC compute node - I'm in charge, gottit? Yeah, fight back
> all you like with systemd, but I can pull the power plug.
> Let's se
I've just been asked about implementing a "drain and reboot" for
nodes/partitions.
In slurm.conf, there is a RebootProgram - does this need to be a direct
link to a bin or can it be a command?
RebootProgram=/usr/sbin/reboot
or
RebootProgram='systemctl disable reboot-guard; reboot'
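If RebootProgram has to point at a single executable, one hedged workaround is
a small wrapper script (path and contents illustrative):

#!/bin/bash
# saved as /usr/local/sbin/slurm-reboot.sh and marked executable
systemctl disable reboot-guard
exec /usr/sbin/reboot

with slurm.conf then pointing at it:

RebootProgram=/usr/local/sbin/slurm-reboot.sh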
Cheers
L.
291
> Yuseong-gu, Daejeon
> Republic of Korea 305-701
> Tel. +82-10-2075-6911 <+82%2010-2075-6911>
>
> 2017-08-02 13:05 GMT+09:00 Lachlan Musicman :
>
>> [root@n6 /]# si
>>>
>>> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK
>
> [root@n6 /]# si
>
> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK
>   TIMELIMIT  AVAIL_FEATURES  NODELIST
>
> debug*     6      0/6/0/6         1:4:2  7785    113264
>   infinite   (null)          c[1-6]
>
> (for a moment)
>
> [root@n6 /]# si
>
> PARTITION
Sumin,
The error message is saying that the node is down.
When you say "works with sinfo", you need to show us what that means -
sinfo is a command that interrogates the state of nodes, whereas srun sends
commands *to* nodes. So sinfo is meant to work - even if the nodes are
down. It is the softw
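To see why Slurm thinks a node is down, something like the following usually
helps (node name taken from your message):

sinfo -N -l                                        # per-node state
scontrol show node n001 | grep -E 'State|Reason'   # Reason says why it is down
systemctl status slurmd                            # run on the node itself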
On 28 July 2017 at 14:30, 허웅 wrote:
> I modified my slurm.conf like :
>
>
>
> NodeName=GO[1-5]
>
>
>
> PartitionName=party Default=yes Nodes=GO[1-5]
>
>
>
> and I restarted slurmctld and slurmd services.
>
>
>
> [root@GO1]~# systemctl start slurmctld
>
> [root@GO1]~# systemctl status slurmctld
>
ty* idle
>
> sgo4      1   party*   idle
>
> sgo5      1   party*   idle
>
> [root@GO1]~# sn
> Fri Jul 28 09:55:53 2017
>    HOSTNAMES
> GO1
> GO2
> GO3
> GO4
>
way out is through, and the only way through is
together. "
*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
On 28 July 2017 at 10:47, Lachlan Musicman wrote:
> I think it's because hostname is so undemanding.
>
> How many CPUs does each host have?
I think it's because hostname is so undemanding.
How many CPUs does each host have?
You may need to use ((number of cpus per host) + 1) to see action on
another node.
You can try using stress-ng to test higher loads?
https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
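For example, something along these lines should spread load across two nodes
(the stress-ng parameters are illustrative):

# ask for 2 nodes and more tasks than one host has CPUs, then burn CPU for a minute
srun --nodes=2 --ntasks=16 stress-ng --cpu 1 --timeout 60s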
cheers
; WDIR=$PWD
> #SBATCH -t 1:00
>
> the -t 1:00 will get ignored by sbatch
>
>
> On Thu, 29 Jun 2017, Lachlan Musicman wrote:
>
>> We have a 40min default time on our main partition.
>>
>> We are finding that researchers that use
>>
>> #SBATCH -
We have a 40min default time on our main partition.
We are finding that researchers that use
#SBATCH --time=0-07:00:00
are still having their jobs terminated at 40 minutes.
Using slurm 17.2.04 on Centos 7.3
Has anyone else experienced this?
Cheers
L.
--
"Mission Statement: To provide ho
We did it in place, worked as noted on the tin. It was less painful than I
expected. TBH, your procedures are admirable, but you shouldn't worry -
it's a relatively smooth process.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective powe
On 9 June 2017 at 15:26, Lachlan Musicman wrote:
> On 9 June 2017 at 14:53, Nicholas C Santucci wrote:
>
>> Those first two of your Gone list I noticed when 17.02.0 was released on
>> Feb 23.
>> A patch was released on Feb 27 due to the change in dependencies.
>>
wait for it's accounting information to be available and include
>> that
>>
>> On 02/27/2017 01:35 PM, dani wrote:
>>
>> Seems like no obsoletes was set on slurm-contribs, so yum complains of
>> conflicts with slurm-sjobs and friends.
>>
>>
Hola,
I followed the instructions for building the 16.05.0 bz2 and installed the
resulting rpms as follows:
Each node got:
slurm.x86_64
slurm-devel.x86_64
slurm-munge.x86_64
slurm-perlapi.x86_64
slurm-plugins.x86_64
slurm-sjobexit.x86_64
slurm-sjstat.x86_64
slurm-torque.x86_64
The head n
I'm pretty sure you need the MDCS.
Having said that, I know people run GNU Octave on clusters, can't speak to
it though.
R works on a cluster quite nicely.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collectiv
Hi and welcome to SLURM.
It is late and I am tired, but:
1. SLURM is a cluster
2. front end will run the slurmctld service. Compute nodes will run the slurmd
service. How that is divided is up to you.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
t
Hola,
I'd like some advice on QOS for clusters.
Currently, we have a broad 90 CPU limit (MaxTRESPU) across all associations.
I have a special partition on which I want no limits to apply, it has 80
CPUs. So, I have created a special QOS, named after the partition, with a
MaxTRESPU of 180 CPUs.
Th
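For reference, a hedged sketch of how such a QOS might be created and tied to
the partition (names and node list are illustrative):

sacctmgr add qos specialq
sacctmgr modify qos specialq set MaxTRESPerUser=cpu=180
# in slurm.conf, attach the QOS to the partition:
PartitionName=special Nodes=... QOS=specialq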
On 24 May 2017 at 13:18, Christopher Samuel wrote:
>
> Hiya,
>
> On 24/05/17 13:10, Lachlan Musicman wrote:
>
> > Occasionally I'll see a bunch of processes "running" (sleeping) on a
> > node well after the job they are associated with has finished.
>
Hola,
Occasionally I'll see a bunch of processes "running" (sleeping) on a node
well after the job they are associated with has finished.
How does this happen - does slurm not make sure all processes spawned by a
job have finished at completion?
cheers
L.
--
"Mission Statement: To provide
One user has recently started to see their jobs killed after roughly 40
minutes, even though they have asked for four hours.
40 minutes is the partition's default, but this user has
#SBATCH --time=04:00:00
in their sbatch file?
I have found this: https://bugs.schedmd.com/show_bug.cgi?id=2353 and w
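A couple of hedged checks that usually narrow this down (job ID and partition
name are illustrative):

scontrol show job 12345 | grep -i timelimit                     # what limit did the job actually get?
scontrol show partition main | grep -iE 'defaulttime|maxtime'   # partition defaults and ceiling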
- Patrice Cullors, *Black Lives Matter founder*
On 23 May 2017 at 09:43, Lachlan Musicman wrote:
> Hola,
>
> One of my users has been given the PartitionTimeLimit reason for his jobs
> not running.
>
> He has requested 20 days for the job, but I don't remember setting a time
&
Hola,
One of my users has been given the PartitionTimeLimit reason for his jobs
not running.
He has requested 20 days for the job, but I don't remember setting a time
limit on any partition?
I do recall setting a default time, but not a time limit.
The docs claim:
https://slurm.schedmd.com/squ
On 11 May 2017 at 08:33, Batsirai Mabvakure wrote:
> Is there a command i can execute for slurm to update automatically without
> having to download it again?
>
>
Not really. Ubuntu packages SLURM IIRC, but you would need to wait until
they do their packaging and push the new version. Currently
rooted in
grief and rage but pointed towards vision and dreams."
- Patrice Cullors, *Black Lives Matter founder*
On 10 May 2017 at 10:57, Lachlan Musicman wrote:
> Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
> session with
>
> srun -w papr-expanded01 --p
Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
session with
srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash
--partition=expanded
srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=ex
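One common gotcha worth ruling out (hedged): srun stops parsing its own options
at the command, so anything placed after /bin/bash is handed to bash rather
than to srun. Keeping every srun option before the command avoids that, e.g.:

srun -w papr-expanded01 --partition=expanded --mem=8192 -t 06:00 --pty /bin/bash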
On 11 April 2017 at 02:36, Raymond Wan wrote:
>
> For SLURM to work, I understand from web pages such as
> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
> across nodes. Based on this web page, it seems sharing /etc/passwd
> between nodes appears sufficient. The word LDAP
Batsirai Mabvakure
wrote:
> Thank you so much for the reply. Is there another way I can configure the
> nodes other than using mpich that allows me only to update the slurm.conf
> file and not install slurm on every new node every time I scale up?
>
>
> On 2017/02/18, 01:21, "La
You will need to install slurm on the new nodes, as it is installed on the
other nodes.
Since they all must have the same conf file, grab the canonical conf file
and put it on all of the machines - making sure that the new nodes are
listed in that conf file.
Restart the head node (the node runnin
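A hedged sketch of the steps, assuming RPM-based nodes and illustrative
hostnames/paths:

# on each new node: install the same slurm packages and munge key as the others
yum install slurm slurm-munge slurm-plugins          # package names vary by build
scp headnode:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl enable --now slurmd

# on the head node, after the new nodes are listed in slurm.conf:
systemctl restart slurmctld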
I don't know if you can split it at a GRES level, but I would put the node
in the two partitions, and then use QOS to only allow one partition access
to the single card and the other partition 3 cards.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
> > In this slurmctld there are a total of 226 nodes, in several different
> > partitions. The cluster of 64 is the only one where I see this
> > happening. Unless that number of nodes is pushing the limit for a single
> > slurmctld (which I doubt) I'd be inclined
On 17 February 2017 at 03:02, Baker D.J. wrote:
> Hello,
>
>
>
> Thank you for the reply. There are two accounts on this cluster. What I
> was primarily trying to do was define a default QOS with a partition. My
> idea was to use sacctmgr to create an association between the partition,
> the QOS
On 16 February 2017 at 09:36, Christopher Samuel
wrote:
>
> We also have all our partitions (other than our debug one reserved for
> sysadmins) marked as "State=DOWN" in slurm.conf so that they won't start
> jobs when slurmctld is brought back up again.
>
Chris,
What's the reasoning behind this
If you are only in one account, you don't need to list it.
What version of slurm are you using? Someone else mentioned needing to
restart slurmctld for users to stick, which is not something I've
experienced, but try that maybe?
I am presuming that your slurm.conf is set up correctly for accounts?
If you are looking to suspend and resume jobs, use scontrol:
scontrol suspend
scontrol resume
https://slurm.schedmd.com/scontrol.html
The docs you are pointing to look more like taking nodes offline in times
of low usage?
cheers
L.
--
The most dangerous phrase in the language is, "We've
hange to all nodes, restarting slurmctld then running scontrol
reconfigure?
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
> On Sat, Feb 11, 2017 at 3:26 AM Lachlan Musicman
> wrote:
>
>> 1. As EV no
1. As EV noted, to get Memory as a consumable resource, you will need to
add it to the line that says CR_CPU - change to CR_CPU_Memory
https://slurm.schedmd.com/slurm.conf.html
2. That's because of the CR_CPU combined with cons_res. Change to CR_CORE
for per core or CR_SOCKET for per socket. For d
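For reference, the relevant slurm.conf lines might look like this (hedged; pick
the CR_* value that matches how you want resources carved up):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory    # or CR_CPU_Memory / CR_Socket_Memory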
There's always the --dependency flag for sbatch. So yes, depending on what
you wanted, you could line up another sbatch after the first if you liked.
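For example (script names are illustrative; --parsable just prints the job ID):

first=$(sbatch --parsable first_step.sbatch)
sbatch --dependency=afterok:${first} second_step.sbatch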
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 1 February 2017 at 08:38, TO_We
trivial questions: does the node have the correct time wrt the head node? And
is the node correctly configured in slurm.conf? (# of cpus, amount of memory, etc)
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 1 February 2017 at 08:03, E V wrote:
Check they all have the same time, or ntpd against the same server. I
found that the nodes that kept going down had their time out of sync.
Cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 25 January 2017 at 05:49, Allan Streib wr
_clu+
>
>
>
> one thing that I did notice when I add the user I see this error in the
> slurmctld log
>
>
>
> [2017-01-23T16:47:34.351] error: Update Association request from non-super
> user uid=450
>
>
>
> UID 450 happens to be the slurm user
>
Interesting. To the best of my knowledge, if you are using Accounting, all
users actually need to be in an association - ie having a user account is
insufficient.
An Association is a tuple consisting of: cluster, user, account and
(optional) partition.
Is that the problem?
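A hedged example of creating such an association (cluster, account and user
names are illustrative):

sacctmgr add cluster mycluster
sacctmgr add account projectA cluster=mycluster
sacctmgr add user alice account=projectA      # this is what creates the association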
cheers
L.
--
The
We use the SPANK plugin found here
https://github.com/hpc2n/spank-private-tmp
and find it works very well.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 21 January 2017 at 03:15, John Hearns wrote:
> As I remember, in SGE and in PbsPr
Not 100% sure what you are asking? The mail options are available from
within an sbatch script by using the commands you mention.
They can also be passed directly to slurm when invoking the commands
sbatch --mail-type=ALL --mail-user=e...@mail.com
Are you asking if there is a default "always ma
Will,
I believe you do. While they aren't necessary in your case, I believe the
software has been built for maximum extensibility, and as such there needs
to be:
at least one cluster
at least one account
at least one user
and an association is the "grouping" of those three. The relevant part of
Hi David,
I dealt with this recently (see
https://groups.google.com/forum/#!topic/slurm-devel/DKcFng8c1zE for
instance )
In the end we went with this solution that has worked well for us:
https://slurm.schedmd.com/SUG14/private_tmp.pdf
which describes this plugin:
https://github.com/hpc2n/span
On 8 December 2016 at 07:54, Mark R. Piercy wrote:
>
> Is it ever possible to submit jobs based on a users org affiliation? So
> if a user is in org (PI) "smith" then their jobs would automatically be
> sent to a particular partition. So no need to use the -p option in
> sbatch/srun job.
>
M
Hi,
I've had a request from a user about the email system in SLURM. Basically,
there's a team collaboration and the request was:
is there an sbatch command such that two groups will get different sets of
emails.
Group 1: only get the email if the jobs FAIL
Group 2: get Begin, End and Fail
Cheer
Hey Devs,
The new design on the schedmd site is pretty - thanks!
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
> virtualenvs; if that is the case, the switch to a container with rkt seems
> "normal" instead of the more intrusive one-almighty-process-to-rule-everything
> that docker had the last time I checked; it's probably better now.
>
> Saludos.
> Jean
>
> On Tue, Nov 15, 2016 a
Hola,
We were looking for the ability to make jobs perfectly reproducible - while
the system is set up with environment modules with the increasing number of
package management tools - pip/conda; npm; CRAN/Bioconductor - and people
building increasingly more complex software stacks, our users have
On 9 November 2016 at 09:36, Christopher Samuel
wrote:
>
> But /tmp is almost certainly the second worst place (after /dev/shm).
>
I don't know Chris, I think that /dev/null would rate tbh. :)
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- G
Arg, I see now (hit send too soon). My parsing of the man page was wrong.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 8 November 2016 at 11:39, Lachlan Musicman wrote:
> Priority: Minor
>
> I notice
Priority: Minor
I notice that this command works well:
sinfo -Nle -o '%C %t'
Tue Nov 8 11:38:09 2016
CPUS(A/I/O/T) STATE
40/0/0/40 alloc
38/2/0/40 mix
36/4/0/40 mix
36/4/0/40 mix
6/34/0/40 mix
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idl
Peixin,
Again, depends on your OS and deployment methods, but essentially:
In slurm.conf set
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurm-ctld.log
SlurmdLogFile=/var/log/slur
On 8 November 2016 at 07:11, Peixin Qiao wrote:
> Hi,
>
> I install munge and restart my computer, then munge stopped work and
> restarting munge didn't work. It says:
>
> munged: Error: Failed to check pidfile dir "/var/run/munge": cannot
> canonicalize "/var/run/munge": No such file or director
I think it should. Can you send through your slurm.conf?
Also, the logs usually explicitly say why slurmctld/slurmd don't start, and
the best way to judge if slurm is running is with systemd:
systemctl status slurmctld
systemctl status slurmd
cheers
L.
--
The most dangerous phrase in the l
On 28 October 2016 at 09:20, Christopher Samuel
wrote:
>
> On 28/10/16 08:44, Lachlan Musicman wrote:
>
> > So I checked the system, noticed that one node was drained, resumed it.
> > Then I tried both
> >
> > scontrol requeue 230591
> > scontrol resume 2
Morning,
Yesterday we had some internal network issues that caused havoc on our
system. By the end of the day everything was ok on the whole.
This morning I came in to see one job on the queue (which was otherwise
relatively quiet) with the error message/Nodelist Reason (launch failed
requeued he
On 25 October 2016 at 09:17, Tuo Chen Peng wrote:
> Oh ok thanks for pointing this out.
>
> I thought ‘scontrol update’ command is for letting slurmctld to pick up
> any change in slurm.conf.
>
> But after reading the manual again, it seems this command is instead to
> change the setting at runti
On 25 October 2016 at 08:42, Tuo Chen Peng wrote:
> Hello all,
>
> This is my first post in the mailing list - nice to join the community!
>
Welcome!
>
>
> I have a general question regarding slurm partition change:
>
> If I move one node from one partition to the other, will it cause any
> im
On 21 October 2016 at 12:39, Christopher Samuel
wrote:
>
> On 21/10/16 12:29, Andrew Elwell wrote:
>
> > When running sreport (both 14.11 and 16.05) I'm seeing "duplicate"
> > user info with different timings. Can someone say what's being added
> > up separately here - it seems to be summing some
I've had consistent success with the documented system - "rpmbuild
slurm-.tgz" then yum installing the resulting files, using 15.x,
16.05 and 17.02.
Have on occasion needed to recompile - hdf5 support and for non main line
plugins, but otherwise it's been pretty easy.
Will happily support/debug y
Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.
What is the result of
scontrol show config | grep Mem
?
Because you have set your
SelectTypeParameters=CR_Core_Memory
Memory will cause jobs to fail if they go over the default memory limit.
The SLURM head will kill j
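For reference, the pieces that interact here (values illustrative): a
default/maximum per-CPU memory in slurm.conf, and the per-job request in the
batch script:

# slurm.conf
DefMemPerCPU=2048      # MB assumed per CPU when a job does not request memory
MaxMemPerCPU=8192      # hard per-CPU ceiling

# in the job script
#SBATCH --mem=16G      # request what the job actually needs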
partition
- jobs running on that partition will continue to do so
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 12 October 2016 at 10:35, Lachlan Musicman wrote:
> Hola,
>
> For reasons, our IT team ne
Hola,
For reasons, our IT team needs some downtime on our authentication server
(FreeIPA/sssd).
We would like to minimize the disruption, but also not lose any work.
The current plan is for the nodes to be set to DRAIN on Friday afternoon
and on Monday morning we will suspend any running jobs, m
Check against the installed libs? check *-devel? Otherwise I'm not 100%
sure - unless the rpmbuild folder with all files still exists and there's
something in there?
FWIW, it's relatively easy to install all the libs that SLURM needs without
causing too many problems. The hardest I've found so far
Hola,
Just built the rpms as per the installation docs.
Noted that there were three new rpms:
slurm-openlava-16.05.5-1.el7.centos.x86_64.rpm
slurm-pam_slurm-16.05.5-1.el7.centos.x86_64.rpm
slurm-seff-16.05.5-1.el7.centos.x86_64.rpm
Is that due to a more sophisticated build machine or due to a
Jose,
Do all the nodes have access to either a shared /usr/lib64/slurm or do they
each have their own? And is there a file in that dir (on each machine)
called select_cons_res.so?
Also, when changing slurm.conf here's a quick and easy workflow:
1. change slurm.conf
2. deploy to all machines in c
We've always done it this
way."
- Grace Hopper
>
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> <http://www.nersc.gov>
> dmjacob...@lbl.gov
>
> - __o
> -- _ '\
wrote:
>
> On 30/08/16 12:39, Lachlan Musicman wrote:
>
> > Oh! Thanks.
> >
> > I presume that includes sruns that are in an sbatch file.
>
> Yup, that's right.
>
> cheers!
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
>