On 5 November 2017 at 11:08, ايمان <435204...@student.ksu.edu.sa> wrote:
> Good morning;
>
>
> I want to run parallel Java code on more than one node, but it
> executes on only one node.
>
> How can I run it on more than one node?
>
Good morning Eman,
Without more details, it's hard to know which guide you are referring to - do
you need a guide for SLURM or for pj2? Both can be found quickly with Google.
Cheers
L.
>
> --
> *From:* Lachlan Musicman
> *Sent:* 10/Safar/1439 02:45 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Slurm an
On 30 October 2017 at 10:04, ايمان <435204...@student.ksu.edu.sa> wrote:
> I am writing a parallel program that uses the parallel Java library pj2.
>
> I want to run it using Slurm, but I do not know whether Slurm supports this
> library.
>
> And what are the correct commands to run Java on a cluster?
>
> T
On 26 October 2017 at 13:27, Alex Chekholko wrote:
> Why can't you just do
>
> for fasta_file in `ls /path/to/fasta_files`; do sbatch
> --output=$fasta_file.out --error=$fasta_file.err myscript.sbatch
> $fasta_file; done
>
Because it was staring me in the face and I ignored it. Thank you.
cheers
Hi All,
I've now been asked twice in two days if there is any way to intelligently
name slurm output files.
Sometimes our users will do something like
for fasta_file in `ls /path/to/fasta_files`; do sbatch myscript.sbatch
$fasta_file; done
They would like their output and error files to be like
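A minimal sketch of one way to get per-input names, assuming sbatch's filename
patterns (%j expands to the job ID) and an illustrative input path:

for fasta_file in /path/to/fasta_files/*; do
    name=$(basename "$fasta_file")
    # name the job after the input file and build output/error names from it
    sbatch --job-name="$name" \
           --output="${name}.%j.out" \
           --error="${name}.%j.err" \
           myscript.sbatch "$fasta_file"
done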
On 19 October 2017 at 20:37, Chris Samuel wrote:
>
> On Thursday, 19 October 2017 7:41:37 PM AEDT Nadav Toledo wrote:
>
> > running : id -u domain_name\\username , does return its uid
>
> So your system is not finding users as just "username", but instead only as
> domain_name\\username which is
On 9 October 2017 at 22:06, cyberseawolf . wrote:
> Hello everybody,
> I'm a young system administrator that is moving from Torque/MAUI to Slurm.
> I set up a pretty peculiar resource management in the previous queue system
> and I would like to port it in the new one.
>
> - I have the following
9857
> On Thu, Oct 5, 2017 at 3:34 PM, Lachlan Musicman
> wrote:
>
>> On 6 October 2017 at 07:35, Doug Meyer wrote:
>>
>>> Within the cluster we have partitions that are shared and some that are
>>> dedicated to specific groups. Is there a way to config
On 6 October 2017 at 07:29, Jacob Chappell wrote:
> Is there a way (via scontrol for example) to disable accounting/qos policy
> enforcement for a single job? We'd like to be able to allow a job to go
> ahead and run, even though it may violate policy (MaxTRES) on a
> case-by-case basis.
>
>
Yes
On 6 October 2017 at 07:35, Doug Meyer wrote:
> Within the cluster we have partitions that are shared and some that are
> dedicated to specific groups. Is there a way to configure slurm so the
> private use partitions do not impact the priority system nor are they
> counted against the account c
On 24 September 2017 at 16:20, Daniel Letai wrote:
> Hello,
>
> B. We have active directory(AD) in our faculty, and We prefer manage
> users/groups from there , is it possible? any guide available somewhere?
>
>
> Search this mailing list, this question pops up every now and again, there
> is no
On 21 September 2017 at 17:55, Fabrice Nininahazwe
wrote:
>
> Dear developer,
>
> I have encountered some of the nodes that are down, I can ping to node
> n003 and not node n001, I have run scontrol update to change the state with
> no success below is the result after running scontrol show nodes
On 18 September 2017 at 11:13, Christopher Samuel
wrote:
>
> On 14/09/17 16:04, Lachlan Musicman wrote:
>
> > It's worth noting that before this change cgroups couldn't get down to
> > the thread level. We would only consume at the core level - ie, all jobs
>
On 15 September 2017 at 17:09, Dr. Thomas Orgis wrote:
> Hi Zhang,
>
> the default behaviour of slurm is to try to keep the environment
> variables from the submit node. I do not like that and in our
> installation, we urge users to always specify
>
> #SBATCH --export=NONE
>
> to avoid that (or o
> /home/user1/.bashrc has already defined
> many variables; I think these are default variables. Currently I also need
> to source them every time before using them, which is not reasonable in my view.
>
> Is there a way to configure slurm to use the running node's env, not the
> submit node's?
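A minimal submit-script sketch along those lines, assuming the job should pick
up the compute node's own environment rather than the submit node's (the
source line is illustrative and site-specific):

#!/bin/bash
#SBATCH --export=NONE        # do not propagate the submit node's environment
#SBATCH --job-name=env-check
source /etc/profile          # load the execution node's own defaults
printenv | sort > "env_on_$(hostname).txt"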
On 14 September 2017 at 19:41, Chaofeng Zhang wrote:
> On node A, I submit a job file using the sbatch command; the job runs on
> node B. You will find that the output is not the env of node B, it is
> the env of node A.
>
>
>
> #!/bin/bash
>
> #SBATCH --job-name=mnist10
>
> #SBATCH --pa
On 14 September 2017 at 11:06, Lachlan Musicman wrote:
>
> I've just implemented the change from
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 State=UNKNOWN
>
> to
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1
> CoresPe
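For reference, a hedged guess at what the completed line would look like for
the hardware described elsewhere in this thread (1 socket, 4 cores per socket,
2 threads per core = 8 CPUs):

NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN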
On 14 September 2017 at 11:10, Christopher Samuel
wrote:
>
> On 14/09/17 11:07, Lachlan Musicman wrote:
>
> > Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw)
> > SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
>
> Hmm,
On 13 September 2017 at 07:21, Patrick Goetz wrote:
>
> On 09/12/2017 04:21 AM, Gennaro Oliva wrote:
>
>>
>> On Mon, Sep 11, 2017 at 04:51:04PM -0600, Lachlan Musicman wrote:
>>
>>> "Note also if you are running with more than 1 thread per core and
On 13 September 2017 at 10:36, Christopher Samuel
wrote:
>
> On 13/09/17 07:22, Patrick Goetz wrote:
>
> > All I have to say to this is: um, what?
>
> My take has always been that ThreadsPerCore is really for HPC workloads
> where you've decided not to disable HT full stop but want to allocate
>
On 11 September 2017 at 20:11, Gennaro Oliva wrote:
>
> Hi Patrick,
>
> On Fri, Sep 08, 2017 at 01:17:33PM -0600, Patrick Goetz wrote:
> > After some
> > discussion on this list, someone convinced me that setting
> > "ThreadsPerCore=2" informs Slurm that each CPU actually has 8 cores, so I
> > se
Hola,
I was under the impression that environments travelled with slurm when
sbatch was executed - so any node could execute any code as if it was the
env I executed from or built within my sbatch scripts.
We use Environment Modules and this has all worked just great. Very pleased.
Recently I le
On 16 August 2017 at 00:14, Will French wrote:
> > On Aug 15, 2017, at 5:29 AM, Chris Samuel wrote:
> >
> >
> > On Tuesday, 15 August 2017 4:34:55 PM AEST John Hearns wrote:
> >
> >> For the /proc/self you need to start an interactive job under Slurm.
> >
> > You can actually use srun to join an
On 15 August 2017 at 11:38, Christopher Samuel
wrote:
> On 15/08/17 09:41, Lachlan Musicman wrote:
>
> > I guess I'm not 100% sure what I'm looking for, but I do see that there
> > is a
> >
> > 1:name=systemd:/user.slice/user-0.slice/session-373.scope
On 15 August 2017 at 07:41, Robbert Eggermont
wrote:
>
> On 14-08-17 07:50, Lachlan Musicman wrote:
>
>> We have TaskPlugin=task/cgroup and when testing I noticed that the # of
>> threads/cpus being allocated was rounded up to the nearest even. I presume
>> this was du
On 14 August 2017 at 16:22, John Hearns wrote:
> Lachlan, forgive me if I am teaching granny to suck eggs...
> I have recently been workign with cgroups.
> If you run an interactive job what do you see when cat /proc/self/cgroups
> Also have you explored in /sys/fs/cgroups and checked what res
Hola,
Slurm is complicated software, and sometimes the docs can be dense - I'm
looking for some clarification please.
We have a system set up with Threads as CPUs. 1 socket, 4 cores, 2 threads
= 8 cpus
I would like to implement CGroups because some of our users are quite happy
to utilise all thr
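A minimal sketch of the settings involved, assuming slurm.conf and cgroup.conf
roughly as below (values illustrative; check the cgroup.conf man page for your
version):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes        # confine tasks to the cores they were allocated
ConstrainRAMSpace=yes     # confine tasks to the memory they requested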
Hola,
Two things: in the documentation for slurm.conf the reference to ProcTrack
= proctrack/cgroup tells people to see `man cgroup.conf` for more details.
That man page holds no details re proctrack.
https://slurm.schedmd.com/slurm.conf.html
The details in question are on https://slurm.schedmd.
Yep, thanks Chris. I went with regular reboot and have now successfully used
scontrol reboot ASAP
Very handy!
L.
--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared co
In the Name of the Wee Man, so 'reboot' is now a 'legacy tool'
> https://access.redhat.com/solutions/1580343
>
> Jeez... Look HPC compute node - I'm in charge, gottit? Yeah, fight back
> all you like with systemd, but I can pull the power plug.
> Let's se
I've just been asked about implementing a "drain and reboot" for
nodes/partitions.
In slurm.conf, there is a RebootProgram - does this need to be a direct
link to a bin or can it be a command?
RebootProgram=/usr/sbin/reboot
or
RebootProgram='systemctl disable reboot-guard; reboot'
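If RebootProgram has to point at a single executable, one hedged workaround is
a small wrapper script (path and contents illustrative):

#!/bin/bash
# saved as /usr/local/sbin/slurm-reboot.sh and marked executable
systemctl disable reboot-guard
exec /usr/sbin/reboot

with slurm.conf then pointing at it:

RebootProgram=/usr/local/sbin/slurm-reboot.sh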
Cheers
L.
291
> Yuseong-gu, Daejeon
> Republic of Korea 305-701
> Tel. +82-10-2075-6911 <+82%2010-2075-6911>
>
> 2017-08-02 13:05 GMT+09:00 Lachlan Musicman :
>
>> [root@n6 /]# si
>>>
>>> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK
>
> [root@n6 /]# si
>
> PARTITION  NODES  NODES(A/I/O/T)  S:C:T  MEMORY  TMP_DISK
>   TIMELIMIT  AVAIL_FEATURES  NODELIST
>
> debug*     6      0/6/0/6         1:4:2  7785    113264
>   infinite   (null)          c[1-6]
>
> (for a moment)
>
> [root@n6 /]# si
>
> PARTITION
Sumin,
The error message is saying that the node is down.
When you say "works with sinfo", you need to show us what that means -
sinfo is a command that interrogates the state of nodes, whereas srun sends
commands *to* nodes. So sinfo is meant to work - even if the nodes are
down. It is the softw
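To see why Slurm thinks a node is down, something like the following usually
helps (node name taken from your message):

sinfo -N -l                                        # per-node state
scontrol show node n001 | grep -E 'State|Reason'   # Reason says why it is down
systemctl status slurmd                            # run on the node itself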
On 28 July 2017 at 14:30, 허웅 wrote:
> I modified my slurm.conf like :
>
>
>
> NodeName=GO[1-5]
>
>
>
> PartitionName=party Default=yes Nodes=GO[1-5]
>
>
>
> and I restarted slurmctld and slurmd services.
>
>
>
> [root@GO1]~# systemctl start slurmctld
>
> [root@GO1]~# systemctl status slurmctld
>
ty* idle
>
> sgo4      1   party*   idle
>
> sgo5      1   party*   idle
>
> [root@GO1]~# sn
> Fri Jul 28 09:55:53 2017
>    HOSTNAMES
> GO1
> GO2
> GO3
> GO4
>
way out is through, and the only way through is
together. "
*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857
On 28 July 2017 at 10:47, Lachlan Musicman wrote:
> I think it's because hostname is so undemanding.
>
> How many CPUs does each host have?
I think it's because hostname is so undemanding.
How many CPUs does each host have?
You may need to use ((number of cpus per host) + 1) to see action on
another node.
You can try using stress-ng to test higher loads?
https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
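For example, something along these lines should spread load across two nodes
(the stress-ng parameters are illustrative):

# ask for 2 nodes and more tasks than one host has CPUs, then burn CPU for a minute
srun --nodes=2 --ntasks=16 stress-ng --cpu 1 --timeout 60s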
cheers
; WDIR=$PWD
> #SBATCH -t 1:00
>
> the -t 1:00 will get ignored by sbatch
>
>
> On Thu, 29 Jun 2017, Lachlan Musicman wrote:
>
>> We have a 40min default time on our main partition.
>>
>> We are finding that researchers that use
>>
>> #SBATCH -
We have a 40min default time on our main partition.
We are finding that researchers that use
#SBATCH --time=0-07:00:00
are still having their jobs terminated at 40 minutes.
Using slurm 17.2.04 on Centos 7.3
Has anyone else experienced this?
Cheers
L.
--
"Mission Statement: To provide ho
We did it in place, worked as noted on the tin. It was less painful than I
expected. TBH, your procedures are admirable, but you shouldn't worry -
it's a relatively smooth process.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective powe
On 9 June 2017 at 15:26, Lachlan Musicman wrote:
> On 9 June 2017 at 14:53, Nicholas C Santucci wrote:
>
>> Those first two of your Gone list I noticed when 17.02.0 was released on
>> Feb 23.
>> A patch was released on Feb 27 due to the change in dependencies.
>>
wait for it's accounting information to be available and include
>> that
>>
>> On 02/27/2017 01:35 PM, dani wrote:
>>
>> Seems like no obsoletes was set on slurm-contribs, so yum complains of
>> conflicts with slurm-sjobs and friends.
>>
>>
Hola,
I followed the instructions for building the 16.05.0 bz2 and installed the
resulting rpms as follows:
Each node got:
slurm.x86_64
slurm-devel.x86_64
slurm-munge.x86_64
slurm-perlapi.x86_64
slurm-plugins.x86_64
slurm-sjobexit.x86_64
slurm-sjstat.x86_64
slurm-torque.x86_64
The head n
I'm pretty sure you need the MDCS.
Having said that, I know people run GNU Octave on clusters, can't speak to
it though.
R works on a cluster quite nicely.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collectiv
Hi and welcome to SLURM.
It is late and I am tired, but:
1. SLURM is a cluster
2. front end will run the slurmctld service. Compute nodes will run the slurmd
service. How that is divided is up to you.
cheers
L.
--
"Mission Statement: To provide hope and inspiration for collective action,
t
Hola,
I'd like some advice on QOS for clusters.
Currently, we have a broad 90 CPU limit (MaxTRESPU) across all associations.
I have a special partition on which I want no limits to apply, it has 80
CPUs. So, I have created a special QOS, named after the partition, with a
MaxTRESPU of 180 CPUs.
Th
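For reference, a hedged sketch of how such a QOS might be created and tied to
the partition (names and node list are illustrative):

sacctmgr add qos specialq
sacctmgr modify qos specialq set MaxTRESPerUser=cpu=180
# in slurm.conf, attach the QOS to the partition:
PartitionName=special Nodes=... QOS=specialq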
On 24 May 2017 at 13:18, Christopher Samuel wrote:
>
> Hiya,
>
> On 24/05/17 13:10, Lachlan Musicman wrote:
>
> > Occasionally I'll see a bunch of processes "running" (sleeping) on a
> > node well after the job they are associated with has finished.
>
Hola,
Occasionally I'll see a bunch of processes "running" (sleeping) on a node
well after the job they are associated with has finished.
How does this happen - does slurm not make sure all processes spawned by a
job have finished at completion?
cheers
L.
--
"Mission Statement: To provide
One user has recently started to see their jobs killed after roughly 40
minutes, even though they have asked for four hours.
40 minutes is the partition's default, but this user has
#SBATCH --time=04:00:00
in their sbatch file?
I have found this: https://bugs.schedmd.com/show_bug.cgi?id=2353 and w
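A couple of hedged checks that usually narrow this down (job ID and partition
name are illustrative):

scontrol show job 12345 | grep -i timelimit                     # what limit did the job actually get?
scontrol show partition main | grep -iE 'defaulttime|maxtime'   # partition defaults and ceiling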
- Patrice Cullors, *Black Lives Matter founder*
On 23 May 2017 at 09:43, Lachlan Musicman wrote:
> Hola,
>
> One of my users has been given the PartitionTimeLimit reason for his jobs
> not running.
>
> He has requested 20 days for the job, but I don't remember setting a time
&
Hola,
One of my users has been given the PartitionTimeLimit reason for his jobs
not running.
He has requested 20 days for the job, but I don't remember setting a time
limit on any partition?
I do recall setting a default time, but not a time limit.
The docs claim:
https://slurm.schedmd.com/squ
On 11 May 2017 at 08:33, Batsirai Mabvakure wrote:
> Is there a command i can execute for slurm to update automatically without
> having to download it again?
>
>
Not really. Ubuntu packages SLURM IIRC, but you would need to wait until
they do their packaging and push the new version. Currently
rooted in
grief and rage but pointed towards vision and dreams."
- Patrice Cullors, *Black Lives Matter founder*
On 10 May 2017 at 10:57, Lachlan Musicman wrote:
> Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
> session with
>
> srun -w papr-expanded01 --p
Running Slurm 16.05 on CentOS 7.3 I'm trying to start an interactive
session with
srun -w papr-expanded01 --pty --mem 8192 -t 06:00 /bin/bash
--partition=expanded
srun -w papr-expanded01 --pty -t 06:00 /bin/bash --partition=expanded
srun -w papr-expanded01 --pty --mem 8192 /bin/bash --partition=ex
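One common gotcha worth ruling out (hedged): srun stops parsing its own options
at the command, so anything placed after /bin/bash is handed to bash rather
than to srun. Keeping every srun option before the command avoids that, e.g.:

srun -w papr-expanded01 --partition=expanded --mem=8192 -t 06:00 --pty /bin/bash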
On 11 April 2017 at 02:36, Raymond Wan wrote:
>
> For SLURM to work, I understand from web pages such as
> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
> across nodes. Based on this web page, it seems sharing /etc/passwd
> between nodes appears sufficient. The word LDAP
Batsirai Mabvakure
wrote:
> Thank you so much for the reply. Is there another way I can configure the
> nodes other than using mpich that allows me only to update the slurm.conf
> file and not install slurm on every new node every time I scale up?
>
>
> On 2017/02/18, 01:21, "La
You will need to install slurm on the new nodes, as it is installed on the
other nodes.
Since they all must have the same conf file, grab the canonical conf file
and put it on all of the machines - making sure that the new nodes are
listed in that conf file.
Restart the head node (the node runnin
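A hedged sketch of the steps, assuming RPM-based nodes and illustrative
hostnames/paths:

# on each new node: install the same slurm packages and munge key as the others
yum install slurm slurm-munge slurm-plugins          # package names vary by build
scp headnode:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl enable --now slurmd

# on the head node, after the new nodes are listed in slurm.conf:
systemctl restart slurmctld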
I don't know if you can split it at a GRES level, but I would put the node
in the two partitions, and then use QOS to only allow one partition access
to the single card and the other partition 3 cards.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
> > In this slurmctld there are a total of 226 nodes, in several different
> > partitions. The cluster of 64 is the only one where I see this
> > happening. Unless that number of nodes is pushing the limit for a single
> > slurmctld (which I doubt) I'd be inclined
On 17 February 2017 at 03:02, Baker D.J. wrote:
> Hello,
>
>
>
> Thank you for the reply. There are two accounts on this cluster. What I
> was primarily trying to do was define a default QOS with a partition. My
> idea was to use sacctmgr to create an association between the partition,
> the QOS
On 16 February 2017 at 09:36, Christopher Samuel
wrote:
>
> We also have all our partitions (other than our debug one reserved for
> sysadmins) marked as "State=DOWN" in slurm.conf so that they won't start
> jobs when slurmctld is brought back up again.
>
Chris,
What's the reasoning behind this
If you are only in one account, you don't need to list it.
What version of slurm are you using? Someone else mentioned needing to
restart slurmctld for users to stick, which is not something I've
experienced, but try that maybe?
I am presuming that your slurm.conf is set up correctly for accounts?
If you are looking to suspend and resume jobs, use scontrol:
scontrol suspend
scontrol resume
https://slurm.schedmd.com/scontrol.html
The docs you are pointing to look more like taking nodes offline in times
of low usage?
cheers
L.
--
The most dangerous phrase in the language is, "We've
hange to all nodes, restarting slurmctld then running scontrol
reconfigure?
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
> On Sat, Feb 11, 2017 at 3:26 AM Lachlan Musicman
> wrote:
>
>> 1. As EV no
1. As EV noted, to get Memory as a consumable resource, you will need to
add it to the line that says CR_CPU - change to CR_CPU_Memory
https://slurm.schedmd.com/slurm.conf.html
2. That's because of the CR_CPU combined with cons_res. Change to CR_CORE
for per core or CR_SOCKET for per socket. For d
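For reference, the relevant slurm.conf lines might look like this (hedged; pick
the CR_* value that matches how you want resources carved up):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory    # or CR_CPU_Memory / CR_Socket_Memory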
There's always the --dependency flag for sbatch. So yes, depending on what
you wanted, you could line up another sbatch after the first if you liked.
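For example (script names are illustrative; --parsable just prints the job ID):

first=$(sbatch --parsable first_step.sbatch)
sbatch --dependency=afterok:${first} second_step.sbatch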
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 1 February 2017 at 08:38, TO_We
trivial questions: does the node have the correct time wrt the head node? And
is the node correctly configured in slurm.conf? (# of cpus, amount of memory, etc)
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 1 February 2017 at 08:03, E V wrote:
Check they all have the same time, or ntpd against the same server. I
found that the nodes that kept going down had their time out of sync.
Cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 25 January 2017 at 05:49, Allan Streib wr
_clu+
>
>
>
> one thing that I did notice when I add the user I see this error in the
> slurmctld log
>
>
>
> [2017-01-23T16:47:34.351] error: Update Association request from non-super
> user uid=450
>
>
>
> UID 450 happens to be the slurm user
>
Interesting. To the best of my knowledge, if you are using Accounting, all
users actually need to be in an association - ie having a user account is
insufficient.
An Association is a tuple consisting of: cluster, user, account and
(optional) partition.
Is that the problem?
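A hedged example of creating such an association (cluster, account and user
names are illustrative):

sacctmgr add cluster mycluster
sacctmgr add account projectA cluster=mycluster
sacctmgr add user alice account=projectA      # this is what creates the association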
cheers
L.
--
The
We use the SPANK plugin found here
https://github.com/hpc2n/spank-private-tmp
and find it works very well.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 21 January 2017 at 03:15, John Hearns wrote:
> As I remember, in SGE and in PbsPr
Not 100% sure what you are asking? The mail options are available from
within an sbatch script by using the commands you mention.
They can also be passed directly to slurm when invoking the commands
sbatch --mail-type=ALL --mail-user=e...@mail.com
Are you asking if there is a default "always ma
Will,
I believe you do. While they aren't necessary in your case, I believe the
software has been built for maximum extensibility, and as such there needs
to be:
at least one cluster
at least one account
at least one user
and an association is the "grouping" of those three. The relevant part of
Hi David,
I dealt with this recently (see
https://groups.google.com/forum/#!topic/slurm-devel/DKcFng8c1zE for
instance )
In the end we went with this solution that has worked well for us:
https://slurm.schedmd.com/SUG14/private_tmp.pdf
which describes this plugin:
https://github.com/hpc2n/span
On 8 December 2016 at 07:54, Mark R. Piercy wrote:
>
> Is it ever possible to submit jobs based on a users org affiliation? So
> if a user is in org (PI) "smith" then their jobs would automatically be
> sent to a particular partition. So no need to use the -p option in
> sbatch/srun job.
>
M
Hi,
I've had a request from a user about the email system in SLURM. Basically,
there's a team collaboration and the request was:
is there an sbatch command such that two groups will get different sets of
emails.
Group 1: only get the email if the jobs FAIL
Group 2: get Begin, End and Fail
Cheer
Hey Devs,
The new design on the schedmd site is pretty - thanks!
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
> virtualenvs; if that is the case, the switch to a container with rkt seems
> "normal" instead of the more intrusive one-almighty-process-to-rule-everything
> that docker had the last time I checked; it's probably better now.
>
> Saludos.
> Jean
>
> On Tue, Nov 15, 2016 a
Hola,
We were looking for the ability to make jobs perfectly reproducible - while
the system is set up with environment modules with the increasing number of
package management tools - pip/conda; npm; CRAN/Bioconductor - and people
building increasingly more complex software stacks, our users have
On 9 November 2016 at 09:36, Christopher Samuel
wrote:
>
> But /tmp is almost certainly the second worst place (after /dev/shm).
>
I don't know Chris, I think that /dev/null would rate tbh. :)
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- G
Arg, I see now (hit send too soon). My parsing of the man page was wrong.
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 8 November 2016 at 11:39, Lachlan Musicman wrote:
> Priority: Minor
>
> I notice
Priority: Minor
I notice that this command works well:
sinfo -Nle -o '%C %t'
Tue Nov 8 11:38:09 2016
CPUS(A/I/O/T) STATE
40/0/0/40 alloc
38/2/0/40 mix
36/4/0/40 mix
36/4/0/40 mix
6/34/0/40 mix
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idle
0/40/0/40 idl
Peixin,
Again, depends on your OS and deployment methods, but essentially:
In slurm.conf set
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurm-ctld.log
SlurmdLogFile=/var/log/slur
On 8 November 2016 at 07:11, Peixin Qiao wrote:
> Hi,
>
> I install munge and restart my computer, then munge stopped work and
> restarting munge didn't work. It says:
>
> munged: Error: Failed to check pidfile dir "/var/run/munge": cannot
> canonicalize "/var/run/munge": No such file or director
I think it should. Can you send through your slurm.conf?
Also, the logs usually explicitly say why slurmctld/slurmd don't start, and
the best way to judge if slurm is running is with systemd:
systemctl status slurmctld
systemctl status slurmd
cheers
L.
--
The most dangerous phrase in the l
On 28 October 2016 at 09:20, Christopher Samuel
wrote:
>
> On 28/10/16 08:44, Lachlan Musicman wrote:
>
> > So I checked the system, noticed that one node was drained, resumed it.
> > Then I tried both
> >
> > scontrol requeue 230591
> > scontrol resume 2
Morning,
Yesterday we had some internal network issues that caused havoc on our
system. By the end of the day everything was ok on the whole.
This morning I came in to see one job on the queue (which was otherwise
relatively quiet) with the error message/Nodelist Reason (launch failed
requeued he
On 25 October 2016 at 09:17, Tuo Chen Peng wrote:
> Oh ok thanks for pointing this out.
>
> I thought ‘scontrol update’ command is for letting slurmctld to pick up
> any change in slurm.conf.
>
> But after reading the manual again, it seems this command is instead to
> change the setting at runti
On 25 October 2016 at 08:42, Tuo Chen Peng wrote:
> Hello all,
>
> This is my first post in the mailing list - nice to join the community!
>
Welcome!
>
>
> I have a general question regarding slurm partition change:
>
> If I move one node from one partition to the other, will it cause any
> im
On 21 October 2016 at 12:39, Christopher Samuel
wrote:
>
> On 21/10/16 12:29, Andrew Elwell wrote:
>
> > When running sreport (both 14.11 and 16.05) I'm seeing "duplicate"
> > user info with different timings. Can someone say what's being added
> > up separately here - it seems to be summing some
I've had consistent success with the documented system - "rpmbuild
slurm-.tgz" then yum installing the resulting files, using 15.x,
16.05 and 17.02.
Have on occasion needed to recompile - hdf5 support and for non main line
plugins, but otherwise it's been pretty easy.
Will happily support/debug y
Mike, I would suggest that the limit is a SLURM limit rather than a ulimit.
What is the result of
scontrol show config | grep Mem
?
Because you have set your
SelectTypeParameters=CR_Core_Memory
Memory will cause jobs to fail if they go over the default memory limit.
The SLURM head will kill j
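For reference, the pieces that interact here (values illustrative): a
default/maximum per-CPU memory in slurm.conf, and the per-job request in the
batch script:

# slurm.conf
DefMemPerCPU=2048      # MB assumed per CPU when a job does not request memory
MaxMemPerCPU=8192      # hard per-CPU ceiling

# in the job script
#SBATCH --mem=16G      # request what the job actually needs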
partition
- jobs running on that partition will continue to do so
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 12 October 2016 at 10:35, Lachlan Musicman wrote:
> Hola,
>
> For reasons, our IT team ne
Hola,
For reasons, our IT team needs some downtime on our authentication server
(FreeIPA/sssd).
We would like to minimize the disruption, but also not lose any work.
The current plan is for the nodes to be set to DRAIN on Friday afternoon
and on Monday morning we will suspend any running jobs, m
Check against the installed libs? check *-devel? Otherwise I'm not 100%
sure - unless the rpmbuild folder with all files still exists and there's
something in there?
FWIW, it's relatively easy to install all the libs that SLURM needs without
causing too many problems. The hardest I've found so far
Hola,
Just built the rpms as per the installation docs.
Noted that there were three new rpms:
slurm-openlava-16.05.5-1.el7.centos.x86_64.rpm
slurm-pam_slurm-16.05.5-1.el7.centos.x86_64.rpm
slurm-seff-16.05.5-1.el7.centos.x86_64.rpm
Is that due to a more sophisticated build machine or due to a
Jose,
Do all the nodes have access to either a shared /usr/lib64/slurm or do they
each have their own? And is there a file in that dir (on each machine)
called select_cons_res.so?
Also, when changing slurm.conf here's a quick and easy workflow:
1. change slurm.conf
2. deploy to all machines in c
We've always done it this
way."
- Grace Hopper
>
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> <http://www.nersc.gov>
> dmjacob...@lbl.gov
>
> - __o
> -- _ '\
wrote:
>
> On 30/08/16 12:39, Lachlan Musicman wrote:
>
> > Oh! Thanks.
> >
> > I presume that includes sruns that are in an sbatch file.
>
> Yup, that's right.
>
> cheers!
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
>