to minutes */
> for (i=0; i
>
> diff -Naru slurm-16.05.9/src/sshare/sshare.h slurm-16.05.9.change/src/sshare/sshare.h
> --- slurm-16.05.9/src/sshare/sshare.h        2017-01-31 11:55:41.0 -0800
> +++ slurm-16.05.9.change/src/sshare/sshare.h 2017-02-08 15:
[2017-01-31T09:45:22.329] debug2: node_did_resp r1-02
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-04
> > > [2017-01-31T09:45:22.329] debug2: node_did_resp r1-01
> > > [2017-01-31T09:45:22.341] debug2: Processing RPC: MESSAGE_NODE_REGISTRATI
running 'sinfo' for instance, it merely hangs.
> The interfaces for both slurmctld controllers are in the 'trusted' firewall
> group and there is no filtering between them.
> Is there something I am missing to make the backup controller 'kick in' and
> star
This is on our testing/development grid systems so we can easily make
> changes to debug/fix the problem.
>
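(For reference, the failover pair would have been declared in slurm.conf of that era along these lines; the hostnames here are hypothetical:

    ControlMachine=ctl-primary
    BackupController=ctl-backup
)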
--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
n the DB. Is it safe to run "arbitrary" commands in the
> DB, bypassing slurmdbd?
>
> Thanks in advance.
>
>
> -- lv.
>
about mpi:
> >srun -n 2 echo "Hello"
> Hello
> Hello
>
> How can I resolve the problem of srun, and make it behave like sbatch or
> salloc, where the program is executed only once?
>
> The version of slurm is 16.05.3, and
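(For context: srun launches one copy of the command per task, so -n 2 prints twice by design. A single execution needs a single task, sketched here, or a batch/salloc script, which itself runs only once:

    srun -n 1 echo "Hello"
)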
Thanks,
Paddy
tput
> is too verbose and gets intermingled with the actual job output (the head
> node seems to execute before the user's job runs but prolog seems to
> execute in parallel with the user's job on the other nodes). The bigger
> problem is all the nodes ssh-ing to all the other node
On Thu, Jan 05, 2017 at 02:06:58AM -0800, Riccardo Murri wrote:
>
> (Paddy Doyle, Thu, Jan 05, 2017 at 01:19:57AM -0800:)
> >
> > On Wed, Jan 04, 2017 at 10:26:06PM -0800, Riccardo Murri wrote:
> > >
> > > Thanks for all the suggestions. Everything worked
ve already gone into production with the new setup, why bother? (Ok
the users might notice and get worried; and yes if they are sorting
slurm-nnn.out files based on name and not date, then it's a pain)
Paddy
r Science IT
> University of Zurich
> Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
> Tel: +41 44 635 4208
> Fax: +41 44 635 6888
terra 122858328+1
> yangliu  hprc  Operator  terra  hprc
>
> 1
>
> ]$ sacctmgr list account
>    Account      Descr        Org
> ---------- ---------- ----------
>
> 1228
hieve that?
>
> If yes, is there another way to ensure that for each normal job a full
> node is allocated?
>
> Thanks
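(One common approach, for what it's worth: submit with the --exclusive flag, or set the partition itself to exclusive sharing. A minimal sketch, script name hypothetical:

    sbatch --exclusive job.sh
)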
error: slurmdbd: DBD_SEND_MULT_JOB_START failure:
> Connection refused
>
> This was a running system and we just pushed out an update from 15.08.10 to
> 15.08.12
As a wild guess, was the old daemon still running (and still listening on port
6819)?
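A quick way to check on the dbd host (standard tools assumed):

    ss -tlnp | grep 6819      # anything still bound to the slurmdbd port?
    ps aux | grep slurmdbd    # any leftover daemon process?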
Paddy
m is to run native applications, as we are not
> interested in the offload mode.
This old thread might be of help:
https://groups.google.com/forum/#!searchin/slurm-devel/xeon$20phi|sort:relevance/slurm-devel/0bnMLfV1qA8/de1a89yEl4sJ
Paddy
ute-7-1
>
>
> Clearly user clwalton is a valid user and has jobs running, but if I try to
> specify him, squeue isn't happy. It is fine with other users...
> What would cause this?
>
> Brian Andrus
> ITACS/Research Computing
> Naval Postgraduate School
> Monterey,
(# jobs x # tasks) into account; perhaps a job
submit plugin?
Paddy
.
PurgeStepAfter=1month
PurgeJobAfter=12month
Do you need to run detailed historical job reports, or push the data to another
system (you mentioned XDMoD in a different thread)?
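(If the history is still needed elsewhere, slurmdbd can archive records before purging them; a sketch, with a hypothetical directory:

    ArchiveDir=/var/spool/slurm/archive
    ArchiveJobs=yes
    ArchiveSteps=yes
)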
Paddy
nts it can perform relaying for). For postfix, that
would be something like 'relay_domains'.
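A minimal sketch for postfix main.cf, assuming a hypothetical cluster domain:

    relay_domains = $mydestination, cluster.example.org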
Paddy
www.bcamath.org/eanasagasti
> (mathematics beyond borders)
>
> On 09/09/16 10:17, Paddy Doyle wrote:
> >Hi Eneko,
> >
> >On Fri, Sep 09, 2016 at 12:12:32AM -0700, Eneko Anasagasti wrote:
>
> >> and script parameters to run a job array on a single node/server
> >> workstation, with more than one concurrent task of the job
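(For reference, a job array can cap its own concurrency with the % suffix; the script name and numbers here are hypothetical:

    sbatch --array=1-100%8 job.sh    # at most 8 array tasks run at once
)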
t know what 'slurmreport' is, but that looks like a Perl include path
error. Maybe you need to install the slurm-perlapi package?
Thanks,
Paddy
On Wed, Aug 17, 2016 at 03:20:20AM -0700, Adrian Sevcenco wrote:
>
> On 08/17/2016 08:41 AM, Paddy Doyle wrote:
> >
> > Hi Adrian,
> Hi!
>
> yeah, I thought that setting CPUs=8 would give that machine
> 8 job slots
>
> > You should define
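(The advice presumably continued with a node definition; a sketch with assumed values:

    NodeName=localhost CPUs=8 RealMemory=16000 State=UNKNOWN
    PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
)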
Hi Adrian,
On Tue, Aug 16, 2016 at 08:11:41AM -0700, Adrian Sevcenco wrote:
>
> Hi! I have trouble understanding the definition of job slots for each node ...
> and I am running slurm only on my desktop to get used to it until I move it
> to the clusters .. it is not clear to me how one can def
On Fri, Jul 22, 2016 at 09:42:38AM -0700, P. Larry Nelson wrote:
>
> Paddy, that was the problem! Many thanks!
Good stuff. :)
> However, slurmd -C still reports (null) for ClusterName.
>
> Not a big issue, but I'm curious why, since slurm.conf has
> ClusterName=dorfman
>
> Something else I'
7-244-9855) | IT Administrator
> 457 Loomis Lab | High Energy Physics Group
> 1110 W. Green St., Urbana, IL | Physics Dept., Univ. of Ill.
> MailTo: lnel...@illinois.edu | http://hep.physics.illinois.edu/home/lnelson/
> --
> "Information without accountability is just noise." - P.L. Nelson
>
negative
> after they reach zero but the association usage decays and the actual usage as
> seen by sreport is
> allowed to continue to increase?
>
> Best regards
>
> Stuart
>
> On 23/04/15 10:24, Paddy Doyle wrote:
> > This works reasonably well for us, but w
e the job executed on the cluster (on which compute nodes) and
> how many (much) resources the job used.
>
> I've set up some prototype slurm prolog and epilog scripts and included
> some write (echo) statements; however, I don't see any of the information in
> job output fi
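(Prolog/epilog stdout generally isn't routed into the job's output file; writing to a log of its own is one workaround. A minimal sketch, path hypothetical:

    #!/bin/bash
    # hypothetical prolog: record where the job starts, outside the job's stdout
    echo "$(date) job ${SLURM_JOB_ID} starting on $(hostname)" >> /var/log/slurm/prolog.log
)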
Doh! Sorry, please ignore: we have a reservation in place for a user starting
today, and so obviously those nodes are left idle as backfill can't start the
longer jobs.
Paddy
On Fri, May 20, 2016 at 09:12:29AM +0100, Paddy Doyle wrote:
> Forgot to mention, it's slurm version
Forgot to mention, it's slurm version slurm-15.08.7
On Fri, May 20, 2016 at 09:03:06AM +0100, Paddy Doyle wrote:
> Hi all,
>
> We're seeing a really strange scheduling issue on one of our clusters, whereby
> jobs are not being scheduled, even though there are many id
0:00  6  kelvin-n[014-015,025-026,028-029]
86677  compute  GdDC_25_  lucida  RUNNING  13:16:34  2-00:00:00  6  kelvin-n[007-012]
hought I'd mention it anyway as no one
> likes errors in their logs! :)
>
> We were using GrpCPURunMins to limit resource use to accounts, that seems to
> be handled by GrpTRESRunMin now.
>
> Let me know if you need more info!
>
> Chris
>
> --
> Christophe
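(For reference, the TRES equivalent is set via sacctmgr along these lines; the account name and value are hypothetical:

    sacctmgr modify account name=foo set GrpTRESRunMins=cpu=100000
)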
uld make a difference with
shared nodes.
I think for now we'll just have to watch the users and ask them to play nicely
if it comes up again. :)
Thanks,
Paddy
me group in this
instance, so we can ask them to play nicely with each other).
But it would be good if there was a way to harden the priority system against
it. I've looked in slurm.conf and can't see any parameter which might be
relevant.
Or is the current behaviour a desired feature f
--
> >>Maciej Olchowik
> >>HPC Systems Administrator
> >>KAUST Supercomputing Laboratory (KSL)
> >>Al Khawarizmi Bldg. (1) Room 0134
> >>Thuwal, Kingdom of Saudi Arabia
> >>tel +966 2 808 0684
> >>
rnoon,
> >>
> >>I apologize for the newb question but I'm setting up slurm
> >>for the first time in a very long time. I've got a small cluster
> >>of a master node and 4 compute nodes. I'd like to install
> >>slurm on an NFS file system th
launcher, which takes care of those details I
think.
mpiexec.hydra -bootstrap slurm ...
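For instance (task count and binary name are hypothetical):

    mpiexec.hydra -bootstrap slurm -n 4 ./my_mpi_app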
Hopefully it will be of use to you.
Thanks,
Paddy
Hi Gerben,
On Sun, Aug 17, 2014 at 01:26:12PM -0700, Gerben Roest wrote:
>
> I run a slurmctld and slurmdbd on a Scientific Linux (SL) 5 server and
> have three SL6 nodes, all running Slurm 14.03.6, with one node behind
> another slurmctld on another cluster. The whole slurm setup seems to run
On Fri, Aug 15, 2014 at 01:49:58AM -0700, Bjørn-Helge Mevik wrote:
>
> We're testing slurm 14.03.6 on a Rocks 6.1/CentOS 6.3 system. configure
> and make succeeds, but make check fails:
>
> /bin/sh: ../../../auxdir/test-driver: No such file or directory
> make[6]: *** [api-test.log] Error 127
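(A missing test-driver usually means the tree was generated with a different automake than the one installed; regenerating the build files often clears it. For instance:

    autoreconf -fi
)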
> Would you mind posting here the steps to cross-compile slurm for xeon phi, or
> any URL? The procedure isn't described in the linkedin discussion.
>
>
> Thanks in advance
>
>
> On 05/03/2014, at 11:57, Paddy Doyle wrote:
>
> >
> > Hi all,
> >
>
obvious question, but have you set the nodes to be 'resume' or 'idle'
using scontrol since then? In our setup at least, once a node is marked 'down',
we have to manually clear it to either 'resume' or 'idle'.
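For example (node name hypothetical):

    scontrol update NodeName=node01 State=RESUME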
Paddy
ORD": No such
> file or directory (retrying ...)
> sbatch: error: Munge encode failed: Failed to access "PASSWORD": No such
> file or directory (retrying ...)
> sbatch: error: Munge encode failed: Failed to access "PASSWORD": No such
> file or directory
>
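(Those Munge errors suggest munge itself is unhappy on the submit host; the stock round-trip test is a quick check:

    munge -n | unmunge
)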
or
> >> VLSCI - Victorian Life Sciences Computation Initiative
> >> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> >> http://www.vlsci.org.au/ http://twitter.com/vlsci
> >>
>
-- -- --
>
> 57  hostname  production  root  1  COMPLETED  0:0
> 58  hostname  production  root  1  COMPLETED  0:0
> 59  hostname  production        1  COMPLETED  0:0
>
Comment=YES
>
> ClusterName=cluster
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3
> NodeName=ssfslurmc0[1] Procs=2 RealMemory=2006 State=UNKNOWN
> PartitionName=debug Nodes=ssfslurmc0[1] Default=NO MaxTime=INFINITE State=UP
> PartitionName=production Nodes=ssfslurmc0[1] Default=YES MaxTime=INFINITE State=UP
>
> Thanks for your assistance with this.
>
> Regards,
> -J
the same issue. The patch should work with 2.4 if you didn't want to
> wait for 2.5.
Thanks Danny, that looks just like what I was shooting for. Will try it out.
Thanks!
Paddy
>
> Danny
>
> On 11/06/2012 10:35 AM, Paddy Doyle wrote:
> > Hi again,
> >
> >
ge checks, similar to my previously proposed
patch.
Any thoughts / comments?
Thanks,
Paddy
't actually checking to see if limits were being enforced before killing the
job.
See attached a patch which checks to see if limits or qos are enforced before
killing the job. I've tested it with 2.4.3 and it does what I expect - haven't
tried 2.4.4, but the job_time_limit() logic
ture=thin,mem24GB,ibsw8 Weight=1
> NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9 Weight=1
> NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10 Weight=1
> NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11 Weight=1
>
> PartitionName=all Nodes=q[1-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=DOWN
> PartitionName=core Nodes=q[45-348] Default=YES Shared=NO MaxTime=14400 MaxNodes=1 State=UP
> PartitionName=node Nodes=q[1-32,45-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=UP
> PartitionName=devel Nodes=q[33-44] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=60 MaxNodes=4 State=UP
>