[slurm-dev] Re: Slurm versions 17.02.5 and 17.11.0-pre1 are now available
> Slurm version 17.11.0-pre1 is the first pre-release of version 17.11, to
> be released in November 2017. This version contains the support for
> scheduling of a workload across a set (federation) of clusters which is
> described in some detail here:
> https://slurm.schedmd.com/SLUG16/FederatedScheduling.pdf

Something that seems to be missing in the PDF (unless it's in the "Magic: TBD" part) is the ability for a federated job to have dependencies on sibling jobs - is this still part of the workflow?

ie
Federation = MySite
sibling cluster 1 = BigCray
sibling cluster 2 = PrePostCluster

Ideally we'd like a user, who probably logged into BigCray as their local cluster, to submit a job with
step1 - serial work on PrePostCluster
step2 - large srun on BigCray, dependency=afterok:PrePostCluster:step1
step3 - small parallel cleanup on PrePostCluster, dependency=afterok:BigCray:step2

Is this still on schedule for the initial 17.11 release or will it land in a later update or release?

Andrew
(trying to work out if we'll have time to test 17.11 before upgrading all clusters the 1st week in Jan)
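The three steps above can be sketched as a submission script. Note the assumptions: that federated job IDs are unique across the federation so a plain afterok dependency can reference a sibling job, and that `-M` targets the sibling cluster. The cluster-qualified "afterok:Cluster:step" syntax in the question isn't in the published docs, so this sketch sticks to plain job IDs; script names are made up.

```shell
# Hypothetical federated workflow; sbatch calls are shown commented
# since they need a live federation. sbatch prints
# "Submitted batch job <id>", so capture the id like this:
parse_jobid() { awk '{print $4}'; }

# jid1=$(sbatch -M PrePostCluster step1_serial.sh | parse_jobid)
# jid2=$(sbatch -M BigCray --dependency=afterok:$jid1 step2_srun.sh | parse_jobid)
#        sbatch -M PrePostCluster --dependency=afterok:$jid2 step3_cleanup.sh

# demonstrate the id extraction on sbatch's usual output line:
echo "Submitted batch job 12345" | parse_jobid
```

Whether the dependency is honoured when the two jobs live on different siblings is exactly the open question in this thread.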
[slurm-dev] Re: rpmbuild from tarball
Yes - the *schedmd* supplied tarballs (as you point to in the wiki page) work fine. The *github* ones don't. If you compare them, the github ones have a different slurm.spec, hence my asking if there was another script needed to sed them.

On 19 June 2017 at 14:01, Ole Holm Nielsen wrote:
> On 06/19/2017 05:36 AM, Andrew Elwell wrote:
>> I've just tried and failed to get the github release
>> (https://github.com/SchedMD/slurm/releases) of 16.05.10-2 to build
>> using the 'rpmbuild -ta tarball' trick - it's failing on line 88 of
>> the spec
>>
>> ie
>>> Name:    see META file
>>> Version: see META file
>>> Release: see META file
>
> Works like a charm on CentOS 7.3! Do you have all the prerequisites
> installed? See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms
>
> /Ole
[slurm-dev] rpmbuild from tarball
I've just tried and failed to get the github release
(https://github.com/SchedMD/slurm/releases) of 16.05.10-2 to build
using the 'rpmbuild -ta tarball' trick - it's failing on line 88 of the spec

ie
> Name:    see META file
> Version: see META file
> Release: see META file

however the tarball from the schedmd website has them hard coded

aelwell@badger:~/compile$ diff -ur slurm-slurm-16-05-10-2/ slurm-16.05.10-2/
diff -ur slurm-slurm-16-05-10-2/META slurm-16.05.10-2/META
--- slurm-slurm-16-05-10-2/META	2017-03-03 08:42:11.0 +0800
+++ slurm-16.05.10-2/META	2017-03-03 08:52:56.0 +0800
@@ -1,36 +1,11 @@
-##
-# Metadata for RPM/TAR makefile targets
-##
-# See src/api/Makefile.am for guidance on setting API_ values
-##
- Meta: 1
- Name: slurm
- Major: 16
- Minor: 05
- Micro: 10
- Version: 16.05.10
- Release: 2
-# Include leading zero for all pre-releases
-
-##
-# When making a new Major/Minor version update
-# src/common/slurm_protocol_common.h
-# with a new SLURM_PROTOCOL_VERSION signifing the old one and the version
-# it was so the slurmdbd can continue to send the old protocol version.
-# In src/common/slurm_protocol_util.c check_header_version()
-# need to be updated also when changes are added also.
-# In src/plugins/slurmctld/nonstop/msg.c needs to have version_string updated.
-# The META of libsmd needs to reflect this version and API_CURRENT as well.
-#
-# NOTE: The API version can not be the same as the Slurm version above. The
-#	version in the code is referenced as a uint16_t which if 1403 was the
-#	API_CURRENT it would go over the limit. So keep is a relatively
-#	small number.
-#
-# NOTE: The values below are used to set up environment variables in
-#	the config.h file that may be used throughout Slurm, so don't remove
-#	them.
-##
- API_CURRENT: 30
- API_AGE: 0
- API_REVISION: 0
+ Api_age: 0
+ Api_current: 30
+ Api_revision: 0
+ Major: 16
+ Meta: 1
+ Micro: 10
+ Minor: 05
+ Name: slurm
+ Release: 2
+ Release_tags: dist
+ Version: 16.05.10
diff -ur slurm-slurm-16-05-10-2/slurm.spec slurm-16.05.10-2/slurm.spec
--- slurm-slurm-16-05-10-2/slurm.spec	2017-03-03 08:42:11.0 +0800
+++ slurm-16.05.10-2/slurm.spec	2017-03-03 08:52:54.0 +0800
@@ -85,15 +85,15 @@
 %slurm_with_opt sgijob
 %endif
 
-Name:    see META file
-Version: see META file
-Release: see META file
+Name:    slurm
+Version: 16.05.10
+Release: 2%{?dist}
 
 Summary: Slurm Workload Manager
 
 License: GPL
 Group: System Environment/Base
-Source: %{name}-%{version}-%{release}.tgz
+Source: slurm-16.05.10-2.tar.bz2
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}
 URL: http://slurm.schedmd.com/
@@ -431,7 +431,7 @@
 #
 %prep
-%setup -n %{name}-%{version}-%{release}
+%setup -n slurm-16.05.10-2
 
 %build
 %configure \
@@ -648,8 +648,8 @@
 Cflags: -I\${includedir}
 Libs: -L\${libdir} -lslurm
 Description: Slurm API
-Name: %{name}
-Version: %{version}
+Name: slurm
+Version: 16.05.10
 EOF
 
 %if %{slurm_with bluegene}

Is there a magic script that sets these?

Andrew
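For comparison, a "fill in the spec from META" step along these lines may be roughly what happens at release time before the tarball is published. This is a guess, not SchedMD's actual tooling, and the two files below are minimal mock-ups standing in for the real tarball contents:

```shell
# Guess at substituting the "see META file" placeholders using values
# parsed out of META; mock files stand in for the real tarball.
set -e
tmp=$(mktemp -d)
cat > "$tmp/META" <<'EOF'
  Version:  16.05.10
  Release:  2
EOF
printf 'Name:    see META file\nVersion: see META file\nRelease: see META file\n' > "$tmp/slurm.spec"

ver=$(awk '$1=="Version:" {print $2}' "$tmp/META")
rel=$(awk '$1=="Release:" {print $2}' "$tmp/META")
sed -i -e "s/^Name:.*see META file/Name:    slurm/" \
       -e "s/^Version:.*see META file/Version: $ver/" \
       -e "s/^Release:.*see META file/Release: $rel%{?dist}/" "$tmp/slurm.spec"
cat "$tmp/slurm.spec"
```

After the substitution the spec carries the hard-coded Name/Version/Release, matching what the schedmd-website tarball ships.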
[slurm-dev] Re: Fwd: how to perform a DB upgrade?
> 4. Start the new slurmdbd

Do this part by hand (ie, slurmdbd -Dvvv) as it takes longer than an init script / systemctl allows for it to start, due to the migration, and it'll be flagged as 'failed'.

When I did this for our test cluster, the initial startup of slurmdbd took ~30 mins or so to update from 14.11.x to 16.05.x. Once that was done, I ctrl-C'd the slurmdbd and started it as normal via "systemctl start slurmdbd"

(yeah I know. systemd...)
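The procedure above as an annotated sketch. A MySQL backend and a systemd-based distro are assumptions about the setup, and the commands are shown commented since they need a live dbd host:

```shell
# Hand-run slurmdbd upgrade sketch -- adjust db name / user to taste.

#   mysqldump -u slurm -p slurm_acct_db > acct_db.pre-upgrade.sql  # backup first
#   slurmdbd -D -vvv    # foreground: the schema conversion can take ~30 min
#   # when the -vvv log shows the conversion is done, Ctrl-C, then:
#   systemctl start slurmdbd

# (echo kept so this sketch is runnable as-is)
echo "backup db, run 'slurmdbd -D -vvv' until conversion completes, then start via systemd"
```

The key point is step two: running in the foreground sidesteps the service manager's start timeout entirely.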
[slurm-dev] 404 on webpages
Paging schedmd peeps: http://slurm.schedmd.com/man_index.html leads to a bunch of 404's
[slurm-dev] slurmdbd user across multiple clusters
In the docs for slurmdbd 16.05 it states[1]:

  SlurmUser
    The name of the user that the slurmctld daemon executes as. This user
    must exist on the machine executing the Slurm Database Daemon and have
    the same user ID as the hosts on which slurmctld execute. For security
    purposes, a user other than "root" is recommended. The default value
    is "root".

however, what should this be set to when we have a mixed cluster? The "normal" cluster nodes have SlurmUser=slurm, but for the Crays (running ALPS) this needs to be SlurmUser=root [2]

Will Bad Things(TM) happen if I leave this set to SlurmUser=slurm on our DBD host?

[1] http://slurm.schedmd.com/slurmdbd.conf.html
[2] http://slurm.schedmd.com/cray_alps.html

Andrew
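To make the question concrete, the setting in question lives in slurmdbd.conf on the dbd host. A minimal sketch with illustrative values only (the hostname is hypothetical, and whether slurm vs root matters when ALPS clusters report in is exactly what's being asked):

```
# slurmdbd.conf on the central dbd host -- illustrative values only
AuthType=auth/munge
DbdHost=ae-dbd01            # hypothetical hostname
SlurmUser=slurm             # the value being asked about
StorageType=accounting_storage/mysql
StorageUser=slurm
```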
[slurm-dev] Re: sreport "duplicate" lines
> Looks like you've somehow created partition specific associations for
> some people - not something we do at all.

ISTR this was because 2.6 didn't let us have an overall restriction for the cluster and a sub-restriction on the number of jobs to run in a (debug) partition.

I could understand that in the case of 'maali' which only has a top-level assoc, but not wdavey who has the same as me but only one line showing in sreport.

Couldn't see any obvious options to pass to sreport format="cluster,u,l," to display the missing difference, ie

sreport cluster AccountUtilizationByUser -t h start=2016-07-01 end=2016-10-01 cluster=magnus user=aelwell tree format="cluster,u,l,partition"
Unknown field 'partition'

The magic seems to be in format_list in
https://github.com/SchedMD/slurm/blob/master/src/sreport/cluster_reports.c
but I'm still trying to work out where...

Andrew
[slurm-dev] Re: sreport "duplicate" lines
Yep, and for that particular account, not all of the members are showing twice - I can't work out what causes it.

Cluster/Account/User Utilization 1 Jul 00:00 - 30 Sep 23:59 (7948800 secs)
Time reported in CPU Hours
  Cluster    Account     Login      Proper Name     Used  Energy
--------- ---------- --------- ----------------- -------- -------
   magnus pawsey0001                                397565       0
   magnus pawsey0001     achew       Ashley Chew       175       0
   magnus pawsey0001     achew       Ashley Chew       136       0
   magnus pawsey0001   aelwell     Andrew Elwell         2       0
   magnus pawsey0001   aelwell     Andrew Elwell      2236       0
   magnus pawsey0001 bskjerven    Brian Skjerven       275       0
   magnus pawsey0001 bskjerven    Brian Skjerven         9       0
   magnus pawsey0001   charris   Christopher Ha+       309       0
   magnus pawsey0001   charris   Christopher Ha+        12       0
   magnus pawsey0001     cyang     Charlene Yang       830       0
   magnus pawsey0001     cyang     Charlene Yang      1912       0
   magnus pawsey0001    darran      Darran Carey         7       0
   magnus pawsey0001    darran      Darran Carey        30       0
   magnus pawsey0001 ddeeptim+   Deva Deeptimah+      2212       0
   magnus pawsey0001 ddeeptim+   Deva Deeptimah+      4130       0
   magnus pawsey0001     maali        Black Swan       170       0
   magnus pawsey0001    moshea       Mark O'Shea         1       0
   magnus pawsey0001    moshea       Mark O'Shea         4       0
   magnus pawsey0001   mshaikh   Mohsin Ahmed S+      1460       0
   magnus pawsey0001   mshaikh   Mohsin Ahmed S+      4538       0
   magnus pawsey0001     pryan         Paul Ryan      1611       0
   magnus pawsey0001     pryan         Paul Ryan     14397       0
   magnus pawsey0001    reaper   Daniel Grimwood        17       0
   magnus pawsey0001    reaper   Daniel Grimwood       225       0
   magnus pawsey0001    wdavey     William Davey         0       0

$ sacctmgr show assoc user=maali account=pawsey0001 cluster=magnus
Cluster Account User Partition Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS GrpCPURunMins
magnus pawsey0001 maali        parent 32 128 normal

$ sacctmgr show assoc user=wdavey account=pawsey0001 cluster=magnus
Cluster Account User Partition Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS GrpCPURunMins
magnus pawsey0001 wdavey        parent 32 128 normal
magnus pawsey0001 wdavey debugq parent  1   4 normal
magnus pawsey0001 wdavey workq  parent 32 128 normal

Should I log this with our vendor (for official support) or directly into the schedmd BZ?
[slurm-dev] sreport "duplicate" lines
Hi folks,

When running sreport (both 14.11 and 16.05) I'm seeing "duplicate" user info with different timings. Can someone say what's being added up separately here - it seems to be summing something differently for me and I can't work out what makes it split into two:

$ sreport cluster AccountUtilizationByUser start=2016-07-01 end=2016-10-01 account=pawsey0001 user=aelwell cluster=magnus -t h
Cluster/Account/User Utilization 1 Jul 00:00 - 30 Sep 23:59 (7948800 secs)
Time reported in CPU Hours
  Cluster    Account   Login    Proper Name     Used  Energy
--------- ---------- ------- -------------- -------- -------
   magnus pawsey0001 aelwell  Andrew Elwell        2       0
   magnus pawsey0001 aelwell  Andrew Elwell     2236       0

$ sacctmgr show assoc user=aelwell account=pawsey0001 cluster=magnus
Cluster Account User Partition Share GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS GrpCPURunMins
magnus pawsey0001 aelwell        parent 32 128 normal
magnus pawsey0001 aelwell debugq parent  1   4 normal
magnus pawsey0001 aelwell workq  parent 32 128 normal

is there a way of getting sreport to just give an aggregated total for me, or if not, show why one "usage" is 2h, and the other 2236h?

Andrew
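One workaround sketch until the split is understood: take parsable output (sreport's -P/--parsable2 flag) and sum the duplicate rows in awk. The column order is assumed from the report header above, so verify it against your sreport version; the sreport invocation itself is shown commented and the awk is demonstrated on the rows from the question:

```shell
# Sum duplicate per-user rows from sreport --parsable2 output.
# Assumed column order: Cluster|Account|Login|Proper Name|Used|Energy
sum_used() { awk -F'|' 'NR > 1 { used[$3] += $5 } END { for (u in used) print u, used[u] }'; }

# real use would be roughly:
#   sreport -P cluster AccountUtilizationByUser start=2016-07-01 \
#           end=2016-10-01 account=pawsey0001 cluster=magnus -t h | sum_used

sum_used <<'EOF'
Cluster|Account|Login|Proper Name|Used|Energy
magnus|pawsey0001|aelwell|Andrew Elwell|2|0
magnus|pawsey0001|aelwell|Andrew Elwell|2236|0
EOF
```

This only papers over the symptom of course - it doesn't explain why the 2h/2236h split exists.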
[slurm-dev] Re: Packaging for fedora (and EPEL)
> I've had consistent success with the documented system - "rpmbuild
> slurm-.tgz" then yum installing the resulting files, using 15.x,
> 16.05 and 17.02.

Yup, it seems to build well enough but then fails a few picky rpmlint rules - nothing too major, and it *could* be worked around with patches, but hey, I'd rather get 'em upstream (lazy future maintainer)

Andrew
[slurm-dev] Packaging for fedora (and EPEL)
Hi folks,

I see from https://bugzilla.redhat.com/show_bug.cgi?id=1149566 that there have been a few unsuccessful attempts to get slurm into fedora (and potentially EPEL). Is anyone on this list actively working on it at the moment? I'll update the bugzilla ticket to prod the last potential packager, but failing that I'm offering to work on it.

My plan is to get 16.05 into fedora, but not into EPEL itself (the supported life of a given release is just too short to match with the RHEL timeline); however I'll probably make "unofficial" srpms publicly available that should meet all the epel packaging requirements.

schedmd people - as some of this may involve patches to the spec file amongst other things, what's the best way to progress this - attach a diff to something on your bugzilla page rather than a git pull req?

Andrew
[slurm-dev] Re: rpm dependencies in 16.05.5
> I have a Wiki page describing how to install Munge and Slurm on CentOS 7:

Thanks Ole, there's some good notes in there I'll use. My original question was more a packaging issue - in this case I don't mind installing the rest of the slurm binaries, but ideally I'd like our slurmdbd host to be just that: slurmdbd alone (and as few other installed applications as possible).

%if %{slurm_with munge}
%package munge
Summary: Slurm authentication and crypto implementation using Munge
Group: System Environment/Base
Requires: slurm munge
BuildRequires: munge-devel munge-libs
Obsoletes: slurm-auth-munge
%description munge
Slurm authentication and crypto implementation using Munge. Used to
authenticate user originating an RPC, digitally sign and/or encrypt messages
%endif

> Requires: slurm munge

seems to be the culprit
[slurm-dev] rpm dependencies in 16.05.5
Hi folks,

I've just built 16.05.5 into rpms (using the rpmbuild -ta slurm*.tar.bz2 method) to update a CentOS 7 slurmdbd host. According to http://slurm.schedmd.com/accounting.html

"Note that SlurmDBD relies upon existing Slurm plugins for authentication and Slurm sql for database use, but the other Slurm commands and daemons are not required on the host where SlurmDBD is installed. Install the slurmdbd, slurm-plugins, and slurm-sql RPMs on the computer when SlurmDBD is to execute. If you want munge authentication, which is highly recommended, you will also need to install the slurm-munge RPM."

so just installing slurmdbd, slurm-plugins, and slurm-sql works (yum localinstall), but as expected fails to start:

[2016-10-13T20:19:46.931] error: Couldn't find the specified plugin name for auth/munge looking at all files
[2016-10-13T20:19:46.931] error: cannot find auth plugin for auth/munge
[2016-10-13T20:19:46.931] error: cannot create auth context for auth/munge
[2016-10-13T20:19:46.931] fatal: Unable to initialize auth/munge authentication plugin

however it's not possible to cleanly install slurm-munge without slurm:

[root@ae-test01 ~]# yum localinstall rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm
Loaded plugins: fastestmirror
Examining rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm: slurm-munge-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package slurm-munge.x86_64 0:16.05.5-1.el7.centos will be installed
--> Processing Dependency: slurm for package: slurm-munge-16.05.5-1.el7.centos.x86_64
base    | 3.6 kB 00:00:00
extras  | 3.4 kB 00:00:00
updates | 3.4 kB 00:00:00
--> Finished Dependency Resolution
Error: Package: slurm-munge-16.05.5-1.el7.centos.x86_64 (/slurm-munge-16.05.5-1.el7.centos.x86_64)
       Requires: slurm
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

[root@ae-test01 ~]# yum localinstall rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm
Loaded plugins: fastestmirror
Examining rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm: slurm-munge-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-munge-16.05.5-1.el7.centos.x86_64.rpm to be installed
Examining rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm: slurm-16.05.5-1.el7.centos.x86_64
Marking rpmbuild/RPMS/x86_64/slurm-16.05.5-1.el7.centos.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package slurm.x86_64 0:16.05.5-1.el7.centos will be installed
---> Package slurm-munge.x86_64 0:16.05.5-1.el7.centos will be installed
--> Finished Dependency Resolution

Dependencies Resolved

 Package      Arch    Version               Repository                                 Size
Installing:
 slurm        x86_64  16.05.5-1.el7.centos  /slurm-16.05.5-1.el7.centos.x86_64         85 M
 slurm-munge  x86_64  16.05.5-1.el7.centos  /slurm-munge-16.05.5-1.el7.centos.x86_64   44 k

Transaction Summary
Install  2 Packages

Total size: 85 M
Installed size: 85 M
Is this ok [y/d/N]: y

So - is this just a broken spec file that sets unneeded dependencies, or are the docs wrong that you don't need to install slurm?

Andrew
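If it turns out the spec is simply too strict, the fix would presumably look something like this - an untested sketch against the 16.05 slurm.spec's %package munge block, and SchedMD may well have reasons for the hard dependency:

```
--- slurm.spec.orig
+++ slurm.spec
@@ %package munge @@
-Requires: slurm munge
+Requires: munge
```

The blunt short-term workaround on the dbd host is `rpm -ivh --nodeps slurm-munge-*.rpm`, at the cost of rpm no longer tracking the (possibly bogus) dependency.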
[slurm-dev] Re: Remote Visualization and Slurm
> If anyone has a working remote visualization cluster that integrates well
> with slurm, I would love to hear from you.

We're using 'strudel' https://www.massive.org.au/userguide/cluster-instructions/strudel and our local instructions are https://support.pawsey.org.au/documentation/display/US/Getting+started%3A+Remote+visualisation+with+Strudel

Andrew
[slurm-dev] Re: Cray Resource Utilization Reporting (RUR) via plugin
Thanks Danny,

Native is on our roadmap, but we're rolling out 14.11.x onto the production systems first. We'll see what RUR offers vs native when we get the next set of changes through our test/dev system.

Andrew
[slurm-dev] Cray Resource Utilization Reporting (RUR) via plugin
Hi All,

We're investigating the possibility of enabling RUR on our XC30's, with the end goal of integrating this into the slurmdbd for jobs. Is anyone else working on this? If not, is anyone else interested?

I know that there's already ./acct_gather_energy/cray/acct_gather_energy_cray.c but I don't see anything to interact with RUR.

Andrew
[slurm-dev] FlexLM integration - roughly how much work?
Hi Folks, At the Lugano meeting last year, SchedMD said that the Flexlm integration had come off the short term roadmap due to other features. We’re interested in the possibility of holding jobs until certain licences are available (hello ansys) rather than them running and failing. Can anyone speculate roughly how much work is involved to finish the current implementation? Is this feature of interest to any other users if we got it into a fork or pull-requestable branch? Andrew
[slurm-dev] Re: ReqNodeNotAvail instead of not accepting at all
> > Is there any way to configure slurm so that it won't accept jobs with
> > non-existent features?

Do you have EnforcePartLimits=YES in your config file?
[slurm-dev] Re: sbatch --array question and a tale of job and task confusion
I'll add that this is (most likely) being seen on slurm 2.6.6 on a Cray using ALPS. /me waves to Balt
[slurm-dev] Re: including config files
> That's a new feature in Slurm v14.11.

ah right, (digs out git blame, so it is) - is there any equivalent functionality or variable parsing in older (2.6.9 or 14.03) releases prior to Natan's patch?
[slurm-dev] including config files
Hi Folks,

According to the docs (http://slurm.schedmd.com/slurm.conf.html) it should be possible to have "include otherconfig.conf" in my slurm.conf, however I'd like to make this ${ClusterName}.conf - is it possible to do this?

I see that in src/common/parse_config.c there seems to be some sort of hook for this:

static char *_parse_for_format(s_p_hashtbl_t *f_hashtbl, char *path)
{
	char *filename = xstrdup(path);
	char *format = NULL;
	char *tmp_str = NULL;

	while (1) {
		if ((format = strstr(filename, "%c"))) { /* ClusterName */
			if (!s_p_get_string(&tmp_str, "ClusterName", f_hashtbl)) {
				error("%s: Did not get ClusterName for include "
				      "path", __func__);
				xfree(filename);
				break;
			}
			xstrtolower(tmp_str);

but I can't work out how to do this.

related, I also notice that it seems to need a fully qualified path - is there a flexible shorthand for the same directory as the existing config files?

Many thanks

Andrew
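Reading _parse_for_format() above, it looks as though an Include path containing "%c" may be expanded to the (lowercased, per xstrtolower) ClusterName. That's an untested guess inferred from the source rather than from the docs, but it would mean something like:

```
# slurm.conf -- "%c" substitution inferred from parse_config.c, untested
ClusterName=magnus
Include /etc/slurm/%c.conf    # would presumably read /etc/slurm/magnus.conf
```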
[slurm-dev] Re: Error: Unable to contact slurm controller
Hi Gerry,

> [2014-08-21T09:30:09.673] fatal: system has no usable batch compute nodes

We see this on our systems (running Slurm + ALPS/BASIL rather than native) when the slurmctld starts before the sdb has a list of batch nodes. It's bitten us when we've set the nodes to interactive rather than batch, and more regularly when we've restarted the sdb and slurmctld has started too early in the boot process (a quick 'service slurm restart' sorts that tho).

Andrew
[slurm-dev] --parsable(2) option for squeue / sinfo
Hi folks,

Wishlist item -- would it be easy to port the --parsable / --parsable2 flags into squeue and sinfo? From a very quick glance over the code, it seems that sreport and sacctmgr use common/print_fields.c but that's not used from squeue.

Many thanks

Andrew
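In the meantime the usual workaround is a -o/--format string with an explicit delimiter. The field letters below are just an example selection from squeue(1)/sinfo(1), and the commands themselves are commented since they need a live cluster:

```shell
# Parsable-ish squeue/sinfo output via explicit "|" delimiters:
#   squeue --noheader -o '%i|%u|%T|%j'    # jobid|user|state|name
#   sinfo  --noheader -o '%P|%a|%D|%T'    # partition|avail|nodes|state
# the fixed delimiter makes downstream parsing trivial, e.g.:
printf '1234|aelwell|RUNNING|myjob\n' | awk -F'|' '{print $2}'
```

It's not quite --parsable2 (no guarantee fields are free of "|"), which is presumably why a real flag would still be nice.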
[slurm-dev] Re: Pbs to slurmdbd
> You might take a look at the moab_2_slurmdb.pl script in
> contribs/slurmdb-direct.

Thanks - I figured that was a good start - my concern was the

> use lib qw(/home/da/slurm/1.3/

line in the code - I wasn't sure how much bitrot had set in to make it work with 2.6.x :-)
[slurm-dev] Pbs to slurmdbd
Hi folks,

We're migrating from PBS Pro to slurm mid CPU-accounting cycle. Since slurmdbd/sreport looks nicer than grepping through pbs logs for usage (no gold on this cluster), is there a way to populate slurmdbd records from pbs until we migrate? (I.e. has anyone done this already, rather than me coding from scratch?)

Andrew
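A first step, whatever the loading mechanism ends up being, is pulling usable fields out of the PBS accounting logs. A sketch assuming the usual semicolon-separated record format (timestamp;record-type;jobid;key=value ...); the sample record is illustrative, not from a real log:

```shell
# Extract jobid, user and walltime from PBS accounting "E" (job end)
# records -- raw material for whatever loads them into slurmdbd
# (e.g. via contribs/slurmdb-direct, adapted to taste).
parse_pbs_end() {
  awk -F';' '$2 == "E" {
      split($4, kv, " ")
      user = ""; wall = ""
      for (i in kv) {
          if (kv[i] ~ /^user=/)                    { sub(/^user=/, "", kv[i]); user = kv[i] }
          if (kv[i] ~ /^resources_used.walltime=/) { sub(/^[^=]*=/, "", kv[i]); wall = kv[i] }
      }
      print $3, user, wall
  }'
}

parse_pbs_end <<'EOF'
04/01/2014 10:00:00;E;1234.pbsserver;user=aelwell group=pawsey resources_used.walltime=01:02:03
EOF
```

From there the records would still need mapping onto slurmdbd's cluster/account/user associations, which is the part with no off-the-shelf answer.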
[slurm-dev] Installation onto an XT30
Hi Folks,

I'm trying to install slurm (2.6.2) onto our Cray XT30 -- I've been following the guide at http://slurm.schedmd.com/cray.html and Gerrit's paper from CUG11, but I've got a few questions about daemon placement and configuration.

1) we use eslogin nodes (and other external services) so the instructions to enable the cray job pam module will fail as /proc/cray/ -- we can (and have) enabled this on the internal service nodes -- is this an issue?

2) I'm basically replicating our current PBS Pro daemon placement -- using the mom nodes for the slurmd, and our aux node (currently running flexlm) as the slurmctld (I'm not using our sdb node as that can't mount the common NFS share with the slurm binaries/configs) -- is this the right approach?

3) we're using a central slurmdbd on a separate host (plan to get all the site accounting from the various clusters together) -- as we use LDAP for all our user details, is there a quick way to seed the initial 'sacctmgr add user <...>' stage already about (or do I hack up an ldapsearch script to do it)?

Many thanks

Andrew
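For question 3, the ldapsearch hack-up could be as small as this. The attribute name (uid), the objectClass filter, and the account=default mapping are all assumptions about the site's directory, and the generated commands should be reviewed before actually running them:

```shell
# Turn an ldapsearch LDIF dump into "sacctmgr add user" commands.
# (-i on sacctmgr means "commit immediately, don't prompt".)
ldif_to_sacctmgr() {
  awk '/^uid: / { printf "sacctmgr -i add user %s account=default\n", $2 }'
}

# real use would be roughly:
#   ldapsearch -x -LLL '(objectClass=posixAccount)' uid | ldif_to_sacctmgr | sh

ldif_to_sacctmgr <<'EOF'
dn: uid=aelwell,ou=People,dc=example,dc=org
uid: aelwell
EOF
```

Dropping the `| sh` and eyeballing the output first is probably wise on the initial seeding run.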