[slurm-dev] Offlining Faulty GPU?

2017-11-03 Thread Ryan Novosielski
ly clever that makes shorter work of that. Thanks! - -- || \\UTGERS, |--*O* ||_// the State |Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\of NJ | Office of Advan

[slurm-dev] RE: Selecting a network interface with srun

2017-10-26 Thread Ryan Novosielski
least for Infiniband/RDMA. - -- || \\UTGERS, |--*O*---- ||_// the State |Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\of NJ | Office of Advanced Res. Comp.

[slurm-dev] RE: Selecting a network interface with srun

2017-10-25 Thread Ryan Novosielski
e > nodes. (Sorry - my head is in PBSPro world these days so that would > be a resources_available in that world) > - -- || \\UTGERS, |--*O*---- ||_// the State |Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Techn

[slurm-dev] Re: Thoughts on GrpCPURunMins as primary constraint?

2017-07-25 Thread Ryan Novosielski
I appreciate the write ups, thanks. Anyone using GrpCPURunMins (or any of the similar ones really) have any unanticipated negatives to report? -- || \UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos

[slurm-dev] Re: srun CPU use

2017-07-05 Thread Ryan Novosielski
Maybe this has something to do with the requirement that ports be open to use srun (if you don't have that port open, it won't work at all)? Perhaps there is some limit on each port, etc.? From: Craig Yoshioka Sent: Wednesday, July 5, 2017 1:37:00 PM To:

[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Ryan Novosielski
SLURM has worked this way as long as I can remember. If you don't use scontrol reboot_nodes, nodes are "down" when they come back because SLURM wasn't notified about the reboot. This is configurable in slurm.conf. From: nico.faer...@id.unibe.ch Sent: Tue
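The slurm.conf setting referred to here is most likely `ReturnToService`. That is an assumption, since the truncated message does not name the parameter. A minimal sketch of the relevant fragment:

```conf
# slurm.conf fragment (sketch; assumes ReturnToService is the parameter meant above)
# 0: a DOWN node stays DOWN until an administrator changes its state (default)
# 1: a DOWN node returns to service on registration, but only if it was set DOWN for being non-responsive
# 2: a DOWN node returns to service on registration with a valid configuration
ReturnToService=2
```

With the default of 0, a node rebooted outside of Slurm's control stays down until someone resumes it by hand, for example with `scontrol update NodeName=<node> State=RESUME`.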

[slurm-dev] Re: cgroups and memory accounting

2017-04-21 Thread Ryan Novosielski
I actually meant to send this to my local tech support group and accidentally did not change the recipient, but I’m now glad I did as that is also useful information. Sorry for resurrecting the dead thread though. Thanks, Oliver! > On Apr 21, 2017, at 13:54, Oliver Freyermuth > wrote: > > >

[slurm-dev] Re: cgroups and memory accounting

2017-04-21 Thread Ryan Novosielski
*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `' > On Jan 22, 2016, at 05:33, Felip Moll wrote: >

[slurm-dev] Re: Cannot get SlurmDBD service running (missing mysql plugin)

2017-04-15 Thread Ryan Novosielski
This is not something you should need to do. If you have the appropriate libraries around, this should just happen. I have gone through this same thing before. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski

[slurm-dev] Re: Error messages: find_node_record: lookup failure when setting FQDN for compute nodes

2017-04-15 Thread Ryan Novosielski
Read this slurm.conf manual, under the parameters that start with Node. They discuss this situation. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu

[slurm-dev] Node Weight and Topology

2017-03-06 Thread Ryan Novosielski
| \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Re: Rename a SLURM account?

2017-02-13 Thread Ryan Novosielski
> On Feb 10, 2017, at 5:37 PM, Ryan Novosielski wrote: >> >> Hope someone has an idea here: >> >> We apparently accidentally named an account incorrectly at my organization. >> Trying to update it, I got the error message “Can’t modify the name of an

[slurm-dev] Rename a SLURM account?

2017-02-10 Thread Ryan Novosielski
. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Novosielski
---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \of NJ | Office of Advanced Research Computing - MSB C630, Newark

[slurm-dev] Re: mail job status to user

2017-01-15 Thread Ryan Novosielski
ail address, so you could share a script like this and it would work for anyone without modifying it. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu ||

[slurm-dev] Re: mail job status to user

2017-01-13 Thread Ryan Novosielski
as the ability to mail user about the job status automatically > when > job exit. Does Slurm has the same feature without using the SBATCH command > in the job submission? > > #SBATCH --mail-type=ALL > #SBATCH --mail-user=u...@example.com > > > Thanks for suggestions

[slurm-dev] Re: how to change cluster control address?

2017-01-10 Thread Ryan Novosielski
> On Jan 10, 2017, at 23:45, Riccardo Murri wrote: > > > (Ryan Novosielski, Tue, Jan 10, 2017 at 02:16:07PM -0800:) >> I actually just answered this question the other day. Take a look in >> the archives. The gist of it appears to be that slurmdbd will fix that

[slurm-dev] Re: how to change cluster control address?

2017-01-10 Thread Ryan Novosielski
the cluster control > address is changed back to 192.168.0.1. > > What am I doing wrong? > > Thanks, > R > > -- > Riccardo Murri, Schwerzenbacherstrasse 2, CH-8606 Nänikon, Switzerland -- || \\UTGERS, |---*O*---

[slurm-dev] slurm-dev Re: sacctmgr modify cluster controlhost?

2017-01-04 Thread Ryan Novosielski
kept getting set back to the wrong thing when I did finally try changing the database). > On Aug 9, 2016, at 6:55 PM, Ryan Novosielski > wrote: > > Is it really possible that no one has an answer to this? I guess I can start > looking through the source code, but I'd hope

[slurm-dev] Re: Advice for troubleshooting Weight= scheduling problem?

2016-12-13 Thread Ryan Novosielski
look? Recommended debug info I should turn on and look at? -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of

[slurm-dev] Advice for troubleshooting Weight= scheduling problem?

2016-12-08 Thread Ryan Novosielski
ll be added to the pool of nodes being considered for scheduling individually. The default value is 1. --- -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist

[slurm-dev] SLURM 15.08.12; disable sview?

2016-10-11 Thread Ryan Novosielski
process, even if I do have the dependencies available? I don’t see any config switch in the specfile, or any other obvious way to do it. TIA, -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-10-11 Thread Ryan Novosielski
Thanks for clearing that up. I was pretty sure there was no problem at all in using logrotate, and I know that restarting slurmctld does not ordinarily lose jobs. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski

[slurm-dev] Re: Accounting needs slurm daemon restart to apply changes

2016-10-11 Thread Ryan Novosielski
I suspect that you, like I, ended up with an incorrect "ControlHost" in "sacctmgr list clusters". This is the address that will be notified that a change has been made in the accounting database. I still haven't gotten a suggestion on how to fix it without losing my accounting data, though. :-

[slurm-dev] Re: Confusing JobState Reason for Pending due to TimeLimit

2016-09-20 Thread Ryan Novosielski
information as needed. -- || \\UTGERS, |-------*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Prolog script (maybe) question?

2016-09-14 Thread Ryan Novosielski
||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-01 Thread Ryan Novosielski
> Is it best to do a local setup then change the config once it is running > locally? > > -Chad > >> On Sep 1, 2016, at 3:07 PM, Ryan Novosielski wrote: >> >> Simply put: yes. That is our setup. >> >>> On Sep 1, 2016, at 3:50 PM, Chad Cropper

[slurm-dev] Re: external slurmdbd for multiple clusters

2016-09-01 Thread Ryan Novosielski
-*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-15 Thread Ryan Novosielski
> On Aug 14, 2016, at 19:51, Christopher Samuel wrote: > > > On 12/08/16 14:44, Ryan Novosielski wrote: > >> [pid 11767] open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such >> device or address) > > Have you tried passing the --pty opti

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-12 Thread Ryan Novosielski
> On Aug 12, 2016, at 5:09 PM, Ryan Novosielski wrote: > >> On Aug 12, 2016, at 12:32 PM, Ryan Novosielski wrote: >> >>> On Aug 12, 2016, at 12:16, Kilian Cavalotti >>> wrote: >>> >>> On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-12 Thread Ryan Novosielski
> On Aug 12, 2016, at 12:32 PM, Ryan Novosielski wrote: > >> On Aug 12, 2016, at 12:16, Kilian Cavalotti >> wrote: >> >> On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski >> wrote: >>> [pid 11767] >>> open("/sys/fs/cgroup/devices

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-12 Thread Ryan Novosielski
> On Aug 12, 2016, at 12:16, Kilian Cavalotti > wrote: > > On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski > wrote: >> [pid 11767] >> open("/sys/fs/cgroup/devices/slurm/uid_109366/job_5377709/devices.allow", >> O_WRONLY) = 10 >> [pid 11767]

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
0 [pid 11769] nanosleep({1, 0}, NULL) = 0 [pid 11769] nanosleep({1, 0}, NULL) = 0 [pid 11769] nanosleep({1, 0}, NULL) = 0 Anyone got any ideas? I guess I’m going to have to try this on a second system to make sure that something about the system itself is not the problem (it’s state

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
> On Aug 11, 2016, at 5:50 PM, Kilian Cavalotti > wrote: > > On Thu, Aug 11, 2016 at 12:46 PM, Ryan Novosielski > wrote: >> I’ll try adding the Gres debugging, but is there some way to figure out what >> this alleged device “819275” is (this number will change

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
nd adding DebugFlags=Gres > to your slurm.conf and look at what the Slurm controller logs. > > Cheers, > -- > Kilian > -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'
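For reference, the suggestion quoted above amounts to a single line in slurm.conf. A minimal sketch: only `DebugFlags=Gres` comes from the thread, and the note about re-reading the configuration is general practice rather than something stated in the message.

```conf
# slurm.conf fragment: ask Slurm to log generic resource (GRES) details
DebugFlags=Gres
```

The daemons have to re-read the configuration (for example via `scontrol reconfigure`) before the extra GRES lines appear in the slurmctld log.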

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
Just to eliminate some part of the pty process as the problem, I pulled the pty and changed the command from “bash -i” to “hostname”. Same hang. > On Aug 11, 2016, at 2:31 PM, Ryan Novosielski wrote: > > I’m also noticing that I sometimes see this after a few minutes: > >

[slurm-dev] Re: srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
] gres_cnt found:0 configured:0 avail:0 alloc:0 [2016-08-11T14:27:48.166] gres_bit_alloc:NULL [2016-08-11T14:27:48.166] gres_used:(null) > On Aug 11, 2016, at 2:11 PM, Ryan Novosielski wrote: > > Thanks very much for your reply — that part of the documentation was very

[slurm-dev] srun hanging when requesting --gres after cgroups configured, 15.08, CentOS 7

2016-08-11 Thread Ryan Novosielski
. I currently have /dev/nvidia* -- this covers everything including "nvidia" in the name. Are there maybe others that I'm missing? Let me know what would be helpful info to provide. Thank you! -- || \\UTGERS,|---*O*------- ||_//

[slurm-dev] Re: sacctmgr modify cluster controlhost?

2016-08-09 Thread Ryan Novosielski
r way! From: Ryan Novosielski Sent: Friday, August 5, 2016 2:44 AM To: slurm-dev Subject: [slurm-dev] Re: sacctmgr modify cluster controlhost? Hi all, I'd written about this some time ago -- I need to change the ControlHost for my cluster (it somehow got set to a machine that do

[slurm-dev] Re: sacctmgr modify cluster controlhost?

2016-08-04 Thread Ryan Novosielski
From: Ryan Novosielski Sent: Tuesday, May 10, 2016 11:36 AM To: slurm-dev Subject: [slurm-dev] sacctmgr modify cluster controlhost? Hi there, Using SLURM 15.08. Apparently our cluster ControlHost as shown by sacctmgr show cluster is incorrec

[slurm-dev] Re: Put node "idle" when node restart

2016-07-29 Thread Ryan Novosielski
You need to use the scontrol reboot_nodes functionality, or similar, to restart your nodes. I suspect you'll see "Node rebooted unexpectedly" in "sinfo -R" for these nodes when you reboot them. SLURM wasn't aware that this restart was going to happen and so treats it as a problem. As someone al

[slurm-dev] Re: SOLVED in 16.05.2: Re: Failed to build 16.05.1 RPMs on CentOS 5

2016-07-08 Thread Ryan Novosielski
Fixed in 16.05.2! Thanks to those involved. From: Ryan Novosielski Sent: Wednesday, June 29, 2016 11:55 PM To: slurm-dev Subject: Re: [slurm-dev] Re: Failed to build 16.05.1 RPMs on CentOS 5 Hi folks, This appears to not be fixed in 16.05.1 somehow; I

[slurm-dev] Re: Failed to build 16.05.1 RPMs on CentOS 5

2016-06-29 Thread Ryan Novosielski
Hi folks, This appears to not be fixed in 16.05.1 somehow; I'm guessing fixed in some places, missed here? make[4]: Leaving directory `/usr/src/redhat/BUILD/slurm-16.05.1/src/db_api' mv -f .deps/print.Tpo .deps/print.Po /bin/sh ../../libtool --tag=CC --mode=link gcc -O2 -g -pipe -Wall -Wp,

[slurm-dev] Failed to build 16.05 RPMs on CentOS 5

2016-06-16 Thread Ryan Novosielski
/slurm-16.05.0' make: *** [all] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.94019 (%build) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.94019 (%build) -- || \\UTGERS, |---*O*--- ||_// the State |

[slurm-dev] Re: slurm package for Trusty?

2016-06-15 Thread Ryan Novosielski
---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Increase size of running job/correcting incorrect resource allocations?

2016-05-24 Thread Ryan Novosielski
anyone has any other ideas. Thanks! -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced

[slurm-dev] Re: slurm equivalent of moab mdiag -p

2016-05-19 Thread Ryan Novosielski
tly you have to be using the multi-factor priority plugin. I’m assuming that the job also has to be invoking some part of the multi-factor priority, as on my other system, since it only shows one of many jobs running. -- || \\UTGERS, |---*O*----

[slurm-dev] Re: NUMA Nodes vs. Sockets

2016-05-19 Thread Ryan Novosielski
what SLURM thinks you ought to have in slurm.conf, you can do “slurmd -C” from a node. -- || \\UTGERS, |---*O*------- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~

[slurm-dev] Re: Difficulty using reboot_nodes or similar for maintenance, SLURM 15.08

2016-05-10 Thread Ryan Novosielski
se NHC. -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] RE: sacctmgr modify cluster controlhost?

2016-05-10 Thread Ryan Novosielski
6 12:22 PM, Lipari, Don wrote: > You might consider this fix: > > https://github.com/SchedMD/slurm/commit/79c9a49913625a0f790f29897f0f099156f94268 > > It was committed to the slurm-15.08 branch yesterday. > > Don Lipari > >> -Original Message- From: Ryan Nov

[slurm-dev] sacctmgr modify cluster controlhost?

2016-05-10 Thread Ryan Novosielski
-*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Re: Difficulty using reboot_nodes or similar for maintenance, SLURM 15.08

2016-05-10 Thread Ryan Novosielski
On 05/10/2016 02:39 AM, Marcin Stolarek wrote: > > > 2016-05-07 6:43 GMT+02:00 Ryan Novosielski <novos...@rutgers.edu>: > > > Hi all, > > What I want to do is to be able to use reboot_nodes as it is > described in the manual. Th

[slurm-dev] Re: Local Disk

2016-05-09 Thread Ryan Novosielski
mable Resources, but I have yet to find a way > to specify local disk space in sbatch. This could be a moving target > depending how well jobs are cleaning up after a run. Maybe this is --tmp? -- || \\UTGERS, |---*O*------- ||_//

[slurm-dev] Re: SLURM accounting of "Queued time" on 15.08

2016-05-07 Thread Ryan Novosielski
On 05/07/2016 08:30 PM, Chris Samuel wrote: > > On Friday, 6 May 2016 10:15:10 PM AEST Ryan Novosielski wrote: > >> I was looking in the sacct man pages for some field that contains the >> "Queued time" for a job, as seen in the job status e-mails that are >>

[slurm-dev] SLURM accounting of "Queued time" on 15.08

2016-05-06 Thread Ryan Novosielski
ee it as one of the job accounting fields. Apologies in advance for wasting your time if I've just missed it. Seems like a useful metric. Thanks! -- || \\UTGERS,|---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.

[slurm-dev] Difficulty using reboot_nodes or similar for maintenance, SLURM 15.08

2016-05-06 Thread Ryan Novosielski
| \\UTGERS,|---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'

[slurm-dev] Re: What cluster provisioning system do you use?

2016-03-15 Thread Ryan Novosielski
> On Mar 15, 2016, at 08:44, Chris Samuel wrote: > >> On Tue, 15 Mar 2016 05:40:29 AM Bjørn-Helge Mevik wrote: >> >> I apologize for the slightly off-topic subject, but I could not think of >> a better forum to ask. If you know of a more proper place to ask this, >> I'd be happy to know about

[slurm-dev] Re: Slurm on CentOS 7.x

2016-03-12 Thread Ryan Novosielski
Firewall on the slurmctld server. -- *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* || \\UTGERS |-*O*- ||_// Biomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922

[slurm-dev] Re: Patch for health check during slurmd start

2016-03-10 Thread Ryan Novosielski
s to know. I believe NHC on our system is not configured to take any action on a node that was drained by anything other than itself. I don't recall whether that is configurable, or just the way it works. -- *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* || \\UTGERS |-*O*- ||_// Biomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922) || \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark `'

[slurm-dev] Re: Patch for health check during slurmd start

2016-03-09 Thread Ryan Novosielski
s. -- *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences* || \\UTGERS |-----*O*----- ||_// Biomedical | Ryan Novosielski - Senior Technologist || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922) || \\ Sciences | OIRT/High Perf & Res C

[slurm-dev] RE: Amber + MVAPICH2 slower with SLURM vs PBS

2015-02-11 Thread Ryan Novosielski
possible that will solve the > problem, but this is still peculiar. > > -- *Note: UMDNJ is now Rutgers-Biomedical and Health > Sciences* || \\UTGERS > |-*O*- ||_// Biomedical | > Ryan Novosielski - Senior Tec