ly clever that makes shorter work of that.
Thanks!
- --
|| \\UTGERS, |---*O*---
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
|| \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
     `'
least for Infiniband/RDMA.
e
> nodes. (Sorry - my head is in PBSPro world these days so that would
> be a resources_available in that world)
>
I appreciate the write ups, thanks.
Anyone using GrpCPURunMins (or any of the similar ones really) have any
unanticipated negatives to report?
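For anyone unfamiliar with the knob being asked about, a hedged sketch of how such a limit is typically set with sacctmgr (the account name and the number are made-up examples; GrpCPURunMins caps the CPU-minutes remaining across an association's running jobs, which throttles how much long-running work can be in flight at once):

```shell
# Hypothetical account "chem"; 1,440,000 CPU-run-minutes is roughly
# 1000 cores' worth of jobs with 24 hours left to run.
sacctmgr modify account chem set GrpCPURunMins=1440000
```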
Maybe this has something to do with the requirement that ports be open in order to use srun (if the port isn't open, srun won't work at all)? Perhaps there is some per-port limit, etc.?
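If it is a firewall issue, slurm.conf lets you pin the ports srun listens on to a known range so they can be opened explicitly; a sketch, with an arbitrary example range:

```shell
# slurm.conf fragment -- restrict srun's listening ports so the
# firewall can allow them; the range itself is an arbitrary example.
SrunPortRange=60001-63000
```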
From: Craig Yoshioka
Sent: Wednesday, July 5, 2017 1:37:00 PM
To:
SLURM has worked this way as long as I can remember. If you don't use scontrol
reboot_nodes, nodes are "down" when they come back because SLURM wasn't
notified about the reboot. This is configurable in slurm.conf.
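The slurm.conf parameter in question is, as far as I recall, ReturnToService; a minimal sketch of the relevant line:

```shell
# slurm.conf fragment -- with ReturnToService=1, a DOWN node that
# registers with a valid configuration is returned to service
# automatically after an unannounced reboot.
ReturnToService=1
```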
From: nico.faer...@id.unibe.ch
Sent: Tue
I actually meant to send this to my local tech support group and accidentally
did not change the recipient, but I’m now glad I did as that is also useful
information. Sorry for resurrecting the dead thread though.
Thanks, Oliver!
> On Apr 21, 2017, at 13:54, Oliver Freyermuth
> wrote:
>
>
>
> On Jan 22, 2016, at 05:33, Felip Moll wrote:
>
This is not something you should need to do. If you have the appropriate
libraries around, this should just happen. I have gone through this same thing
before.
Read the slurm.conf manual, under the parameters whose names start with "Node". They discuss this situation.
> On Feb 10, 2017, at 5:37 PM, Ryan Novosielski wrote:
>>
>> Hope someone has an idea here:
>>
>> We apparently accidentally named an account incorrectly at my organization.
>> Trying to update it, I got the error message “Can’t modify the name of an
>&
.
ail
address, so you could share a script like this and it would work for anyone
without modifying it.
as the ability to mail the user about the job status automatically
> when
> the job exits. Does Slurm have the same feature without using the SBATCH
> directives in the job submission?
>
> #SBATCH --mail-type=ALL
> #SBATCH --mail-user=u...@example.com
>
>
> Thanks for suggestions
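If the question is how to get the notification without putting directives in the script itself: the same options can be supplied on the sbatch command line instead. A sketch (the script name and address are placeholders):

```shell
# Equivalent to the #SBATCH directives, but passed at submission time;
# job.sh and the address are placeholders.
sbatch --mail-type=ALL --mail-user=user@example.com job.sh
```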
> On Jan 10, 2017, at 23:45, Riccardo Murri wrote:
>
>
> (Ryan Novosielski, Tue, Jan 10, 2017 at 02:16:07PM -0800:)
>> I actually just answered this question the other day. Take a look in
>> the archives. The gist of it appears to be that slurmdbd will fix that
&
the cluster control
> address is changed back to 192.168.0.1.
>
> What am I doing wrong?
>
> Thanks,
> R
>
> --
> Riccardo Murri, Schwerzenbacherstrasse 2, CH-8606 Nänikon, Switzerland
kept getting set back to the
wrong thing when I did finally try changing the database).
> On Aug 9, 2016, at 6:55 PM, Ryan Novosielski
> wrote:
>
> Is it really possible that no one has an answer to this? I guess I can start
> looking through the source code, but I'd hope
look? Recommended debug info I should turn on and look at?
ll be added to the
pool of nodes being considered for scheduling individually. The default value
is 1.
---
process, even if I do have
the dependencies available? I don’t see any config switch in the specfile, or
any other obvious way to do it.
TIA,
Thanks for clearing that up. I was pretty sure there was no problem at all in
using logrotate, and I know that restarting slurmctld does not ordinarily lose
jobs.
I suspect that you, like me, ended up with an incorrect "ControlHost" in
"sacctmgr list clusters". That is the address that gets notified when a
change has been made in the accounting database.
I still haven't gotten a suggestion on how to fix it without losing my
accounting data, though. :-
information as needed.
> Is it best to do a local setup then change the config once it is running
> locally?
>
> -Chad
>
>> On Sep 1, 2016, at 3:07 PM, Ryan Novosielski wrote:
>>
>> Simply put: yes. That is our setup.
>>
>>> On Sep 1, 2016, at 3:50 PM, Chad Cropper
> On Aug 14, 2016, at 19:51, Christopher Samuel wrote:
>
>
> On 12/08/16 14:44, Ryan Novosielski wrote:
>
>> [pid 11767] open("/dev/tty", O_RDWR|O_NOCTTY|O_NONBLOCK) = -1 ENXIO (No such
>> device or address)
>
> Have you tried passing the --pty opti
> On Aug 12, 2016, at 5:09 PM, Ryan Novosielski wrote:
>
>> On Aug 12, 2016, at 12:32 PM, Ryan Novosielski wrote:
>>
>>> On Aug 12, 2016, at 12:16, Kilian Cavalotti
>>> wrote:
>>>
>>> On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski
> On Aug 12, 2016, at 12:32 PM, Ryan Novosielski wrote:
>
>> On Aug 12, 2016, at 12:16, Kilian Cavalotti
>> wrote:
>>
>> On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski
>> wrote:
>>> [pid 11767]
>>> open("/sys/fs/cgroup/devices
> On Aug 12, 2016, at 12:16, Kilian Cavalotti
> wrote:
>
> On Thu, Aug 11, 2016 at 9:44 PM, Ryan Novosielski
> wrote:
>> [pid 11767]
>> open("/sys/fs/cgroup/devices/slurm/uid_109366/job_5377709/devices.allow",
>> O_WRONLY) = 10
>> [pid 11767]
0
[pid 11769] nanosleep({1, 0}, NULL) = 0
[pid 11769] nanosleep({1, 0}, NULL) = 0
[pid 11769] nanosleep({1, 0}, NULL) = 0
Anyone got any ideas? I guess I’m going to have to try this on a second system
to make sure that something about the system itself is not the problem (it’s
state
> On Aug 11, 2016, at 5:50 PM, Kilian Cavalotti
> wrote:
>
> On Thu, Aug 11, 2016 at 12:46 PM, Ryan Novosielski
> wrote:
>> I’ll try adding the Gres debugging, but is there some way to figure out what
>> this alleged device “819275” is (this number will change
nd adding DebugFlags=Gres
> to your slurm.conf and look at what the Slurm controller logs.
>
> Cheers,
> --
> Kilian
>
Just to eliminate some part of the pty process as the problem, I pulled the pty
and changed the command from “bash -i” to “hostname”. Same hang.
> On Aug 11, 2016, at 2:31 PM, Ryan Novosielski wrote:
>
> I’m also noticing that I sometimes see this after a few minutes:
>
>
] gres_cnt found:0 configured:0 avail:0 alloc:0
[2016-08-11T14:27:48.166] gres_bit_alloc:NULL
[2016-08-11T14:27:48.166] gres_used:(null)
> On Aug 11, 2016, at 2:11 PM, Ryan Novosielski wrote:
>
> Thanks very much for your reply — that part of the documentation was very
. I currently have /dev/nvidia* -- this covers everything including
"nvidia" in the name. Are there maybe others that I'm missing?
Let me know what would be helpful info to provide. Thank you!
r way!
From: Ryan Novosielski
Sent: Friday, August 5, 2016 2:44 AM
To: slurm-dev
Subject: [slurm-dev] Re: sacctmgr modify cluster controlhost?
Hi all,
I'd written about this some time ago -- I need to change the ControlHost for my
cluster (it somehow got set to a machine that do
______
From: Ryan Novosielski
Sent: Tuesday, May 10, 2016 11:36 AM
To: slurm-dev
Subject: [slurm-dev] sacctmgr modify cluster controlhost?
Hi there,
Using SLURM 15.08. Apparently our cluster ControlHost, as shown by
sacctmgr show cluster, is incorrect
You need to use the scontrol reboot_nodes functionality, or similar, to restart
your nodes. I suspect you'll see "Node rebooted unexpectedly" in "sinfo -R" for
these nodes when you reboot them. SLURM wasn't aware that this restart was
going to happen and so treats it as a problem. As someone al
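A sketch of the invocation described above, so a restart is announced to the controller ahead of time (the nodelist is a placeholder):

```shell
# Tell slurmctld these nodes will reboot, so they are not marked as
# having failed when they come back; node[01-04] is a placeholder.
scontrol reboot_nodes node[01-04]
```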
Fixed in 16.05.2! Thanks to those involved.
From: Ryan Novosielski
Sent: Wednesday, June 29, 2016 11:55 PM
To: slurm-dev
Subject: Re: [slurm-dev] Re: Failed to build 16.05.1 RPMs on CentOS 5
Hi folks,
This appears to not be fixed in 16.05.1 somehow; I'm guessing fixed in some
places, missed here?
make[4]: Leaving directory `/usr/src/redhat/BUILD/slurm-16.05.1/src/db_api'
mv -f .deps/print.Tpo .deps/print.Po
/bin/sh ../../libtool --tag=CC --mode=link gcc -O2 -g -pipe -Wall
-Wp,
/slurm-16.05.0'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.94019 (%build)
RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.94019 (%build)
anyone has any other ideas. Thanks!
tly you have to be using the multi-factor
priority plugin. I’m assuming that the job also has to be invoking some part of
the multi-factor priority, as on my other system, since it only shows one of
many jobs running.
what SLURM thinks you
ought to have in slurm.conf, you can do “slurmd -C” from a node.
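To illustrate the command mentioned above:

```shell
# Run on a compute node: prints the NodeName= line (CPUs, sockets,
# cores, threads, memory) that slurmd detects on the local hardware,
# suitable for comparing against or pasting into slurm.conf.
slurmd -C
```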
se NHC.
6 12:22 PM, Lipari, Don wrote:
> You might consider this fix:
>
> https://github.com/SchedMD/slurm/commit/79c9a49913625a0f790f29897f0f099156f94268
>
> It was committed to the slurm-15.08 branch yesterday.
>
> Don Lipari
>
>> -Original Message- From: Ryan Nov
On 05/10/2016 02:39 AM, Marcin Stolarek wrote:
>
>
> 2016-05-07 6:43 GMT+02:00 Ryan Novosielski <novos...@rutgers.edu>:
>
>
> Hi all,
>
> What I want to do is to be able to use reboot_nodes as it is
> described in the manual. Th
mable Resources, but I have yet to find a way
> to specify local disk space in sbatch. This could be a moving target
> depending how well jobs are cleaning up after a run.
Maybe this is --tmp?
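A sketch of the flag being suggested (script name and size are placeholders; as I understand it, --tmp requests a minimum amount of temporary disk on the node, matched against the TmpDisk configured for each node):

```shell
# Request at least 10 GB of temporary disk space on the allocated node;
# job.sh is a placeholder script name.
sbatch --tmp=10G job.sh
```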
On 05/07/2016 08:30 PM, Chris Samuel wrote:
>
> On Friday, 6 May 2016 10:15:10 PM AEST Ryan Novosielski wrote:
>
>> I was looking in the sacct man pages for some field that contains the
>> "Queued time" for a job, as seen in the job status e-mails that are
>>
ee it as one of the job
accounting fields. Apologies in advance for wasting your time if I've
just missed it. Seems like a useful metric.
Thanks!
> On Mar 15, 2016, at 08:44, Chris Samuel wrote:
>
>> On Tue, 15 Mar 2016 05:40:29 AM Bjørn-Helge Mevik wrote:
>>
>> I apologize for the slightly off-topic subject, but I could not think of
>> a better forum to ask. If you know of a more proper place to ask this,
>> I'd be happy to know about
Firewall on the slurmctld server.
--
*Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |-*O*-
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922
s to
know.
I believe NHC on our system is not configured to take any action on a node that
was drained by anything other than itself. I don't recall whether that is
configurable or just the way it works.
s.
possible that will solve the
> problem, but this is still peculiar.
>