Re: [slurm-users] job submit location :: restricted to HOME?

2021-03-03 Thread Brian Andrus
Looks like the job ran. You should look at the output logs. My guess: The node the job ran on does not have access to that path. Log on to that node and check it out. Brian Andrus On 3/3/2021 1:21 AM, Adrian Sevcenco wrote: Hi! I just encountered the situation that i cannot submit jobs from

Re: [slurm-users] fix missing accounting entries

2021-03-01 Thread Brian Andrus
runaway: sacctmgr show RunawayJobs *From:* slurm-users on behalf of Brian Andrus *Sent:* Monday, March 1, 2021 11:14 AM *To:* slurm-users@lists.schedmd.com *Subject:* [slurm-users] fix missing accounting entries All

[slurm-users] fix missing accounting entries

2021-03-01 Thread Brian Andrus
All, IIRC, there was a command that would repair the accounting tables when a job had no endtime. I can't seem to find the info for that. Does anyone recall such a thing? Brian Andrus

Re: [slurm-users] prolog not passing env var to job

2021-02-12 Thread Brian Andrus
Your prolog script is run by/as the same user as slurmd, so any environment variables you set there will not be available to the job being run. See: https://slurm.schedmd.com/prolog_epilog.html for info. Brian Andrus On 2/12/2021 1:27 PM, mercan wrote: Hi; Prolog and TaskProlog

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
try: export SLURM_OVERLAP=1 export SLURM_WHOLE=1 before your salloc and see if that helps. I have seen some mpi issues that were resolved with that. You can also try it using just the regular mpirun on the nodes allocated. That will help with a datapoint as well. Brian Andrus On 2/4/2021
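A minimal sketch of the suggestion above, assuming a POSIX-like shell. The wrapper name `with_overlap` is hypothetical; `SLURM_OVERLAP` and `SLURM_WHOLE` are the variables named in the message, and the salloc arguments in the comment are placeholders.

```shell
# Hypothetical wrapper: run one command (for example salloc) with the two
# variables from the message set only for that command's environment.
with_overlap() {
    env SLURM_OVERLAP=1 SLURM_WHOLE=1 "$@"
}

# Example (needs a Slurm cluster; the node count is a placeholder):
#   with_overlap salloc -N 2
```

Using `env` keeps the variables out of the login shell, so the effect is easy to undo when comparing runs with and without them.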

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
Did you compile slurm with mpi support? Your mpi libraries should be the same as that version and they should be available in the same locations for all nodes. Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc are set) Brian Andrus On 2/4/2021 1:20 PM, Andrej Prsa wrote: Gentle

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-02-03 Thread Brian Andrus
they can do a thing doesn't mean they should do a thing. There are many ways to achieve what is desired, most of which do not require anyone other than the system admin. If your issue can be solved without affecting others, leave them alone and fix your issue. Brian Andrus

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
it, slurm assumes all memory on the node for the job. So, even if you are only using 1 cpu, all the memory is allocated, leaving none for any other job to run on the unallocated cpus. Brian Andrus On 1/28/2021 2:15 PM, Chandler wrote: Brian Andrus wrote on 1/28/21 13:59: What

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
You are getting close :) You can see why n010 is able to have multiple jobs. It shows more resources available. What are the specific requests for resources from a job? Nodes, Cores, Memory, threads, etc? Brian Andrus On 1/28/2021 12:52 PM, Chandler wrote: OK I'm getting this same output

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
Ahh. On one of the new nodes do: slurmd -C The output of that will tell you what those settings should be. I suspect they are off, which forces them into drain mode. Brian Andrus On 1/28/2021 12:25 PM, Chandler wrote: Andy Riebs wrote on 1/28/21 07:53: If the only changes to your system
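As a rough illustration of comparing the `slurmd -C` output against slurm.conf, the sketch below pulls a single KEY=value field out of a node-definition line. The helper name `node_field` is hypothetical, and the sample line in the comment is an assumed shape of the output, not text taken from the thread.

```shell
# Hypothetical helper: extract one KEY=value field from a node-definition
# line, such as the first line printed by `slurmd -C`.
node_field() {
    # $1 = key (e.g. RealMemory), $2 = the line of text to search
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Example, run on the node itself:
#   node_field RealMemory "$(slurmd -C | head -n 1)"
# Assumed line shape: NodeName=n011 CPUs=16 RealMemory=64000
```

The extracted value can then be compared by eye with the matching field in the node's entry in slurm.conf.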

Re: [slurm-users] only 1 job running

2021-01-28 Thread Brian Andrus
Heh. Your nodes are drained. do: scontrol update state=resume nodename=n[011-013] If they go back into a drained state, you need to look into why. That will be in the slurmctld log. You can also see it with 'sinfo -R' Brian Andrus On 1/27/2021 10:18 PM, Chandler wrote: Made a little bit
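The resume step above can be wrapped so it is easy to dry-run first. This is only a sketch: `resume_nodes` and the `SCONTROL` override are hypothetical names, while the `scontrol update` arguments and `sinfo -R` are the ones given in the message.

```shell
# Hypothetical helper: resume a node list with the scontrol command from the
# message. Setting SCONTROL=echo prints the arguments instead of running them.
resume_nodes() {
    # $1 = node list, e.g. 'n[011-013]'
    ${SCONTROL:-scontrol} update state=resume "nodename=$1"
}

# Example:
#   sinfo -R                              # see why the nodes were drained
#   SCONTROL=echo resume_nodes 'n[011-013]'   # dry run
#   resume_nodes 'n[011-013]'             # real run, as an administrator
```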

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-27 Thread Brian Andrus
have been able to deploy completely to cloud using only slurm. It has the ability to integrate into any cloud cli, so nothing else has been needed. Just for the heck of it, I am thinking of integrating it into Terraform, although not necessary. Brian Andrus On 1/26/2021 11:48 AM, Robert Kudyba

Re: [slurm-users] Using "Environment Modules"

2021-01-26 Thread Brian Andrus
The net effect is that the environment gets setup the same as if the user had opened a shell console. Brian Andrus On 1/26/2021 2:13 AM, Gestió Servidors wrote: Hi, My environment is this: * Users are using “bash” as the default shell * A sample of one of my environment modules

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Brian Andrus
customers for Tim to keep things running as well as he has. I'm pretty sure most folks that use slurm for any period of time have received more value than a small support contract would be. Brian Andrus On 1/25/2021 7:35 AM, Jeffrey T Frey wrote: ...I would say having SLURM rpms in EPEL could be very

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Brian Andrus
You would need to have a direct connect/vpn so the cloud nodes can connect to your head node. Brian Andrus On 1/22/2021 10:37 AM, Sajesh Singh wrote: We are looking at rolling out cloud bursting to our on-prem Slurm cluster and I am wondering how to deal with the slurm.conf variable

Re: [slurm-users] cpu core exclusion?

2021-01-20 Thread Brian Andrus
We would need more information. At a minimum, what client is it? As this is not a slurm issue, you would need to dig into what is causing that behavior with your storage system. Brian Andrus On 1/20/2021 10:53 AM, John McCulloch wrote: Our shared storage client daemon is utilizing 100

Re: [slurm-users] Parent account in AllowAccounts

2021-01-15 Thread Brian Andrus
mean their child can :) Brian Andrus On 1/15/2021 6:38 AM, Durai Arasan wrote: Hi, As you know for each partition you can specify AllowAccounts=account1,account2... I have a parent account say "parent1" with two child accounts "child1" and "child2" I expected that

Re: [slurm-users] Burst to AWS cloud

2020-12-15 Thread Brian Andrus
over a direct-connect or VPN. Brian Andrus On 12/15/2020 12:02 PM, Sajesh Singh wrote: We are currently investigating the use of the cloud scheduling features within an on-site Slurm installation and was wondering if anyone had any experiences that they wish to share of trying to use

Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Brian Andrus
Check your hosts file and ensure 'localhost' does not have an IPV6 address associated with it. Brian Andrus On 12/14/2020 4:19 PM, Alpha Experiment wrote: Hi, I am trying to run slurm on Fedora 33. Upon boot the slurmd daemon is running correctly; however the slurmctld daemon always errors
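One way to carry out the hosts-file check is sketched below. The function name `ipv6_localhost_lines` is hypothetical, and treating any address that contains a colon as IPv6 is a simplifying assumption.

```shell
# Hypothetical helper: print hosts-file lines that map an IPv6 address
# (here: any address containing ':') to the name localhost.
ipv6_localhost_lines() {
    # $1 = hosts file to inspect, defaults to /etc/hosts
    awk '$1 !~ /^#/ && $1 ~ /:/ {
             for (i = 2; i <= NF; i++)
                 if ($i == "localhost") { print; break }
         }' "${1:-/etc/hosts}"
}

# Example:
#   ipv6_localhost_lines            # no output means no IPv6 localhost entry
```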

Re: [slurm-users] Trouble installing slurm-20.02.4-1.amzn2.x86_64 libnvidia-ml.so.1

2020-12-05 Thread Brian Andrus
That package looks to be built for a system with an nvidia gpu installed. Look for (or build) different packages if you are not going to use a gpu-based node. Brian Andrus On 12/4/2020 11:32 AM, Mullen, Drew wrote: Howdy Im getting this error installing slurm 20.02.4: Error: Package

[slurm-users] MinJobAge

2020-11-23 Thread Brian Andrus
in a completed state for a period of time, but they are not showing up at all on our cluster. How does one have jobs show up that are completed? Brian Andrus

Re: [slurm-users] Using hyperthreaded processors

2020-11-04 Thread Brian Andrus
to more fetches, wasting effort. This is a VERY simplistic description, but the point is that hyperthreading is not a silver bullet that will improve HPC performance if you are maximizing your resource utilization. Ok, I will get off my soapbox :) Brian Andrus On 11/4/2020 7:30 AM, Jean

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Brian Andrus
packages. Source control for me is just that spec file. Brian Andrus On 10/20/2020 8:46 AM, Michael Jennings wrote: On Tuesday, 20 October 2020, at 15:49:25 (+0800), Kevin Buckley wrote: On 2020/10/20 11:50, Christopher Samuel wrote: I forgot I do have access to a SLES15 SP1 system, that has

Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus
do you have your gres.conf on the nodes also? Brian Andrus On 10/8/2020 11:57 AM, Sajesh Singh wrote: Slurm 18.08 CentOS 7.7.1908 I have 2 M500 GPUs in a compute node which is defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs the environment

Re: [slurm-users] error: user not found

2020-09-29 Thread Brian Andrus
een places where that can take 24 hours. Brian Andrus On 9/29/2020 6:18 AM, Diego Zuccato wrote: Hello all. One of the users is unable to submit jobs to our cluster. The first time he tries, he gets $ sbatch test.job sbatch: fatal: Invalid user id: 621049927 then: $ sbatch test.job sbatch: er

Re: [slurm-users] Quickly throttling/limiting a specific user's jobs

2020-09-22 Thread Brian Andrus
on the node waiting to be resumed, but the node resources may get assigned to other jobs while they wait to resume. Brian Andrus On 9/22/2020 2:33 PM, Ransom, Geoffrey M. wrote: Hello    We had a user post a large number of array jobs with a short actual run time (20-80 seconds, but mostly

Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Brian Andrus
both. I do high debug to the journal and info to the log file. Brian Andrus On 9/8/2020 2:41 AM, Gestió Servidors wrote: Hello, I don’t know why, but my SLURM server (that is running fine) has its slurmdctl.log file with size 0 bytes... so... where is writting logs? It seems that log file has

Re: [slurm-users] Alternatives for MailProg

2020-08-28 Thread Brian Andrus
That is where you have it call a bash script and within the script you do as needed. Like Ahmet's suggested script. So use his as a template and add the headers you desire. Brian Andrus On 8/28/2020 11:36 AM, Chris Samuel wrote: On 8/27/20 3:42 pm, Brian Andrus wrote: Actually, you can add

Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Brian Andrus
to schedule them in that fashion outweighs the resources needed by far. Brian Andrus On 8/28/2020 3:30 AM, navin srivastava wrote: Hi Team, facing one issue. several users submitting 2 job in a single batch job which is very short jobs( says 1-2 sec). so while submitting more job slurmctld

Re: [slurm-users] Alternatives for MailProg

2020-08-27 Thread Brian Andrus
Actually, you can add headers of all kinds: Quick search of "sendmail add headers" discovers: https://serverfault.com/questions/347602/sending-e-mail-from-sendmail-with-headers Brian Andrus On 8/26/2020 10:02 PM, Andrew Elwell wrote: Hi folks, I'm getting fed up receiving out

Re: [slurm-users] [EXT] Slurmd problem on client

2020-08-24 Thread Brian Andrus
IIRC, that is because it is trying to do the 'configless' feature of slurm 20 where it uses DNS entries to find the config. This will happen if /etc/slurm.conf does not exist on the node. Check that you have that and that it is the same as the one on the master. Brian Andrus On 8/24/2020 7
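A small sketch of the check suggested above, assuming the controller's slurm.conf has already been copied to the node. The helper name `same_config` and the path /tmp/slurm.conf.master are hypothetical placeholders; /etc/slurm.conf is the path given in the message.

```shell
# Hypothetical helper: succeed only if both files exist and are
# byte-for-byte identical.
same_config() {
    # $1 = the node's slurm.conf, $2 = a copy of the controller's slurm.conf
    [ -f "$1" ] && [ -f "$2" ] && cmp -s "$1" "$2"
}

# Example:
#   same_config /etc/slurm.conf /tmp/slurm.conf.master \
#       && echo "configs match" || echo "missing or different"
```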

Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-18 Thread Brian Andrus
they will wait a relatively shorter amount of time. There are numerous other factors you can use. If you have accounting and associations configured, you can manipulate it all the way to the association and qos. Brian Andrus On 8/17/2020 11:23 PM, Gerhard Strangar wrote: Brian Andrus wrote: Most likely, b

Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Brian Andrus
, the devil is in the details on how to define/get what you want. Brian Andrus On 8/17/2020 10:13 AM, Gerhard Strangar wrote: Hello, I'm wondering if it's possible to have slurm 19 run two partitions (low and high prio) that share all the nodes and limit the high prio partition in number of nodes

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Brian Andrus
This is very likely by design of the cluster and/or network. Otherwise users could use the cluster to mine bitcoin and such. Brian Andrus On 8/2/2020 7:11 AM, Mahmood Naderan wrote: I thought that maybe srun doesn't transfer all settings from the head node to the compute node. The wget

Re: [slurm-users] know time limit from inside job

2020-07-27 Thread Brian Andrus
lua, if I may ask? Brian Andrus On 7/27/2020 9:52 AM, Baer, Troy wrote: There's an outstanding feature request for that: https://bugs.schedmd.com/show_bug.cgi?id=8383 While waiting on that, we've taken to injecting it into the job's environment ourselves in the Lua submit filter. --Troy

[slurm-users] know time limit from inside job

2020-07-27 Thread Brian Andrus
calls as too many of them can tip a system over. Brian Andrus

Re: [slurm-users] Running two multiprocessing jobs in one sbatch

2020-07-25 Thread Brian Andrus
Is there a reason to run them as a single job? It may be easier to just have 2 separate jobs of 16 cores each. If there are dependency requirements, that is addressed by adding any dependencies to the job submission. Brian Andrus On 7/25/2020 2:50 AM, Даниил Вахрамеев wrote: Hi everyone

[slurm-users] Jenkins integration

2020-07-24 Thread Brian Andrus
root could be quite useful. Especially for service accounts. Yes, there can be a workaround using sudo, but it seems better if we could track things in slurm to know a job was run 'on behalf of' another user. Thoughts, suggestions, current approaches? Thanks, Brian Andrus

Re: [slurm-users] Slurm MySQL database configuration

2020-07-21 Thread Brian Andrus
slurm daemons going down. Brian Andrus On 7/21/2020 7:44 AM, Peter Mayes wrote: Hi, My first post to the list, so apologies if this is a FAQ, My configuration has two nodes allocated for Slurm masters, with a highly-available NFS server mounting a filesystem across the two nodes. I need advice

Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Brian Andrus
Ah, They are assuming you are running the web interface as root. If your environment is secure enough, you can do that. Or, grant your web server user privileges in slurm to be allowed to use the "--uid" option. Brian Andrus On 7/20/2020 8:39 AM, Sidhu, Khushwant wrote: H

Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Brian Andrus
You are trying to use sbatch with the "--uid" option which is only allowed by root. Either run sbatch as the user doing the request (which should be the same user that is running rstudio) or use 'sudo -u ' to run sbatch. Brian Andrus On 7/20/2020 7:50 AM, Sidhu, Khushwant wrote:

Re: [slurm-users] changes in slurm.

2020-07-09 Thread Brian Andrus
, the partition is used to determine which node(s) and filter/order jobs. You should add the node to the new partition, but also leave it in the 'test' partition. If you are looking to remove the 'test' partition, set it to down and once all the running jobs that are in it finish, then remove it. Brian

Re: [slurm-users] Advice for merging accounting data

2020-07-08 Thread Brian Andrus
you set that in the slurm.conf to continue the numbering from where you left off so there are no entries in accounting that get replaced. Brian Andrus On 7/8/2020 3:15 AM, Simon Kainz wrote: Hello, we have a long-running slurm cluster, accounting into slurmdbd/mysql backend on the cluster

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-19 Thread Brian Andrus
thentication <https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication> because normal users have no business on those servers! Brian Andrus On 6/17/2020 1:26 AM, Ole Holm Nielsen wrote: On 6/9/20 5:45 PM, Michael Jennings wrote: On Tuesday, 09 June 2020, at 12:43:34

Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Brian Andrus
them outside the cluster. Brian Andrus On 6/19/2020 5:04 AM, David Baker wrote: Hello, We are currently helping a research group to set up their own Slurm cluster. They have asked a very interesting question about Slurm and file systems. That is, they are posing the question -- do you need

Re: [slurm-users] GUI application crash on first allocation, but runs fine on second allocation

2020-06-09 Thread Brian Andrus
Sounds like a race condition where slurmd is starting before the node is truly ready. You can try adding dependencies for slurmd so it will not start until some other needed service is running. The benefits of systemd :) Brian Andrus On 6/9/2020 10:53 AM, Dumont, Joey wrote: Hi, I

[slurm-users] configless DNS entries

2020-06-09 Thread Brian Andrus
/configless_slurm.html Brian Andrus

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-06-02 Thread Brian Andrus
are running. slurmd should be running as root. It needs to be able to do a few things including run the job as the user that submitted it. Things that only root should be doing. Brian Andrus On 6/2/2020 2:00 PM, Ferran Planas Padros wrote: Hi Ole, I run the same version of slurm in all

Re: [slurm-users] RAM "overbooking"

2020-05-27 Thread Brian Andrus
Heh. That is the on-going "user education". You could change the amount of ram requested using a job_submit lua script, but that could bite those that are accurate with their requests. Or set a max ram for the partition. Brian Andrus On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:

Re: [slurm-users] Reset TMPDIR for All Jobs

2020-05-12 Thread Brian Andrus
Maybe too obvious, but have you checked your .bashrc, .bash_profile and such? Brian Andrus On 5/12/2020 10:27 AM, Ellestad, Erik wrote: Which SLURM prolog specifically? I’m not finding that to work for me in either task-prolog or prolog. SLURM_TMPDIR and TMPDIR are still both set to /tmp

Re: [slurm-users] Alternative to munge for use with slurm?

2020-04-20 Thread Brian Andrus
For CentOS/RHEL, it is in the OpenFusion repo: http://repo.openfusion.net/centos7-x86_64/ just     yum install http://repo.openfusion.net/centos7-x86_64/openfusion-release-0.7-1.of.el7.noarch.rpm then     yum install libjwt-devel Brian Andrus On 4/18/2020 2:27 PM, Daniel Letai wrote

Re: [slurm-users] Munge decode failing on new node

2020-04-19 Thread Brian Andrus
the next uid on any node. The error below looks like you may have a different uid for the slurm user on the node. What uid is slurmd running as on the bad node vs a good node? Brian Andrus On 4/17/2020 2:38 PM, Dean Schulze wrote: Just noticed this.  On the problem node the munged.log file

Re: [slurm-users] srun --reboot option is not working

2020-03-10 Thread Brian Andrus
. It could probably be worked around, but not in a simple way. Easier to upgrade to the newest release :) Brian Andrus On 3/9/2020 10:14 AM, MrBr @ GMail wrote: Hi Brian The nodes work with slurm without any issues till I try the "--reboot" option. I can successfully allocate the no

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
normal users cannot use "--reboot" Brian Andrus On 3/9/2020 10:14 AM, MrBr @ GMail wrote: Hi Brian The nodes work with slurm without any issues till I try the "--reboot" option. I can successfully allocate the nodes or any other slurm related operation > You may want to dou

Re: [slurm-users] srun --reboot option is not working

2020-03-09 Thread Brian Andrus
' from the node and verify it is able to talk to slurmctld from the node and verify slurmd started successfully. Brian Andrus On 3/9/2020 4:38 AM, MrBr @ GMail wrote: Hi all I'm trying to use the --reboot option of srun to reboot the nodes before allocation. However the nodes not been

[slurm-users] Hybrid compiling options

2020-02-28 Thread Brian Andrus
on that are. Brian Andrus

Re: [slurm-users] Setup for backup slurmctld

2020-02-26 Thread Brian Andrus
I would say so. Certainly, if you have many nodes and/or many jobs being submitted, you will see an impact, but in my experience comparing Slurm to SGE, Slurm has much less overhead to cause as much impact. Brian Andrus On 2/26/2020 1:05 PM, Joshua Baker-LePain wrote: On Wed, 26 Feb 2020

Re: [slurm-users] Setup for backup slurmctld

2020-02-26 Thread Brian Andrus
easy to do. Just add the lines to your slurm.conf for the backup controller, start it up and reconfigure for all running nodes to be aware of it. Brian Andrus On 2/26/2020 12:48 PM, Joshua Baker-LePain wrote: We're planning the migration of our moderately sized cluster (~400 nodes, 40K jobs

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Brian Andrus
Bright is not needed... for much of anything... On 2/25/2020 12:48 PM, Robert Kudyba wrote: I suppose I can ask Bright Computing but does anyone know what version of Bright is needed? I would guess 8.2 or 9.0. Definitely want to dive into this.

Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-10 Thread Brian Andrus
Usually means you updated the slurm.conf but have not done "scontrol reconfigure" yet. Brian Andrus On 2/10/2020 8:55 AM, Robert Kudyba wrote: We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11.12. We're getting the below errors when I restart the slurmct

Re: [slurm-users] problem running slurm

2020-02-07 Thread Brian Andrus
You're trying to run bash which, without special configuration, needs a pty. Try: srun -v -p debug --pty bash Brian Andrus On 2/6/2020 10:28 PM, Hector Yuen wrote: Hello, I am setting up a very simple configuration: one node running slurmd and another one running slurmctld. In the slurmctld

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus
Check the slurmd log file on the node. Ensure slurmd is still running. Sounds possible that OOM Killer or such may be killing slurmd Brian Andrus On 1/20/2020 1:12 PM, Dean Schulze wrote: If I restart slurmd the asterisk goes away.  Then I can run the job once and the asterisk is back

Re: [slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

2020-01-20 Thread Brian Andrus
ster generically, so their configs are not getting matched to the specific info in your main config Brian Andrus On 1/20/2020 10:37 AM, Robert Kudyba wrote: I've posted about this previously here <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V

Re: [slurm-users] slurm elastic compute / power saving

2020-01-07 Thread Brian Andrus
I think we would need to see your SuspendScript to get a better idea of what is happening. That error indicates the nodes are likely not running slurmd and the control daemon thinks they are still up. What is the output of 'sinfo -R'? Brian Andrus On 1/7/2020 3:42 AM, Steve Brasier wrote

Re: [slurm-users] Partition question

2019-12-16 Thread Brian Andrus
depends on what best suits the specific needs. Brian Andrus On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote: Hello    I am looking into switching from Univa (sge) to slurm and am figuring out how to implement some of our usage policy in slurm. We have a Univa queue which uses job classes

Re: [slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
You prompted me to dig even deeper into my epilog. I was trying to access a semaphore file in the user's home directory. It seems that when the epilogue is run the ~ is not expanded in any way. So I can't even use ~${SLURM_JOB_USER} to access their semaphore file. Potentially problematic for

[slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
a cleanup script run on jobs that have timed out? Brian Andrus

[slurm-users] nss_slurm and sudo

2019-12-09 Thread Brian Andrus
So it seems nss_slurm does not play well with sudo. If I connect to a box that uses it and try to use sudo, I get: *sudo: PAM account management error: Authentication service cannot retrieve authentication info* Has anyone else seen this? Is there a workaround? Brian Andrus

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Brian Andrus
crickets.  I think in our case we were not able to ensure that the epilog always ran for different types of job failures, so we just had the users add some more cleanup code to the end of their jobs _and_ also run separate cleanup jobs. Regards, Alex On Wed, Dec 4, 2019 at 7:29 PM Brian Andrus

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2019-12-08 Thread Brian Andrus
s have had the same issue and even add to comments in the bugs, but no responses/resolution for this have been posted. FWIW, I also see the issue with the latest slurm 20.05 pre1 code. Brian Andrus On 12/5/2019 11:46 PM, von St. Vieth, Benedikt wrote: Hi again, I answered this question on Oct 2

Re: [slurm-users] Slurm 19-05-4-1 and Centos8

2019-12-05 Thread Brian Andrus
Tim claims it works... I have compiled it, but when you try to run slurmd, it throws some errors and will not start. From a previous thread: While I can successfully build/run slurmctld, slurmd is failing because ALL of the SelectType libraries are missing symbols. Example from

[slurm-users] Timeout and Epilogue

2019-12-04 Thread Brian Andrus
Quick question: Is the epilogue script run if a job exceeds its time limits and is being canceled? What about just cancelled? I need to be able to clean up some job-specific files regardless of how the job ends and I'm not sure epilogue is sufficient. Brian Andrus

Re: [slurm-users] Filter slurm e-mail notification

2019-11-26 Thread Brian Andrus
server you use. The best solution, of course, is to educate the users. You could create a job_submit plugin that removes mail options for arrays, but you may negatively impact users that do need that. Brian Andrus On 11/25/2019 10:55 PM, ichebo...@univ.haifa.ac.il wrote: I meant on the admin

Re: [slurm-users] Filter slurm e-mail notification

2019-11-25 Thread Brian Andrus
FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array./ Brian Andrus On 11/25/2019 1:48 AM, ichebo...@univ.haifa.ac.il wrote: Hi, I would like to ask if there is some options to configure the e-mail notification of slurm job

Re: [slurm-users] Environment modules

2019-11-24 Thread Brian Andrus
/openmpi), which forces only one version to be able to be loaded. I also set paths so specific versions of libraries become available depending on what environment you select (gcc vs intel for example). Is there something besides versioning that lmod shines at? Brian Andrus On 11/24/2019 12:48 AM

[slurm-users] nss_slurm not passing groups

2019-11-22 Thread Brian Andrus
, I get back 41 groups I am in. Bug? Brian Andrus

Re: [slurm-users] How to use a pyhon virtualenv with srun?

2019-11-17 Thread Brian Andrus
t actually sharing homes could be the cause. Brian Andrus On 11/17/2019 11:24 AM, Yann Bouteiller wrote: Hello, I am trying to do this on computecanada, which is managed by slurm: https://ray.readthedocs.io/en/latest/deploying-on-slurm.html However, on computecanada, you cannot inst

Re: [slurm-users] Limiting the number of CPU

2019-11-11 Thread Brian Andrus
You are trying to specifically run on node cn110, so you may want to check that out with sinfo. A quick "sinfo -R" can list any down machines and the reasons. Brian Andrus On 11/10/2019 11:23 PM, Sukman wrote: Hi Brian, I see. Thank you for your suggestion. I definitely will try i

Re: [slurm-users] job priority keeping resources from being used?

2019-11-01 Thread Brian Andrus
Brian Andrus <mailto:toomuc...@gmail.com>> wrote: Are you specifying memory for each of the jobs? Can't run a small job if there isn't enough memory available for it. Brian Andrus On 11/1/2019 7:42 AM, c b wrote: I have: SelectType=select/cons_res SelectTypeP

Re: [slurm-users] job priority keeping resources from being used?

2019-11-01 Thread Brian Andrus
Are you specifying memory for each of the jobs? Can't run a small job if there isn't enough memory available for it. Brian Andrus On 11/1/2019 7:42 AM, c b wrote: I have: SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory On Fri, Nov 1, 2019 at 10:39 AM Mark Hahn <mailt

Re: [slurm-users] Store sstat information permanently on job completion?

2019-10-30 Thread Brian Andrus
Except sstat can give you the MaxRSS without having cgroups and it will give you a simple MaxRSS, whereas sacct provides a MaxRSS for every step... have to play with that data to get the high water mark grrr. I had tried to use sstat in an epilogue but apparently that is too late... Brian

Re: [slurm-users] RHEL8 support

2019-10-30 Thread Brian Andrus
ckages except pmix-devel. Haven't figured that one yet. Brian Andrus On 10/30/2019 11:18 AM, Christopher Benjamin Coffey wrote: Yes, I'd be interested too. Best, Chris

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-29 Thread Brian Andrus
I prefer building packages. I did have to extract and change the .spec file to accommodate some of the changes as well as set up the environment to complete. Brian On 10/29/2019 8:11 AM, Christopher Benjamin Coffey wrote: Brian, I've actually just started attempting to build slurm 19 on

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2019-10-28 Thread Brian Andrus
/libslurmfull.so|grep powercap_ 0010f7b8 T slurm_free_powercap_info_msg 00060060 T slurm_print_powercap_info_msg So, sure enough powercap_get_cluster_current_cap is not in there. Methinks the linking needs examined. Brian Andrus On 10/28/2019 2:32 AM, Benjamin Redling

Re: [slurm-users] RHEL8 support

2019-10-28 Thread Brian Andrus
-1.el8.x86_64.rpm slurm-slurmdbd-19.05.3-1.el8.x86_64.rpm slurm-torque-19.05.3-1.el8.x86_64.rpm Brian Andrus On 10/28/2019 2:32 AM, Benjamin Redling wrote: On 28/10/2019 08.26, Bjørn-Helge Mevik wrote: Taras Shapovalov writes: Do I understand correctly that Slurm19 is not compatible

Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Brian Andrus
IIRC, the big difference is if you want to use cgroups on the nodes. You must use the cgroup plugin. Brian Andrus On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote: Hi Juergen, From what I see so far, there is nothing missing from the jobacct_gather/linux plugin vs the cgroup

Re: [slurm-users] How to create a partition where only one job can run concurrently?

2019-10-18 Thread Brian Andrus
. Brian Andrus On 10/18/2019 1:03 PM, bbenede...@goodyear.com wrote: Greetings! I am trying to set up a partition that will only allow one job at a time to run, regardless of who submits it. So multiple jobs from multiple users can be in the queue. But I only want the partition to run one

[slurm-users] Sacct selecting jobs outside range

2019-10-16 Thread Brian Andrus
:34 2019-10-01T00:00:44 00:00:10 Brian Andrus

Re: [slurm-users] Execute scripts on suspend and cancel

2019-10-16 Thread Brian Andrus
tun Peksel* oytun.pek...@semcon.com <mailto:oytun.pek...@semcon.com> +46739205917 *From:*slurm-users *On Behalf Of *Brian Andrus *Sent:* den 15 oktober 2019 20:58 *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] Execute scripts on su

Re: [slurm-users] Execute scripts on suspend and cancel

2019-10-15 Thread Brian Andrus
handling until they have it as part of their app. Brian Andrus On 10/14/2019 4:40 AM, Oytun Peksel wrote: It is quite weird if slurm has no mechanism as described. I have been digging more into it and someone suggested a workaround using mail notifications. You use a script instead of the mail

[slurm-users] ResumeProgram not running

2019-10-10 Thread Brian Andrus
that are idle~ but no calls to the script. If I restart slurmctld, the backlog starts running and things work. Any ideas what could cause this? Brian Andrus

Re: [slurm-users] sacct command to show time for node to start

2019-09-21 Thread Brian Andrus
Lyn, That was it, thanks! sacct -o reserved Brian On 9/21/2019 9:26 AM, Lyn Gerner wrote: Hey Brian, I think the discussion was in the context of suspend/resume, and it was the Reserved value that effectively represents that time. Regards, Lyn On Sat, Sep 21, 2019 at 9:15 AM Brian Andrus

[slurm-users] sacct command to show time for node to start

2019-09-21 Thread Brian Andrus
There was a command shared at the SLUG that showed how long it took a node to go from a power_down (idle~) state to up and having a job running on it, but I cannot remember what it was. Does anyone recall that? Brian Andrus

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
=18446744073709551614,4=1,5=4 | ++-++ Brian Andrus On Mon, Sep 16, 2019 at 2:58 PM Brian Andrus wrote: > I have > JobAcctGatherType = jobacct_gather/linux > > Brian > > On Mon, Sep 16, 2019 at 12:40 PM Antony Cleave

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
is used to collect accounting information. Supported values are > *jobacct_gather/linux* (recommended), *jobacct_gather/cgroup* and > *jobacct_gather/none* (no information collected). > > Antony > > > On Mon, 16 Sep 2019, 14:07 Brian Andrus, wrote: > >> Yep, the ma

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Brian Andrus
, Christopher Samuel wrote: On 9/15/19 4:17 PM, Brian Andrus wrote: Are steps required to capture Max RSS? No, you should see a MaxRSS reported for the batch step, for instance: $ sacct -j $JOBID -o jobid,jobname,maxrss All the best, Chris

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Brian Andrus
The jobs have definitely completed when I try to gather the info. Brian On 9/15/2019 4:01 PM, Steven Dick wrote: I don't think it shows up until the job completes. On Sat, Sep 14, 2019 at 2:25 AM Brian Andrus wrote: Quick question? When I use sacct to show job stats, it always has a blank

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Brian Andrus
Hmm. We are only using allocations and have slurm.conf configured with: AccountingStorageEnforce=associations,nosteps Are steps required to capture Max RSS? Brian On 9/15/2019 1:48 PM, Mark Hahn wrote: When I use sacct to show job stats, it always has a blank entry for the MaxRSS field. Is

[slurm-users] MaxRSS not showing up in sacct

2019-09-14 Thread Brian Andrus
Quick question? When I use sacct to show job stats, it always has a blank entry for the MaxRSS field. Is there something that needs enabled to get that in? I do see it if I use sstat while the job is running. Brian Andrus

Re: [slurm-users] SLURM in Virtual Machine

2019-09-12 Thread Brian Andrus
. However, there are definite use cases that make it worthwhile. So long as you allocate enough resources for the node (be it the controller or other) you will be fine. Brian Andrus On 9/12/2019 7:23 AM, Jose A wrote: Dear all, In the expansion of our Cluster we are considering to install SLURM
