[slurm-dev] Re: An issue about slurm on CentOS 7.3

2017-08-25 Thread Nicholas McCollum
away while the daemon has crashed or failed to start. -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Fri, 2017-08-25 at 06:08 -0600, Ole Holm Nielsen wrote: > On 08/25/2017 01:37 PM, Huijun HJ1 Ni wrote:> I installed > slurm on my cluste

[slurm-dev] Re: Is there a way to prevent the usage of gres for a specific partititon?

2017-07-20 Thread Nicholas McCollum
ROR end I'm not sure what the exact return of job_desc.gres is; it may not be nil. You'll have to test that part. There are probably other ways to do this, but I like to use the lua plugin in order to communicate to my users what they have done wrong. -- Nicholas McCollum HPC Systems Admin

[slurm-dev] Re: Dynamic, small partition?

2017-07-19 Thread Nicholas McCollum
e MaxNodes submitted at job submission to 20. if string.match(job_desc.qos, "special") then job_desc.max_nodes = 20 end Just a couple of ideas for you; there's probably a way better way to do it! -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authorit

[slurm-dev] Re: Job Submit Lua Plugin

2017-06-28 Thread Nicholas McCollum
%s.\n", test_min_nodes) slurm.log_user("\n%s", error_verbose) return slurm.ERROR end -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Wed, 2017-06-28 at 13:51 -0600, Nathan Vance wrote: > Correction (copy/pasted wrong thing): It was the > "

[slurm-dev] Re: Job Submit Lua Plugin

2017-06-27 Thread Nicholas McCollum
nJob must request a QoS using the --qos= flag.\n",asc_error_verbose) asc_qos = "invalid" end I'd be more than happy to share my job_submit.lua if anyone is interested. I only ask that you share yours back. -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Autho

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Nicholas McCollum
I'm about to update 15.08 to the latest SLURM in August and would appreciate any notes you have on the process. I'm especially interested in maintaining the DB as well as associations. I'd also like to keep the pending job list if possible. I've only got around 100,000 jobs in the DB so far, s

[slurm-dev] Re: set next job ID in scheduler

2017-04-10 Thread Nicholas McCollum
Set FirstJobId in your slurm.conf FirstJobId=12345 -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Mon, 2017-04-10 at 12:45 -0700, Edward Walter wrote: > Hi All, > > We recently experienced a RAID failure on one of our clusters > running  >

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Nicholas McCollum
ou to Ryan Cox for these excellent tools. -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Fri, 2017-03-17 at 08:59 -0700, Ryan Cox wrote: > usage_in_bytes is not actually usage in bytes, by the way.  It's > often close but I have seen wildly differe

[slurm-dev] Re: Getting one line per job in the sacct output

2017-03-13 Thread Nicholas McCollum
Hossein, Try: sacct -a --format=submit,start,partition,timeLimit,elapsed,TotalCPU,ReqMem,MaxRSS,AllocCPUS,job,state -X Note the -X flag. -X, --allocations: Only show cumulative statistics for each job, not the intermediate steps. -- Nicholas McCollum HPC Systems Administrator Alabama
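
The sacct call above is sensitive to stray whitespace in the --format list, which mail clients easily introduce when wrapping long lines. A minimal sketch, assuming a POSIX shell, that assembles the list from separate words and prints the resulting command for review; the helper name join_fields is hypothetical and not part of Slurm:

```shell
# join_fields is a hypothetical helper: it joins its arguments with commas.
# IFS is changed only inside the subshell, so the caller's IFS is untouched.
join_fields() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Field names are copied from the post above.
format=$(join_fields submit start partition timeLimit elapsed TotalCPU ReqMem MaxRSS AllocCPUS job state)

# Print the command instead of running it, so it can be checked first.
# -a covers all users; -X keeps one line per job allocation.
echo "sacct -a --format=$format -X"
```

On a machine with Slurm installed, the printed line can be pasted into the shell as is.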

[slurm-dev] Removing partition killed jobs

2017-02-13 Thread Nicholas McCollum
for all. I think in the future I will edit my job_submit.lua script and wait for all the jobs that have run through it to finish before removing partitions. My question for the group is, other than the above-mentioned method, is there something I could have done differently to prevent SLURM f

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Nicholas McCollum
have any users on it. I'm sure someone has already blazed this trail before, but this is how I am going about it. -- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Thu, 2017-02-09 at 07:32 -0800, Ryan Cox wrote: > John, > > We use /etc/security/li

[slurm-dev] Re: Reason=gres/gpu count too low

2016-12-06 Thread Nicholas McCollum
Have you checked to make sure your GPUs are in persistence mode? http://docs.nvidia.com/deploy/driver-persistence/ # nvidia-smi --persistence-mode=1 --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Tue, 6 Dec 2016, David van Leeuwen
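
One way to answer the persistence-mode question on a node before changing anything is to query the current state. A minimal sketch, assuming a POSIX shell and an nvidia-smi that supports the persistence_mode query field; the helper name count_non_persistent is hypothetical:

```shell
# count_non_persistent is a hypothetical helper: it reads one persistence
# status per line on stdin and prints how many lines are not "Enabled".
count_non_persistent() {
  grep -vc '^Enabled$'
}

# Only query the driver when nvidia-smi is actually installed on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | count_non_persistent
fi
```

A non-zero count suggests that at least one GPU on the node is not in persistence mode.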

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Nicholas McCollum
: 87% User: user6 --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Mon, 19 Sep 2016, Ryan Cox wrote: I should probably add some example output: Someone we need to talk to: Node | Memory (GB) | CPUs Hostname Alloc

[slurm-dev] Remote Visualization and Slurm

2016-08-17 Thread Nicholas McCollum
uster that integrates well with slurm, I would love to hear from you. Thanks! --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority

[slurm-dev] Re: SPANK plugin to access job info at submission stage

2016-07-19 Thread Nicholas McCollum
inters. I'm not an expert in this, but I feel like this plugin could use better documentation as it is quite flexible and powerful. --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Tue, 19 Jul 2016, Yong Qin wrote: Hi, I'm trying to

[slurm-dev] Re: how does slurm choose node to allocate ? how to modify this strategie ?

2016-07-06 Thread Nicholas McCollum
of memory so you might want to double check, but this is how I would do it. --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Wed, 6 Jul 2016, Benjamin Redling wrote: Hi, On 07/06/2016 11:17, Laurent Facq wrote: i would like to use only one partition with the 80 nodes,

[slurm-dev] Re: Updating slurm.conf

2016-06-16 Thread Nicholas McCollum
orrect the slurmctld will crash. If there is an error, an easy way to figure it out is to do a 'slurmctld -Dv' and it will fail and tell you what the issue is. Hopefully this helps. --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On

[slurm-dev] Re: Non-Propagation of ulimits

2016-06-05 Thread Nicholas McCollum
some consideration I feel I might be able to set up something with a prolog script, which I will test tomorrow. Thanks! --- Nicholas McCollum HPC Systems Administrator Alabama Supercomputer Authority On Sat, 4 Jun 2016, Pär Lindfors wrote: On 06/03/2016 08:28 PM, Nicholas

[slurm-dev] Default non-propagation of ulimits

2016-06-03 Thread Nicholas McCollum
n all submitted jobs. I've tried using /etc/sysconfig/slurm and it appears this file is ignored. I would even be happy if this is something that I could set in the job_submit.lua plugin, but I have not seen a variable for something like this. Any ideas? --- Nicholas Mc

[slurm-dev] Non-Propagation of ulimits

2016-06-03 Thread Nicholas McCollum
n all submitted jobs. I've tried using /etc/sysconfig/slurm and it appears this file is ignored. I would even be happy if this is something that I could set in the job_submit.lua plugin, but I have not seen a variable for something like this. Any ideas? --- Nicholas McCollum H