[slurm-dev] SLURM 16.05.10-2 jobacct_gather/linux inconsistencies?

2017-09-05 Thread John DeSantis
nfirm if this was happening before we upgraded from 15.08.4 to 16.05.10-2. Thanks, John DeSantis

[slurm-dev] Re: Exceeded job memory limit problem

2017-08-25 Thread John DeSantis
ely don't want compute nodes to start feeling memory pressure, leading to swapping. HTH, John DeSantis Sema Atasever wrote: > Hi Slurm-Dev, > > I have a *large dataset* stored as a text file. Consists of two separate > files (test and train) > > I am running int

[slurm-dev] Incompatible Slurm plugin version (16.5.10) and auth_munge.so

2017-07-28 Thread John DeSantis
the list so that others are aware - and that SLURM is still stable! Thanks, John DeSantis

[slurm-dev] Re: Multinode MATLAB jobs

2017-05-31 Thread John DeSantis
Loris, >> Does anyone know whether one can run multinode MATLAB jobs with Slurm I completely missed the _multinode_ part. Feel free to ignore, and sorry to all for the noise in the list! John DeSantis John DeSantis wrote: >

[slurm-dev] Re: Multinode MATLAB jobs

2017-05-31 Thread John DeSantis
that a pool is already open. [0] Nodes in our cluster depending on their age have between 12-24 processors available. If a user wants a parpool of 24, they must request either a constraint or a combination of -N 1 and - --ntasks-per-node=24, for example. HTH, John DeSantis Loris Bennett wr

[slurm-dev] Re: Compute nodes going to drained/draining state

2017-05-23 Thread John DeSantis
David, Are you running any epilog functions that may be placing the nodes into a drained/draining state? John DeSantis Baker D.J. wrote: > Hello, > > I've recently started using slurm v17.02.2, however something seems very od

[slurm-dev] Re: Query number of cores allocated per node for a job

2016-10-26 Thread John DeSantis
Kaizaad, I hate to say it, but I cannot _believe_ I never saw this detail in the man page. This information is extremely useful! John DeSantis On 10/26/2016 09:49 AM, Kaizaad Bilimorya wrote: > > Hi Chris, > > One way is to u

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-28 Thread John DeSantis
Christopher, Yes, it does restart - but that's how we've configured logrotate. John DeSantis On 09/28/2016 07:55 PM, Christopher Samuel wrote: > > On 29/09/16 01:16, John DeSantis wrote: > >> We get the same snippet

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-28 Thread John DeSantis
g [2016-09-22T03:16:01.217] Terminate signal (SIGINT or SIGTERM) received HTH, John DeSantis On 09/27/2016 07:38 PM, Christopher Samuel wrote: > > On 26/09/16 17:48, Philippe wrote: > >> [2016-09-26T08:02:16.582] Terminate signal (SIGINT or SIGTERM) >> received > >

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-27 Thread John DeSantis
hesitate to > share! What I would do is perform a restart of slurm using the "postrotate" command below, but remove the "--quiet" and ">/dev/null", and prefix "time" to it, e.g.: time /usr/sbin/invoke-rc.d slurm-llnl reconfig This way you'll be
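For reference, a minimal logrotate stanza along these lines (log path and rotation frequency are assumptions; the reconfig command is the one from this thread):

  # /etc/logrotate.d/slurm (sketch)
  /var/log/slurm-llnl/slurmctld.log {
      weekly
      compress
      missingok
      postrotate
          time /usr/sbin/invoke-rc.d slurm-llnl reconfig
      endscript
  }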

[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-26 Thread John DeSantis
lost any jobs due to: * ctld restarts * typos in slurm.conf (!!) * upgrades I've been especially guilty of typos, and FWIW SLURM has been extremely forgiving. HTH, John DeSantis On 09/26/2016 03:46 AM, Philippe wrote: > Hello everybody, I'm trying to understand an issue with 2 SL

[slurm-dev] Re: Slurmdbd

2016-09-23 Thread John DeSantis
Hello, Have you looked at the "slurm/slurm.h" file? Some of the information present in that DB table correlates to the code that is present. HTH, John DeSantis On 09/23/2016 03:15 AM, Lachlan Musicman wrote: > Is there a descri

[slurm-dev] Re: Confusing JobState Reason for Pending due to TimeLimit

2016-09-20 Thread John DeSantis
Emily, What version of SLURM are you running? We are running version 15.08.4 and have just run into the same issue. There was a bug report filed [1], and it states that the issue was corrected in version 14.08.11. Thanks, John DeSantis [1

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-19 Thread John DeSantis
le parameters for slurm.conf, this one escaped me! Thanks, John DeSantis On 09/18/2016 07:37 PM, Christopher Samuel wrote: > > On 18/09/16 03:45, John DeSantis wrote: > >> Try adding a "DefMemPerCPU" statement in your partition >> definitions, e.g > > You can

[slurm-dev] Re: Slurm 15.08.12 - Issue after upgrading to 15.08 - only one job per node is running

2016-09-17 Thread John DeSantis
Balaji, Try adding a "DefMemPerCPU" statement in your partition definitions, e.g.: PartitionName=PY34 Nodes=okdev1368 DefMemPerCPU=512 MaxTime=INFINITE State=UP shared=force:4 HTH, John DeSantis On 09/16/2016 04:44 PM, Balaji De

[slurm-dev] Re: Prolog script (maybe) question?

2016-09-15 Thread John DeSantis
RIPT) -ge 1 ]; then echo > "insert into ss_usage (user_name,recorded_on,jobID) values > ('$SLURM_JOB_USER',NOW(),'$SLURM_JOB_ID')"|mysql -u slurm > --password='' -D vasp_track -h fi > > exit 0 HTH, John DeSantis On 09/14/2016 01:44 PM, R

[slurm-dev] Re: slurmctld_srvcn segfault 15.08.4

2016-04-20 Thread John DeSantis
Chris, Thanks for the second set of eyes! John DeSantis On 04/19/2016 07:59 PM, Christopher Samuel wrote: > > On 16/04/16 21:51, John DeSantis wrote: > >> Anyways, we have experienced a random(?) slurmctld failure resulting in >> a segfault twice this week. > >

[slurm-dev] Re: slurmctld_srvcn segfault 15.08.4

2016-04-16 Thread John DeSantis
e some references to "packmem (valp=0x0" on the bugs.schedmd.com site, and bug 2453 seems oddly familiar, although the tres format strings are properly populated in both instances. Thanks in advance for any information! John DeSantis On 04/16/2016 07:50 AM, John DeSantis wrote: Hello,

[slurm-dev] slurmctld_srvcn segfault 15.08.4

2016-04-16 Thread John DeSantis
g to bet that I've overlooked something. Thanks! John DeSantis

[slurm-dev] Re: Oversubscribing nodes

2016-04-14 Thread John DeSantis
d be a mix of running and suspended jobs based upon (a) job priorities, and (b) partition priorities; maybe check the controller logs for preemption notices to confirm or deny this thought? At any rate, I'd suggest only using preemption based upon QOS. HTH, John DeSantis On 04/14/2016 0
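A minimal sketch of QOS-based preemption (the slurm.conf keywords are standard; the QOS names are illustrative):

  # slurm.conf
  PreemptType=preempt/qos
  PreemptMode=SUSPEND,GANG
  # then let the high QOS preempt the low one:
  sacctmgr modify qos high set Preempt=low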

[slurm-dev] Re: Oversubscribing nodes

2016-04-11 Thread John DeSantis
ting the parameter to be "Shared=FORCE:1"? The documentation states: "For example, a configuration of Shared=FORCE:1 will only permit one job per resource normally,". John DeSantis On 04/08/2016 02:03 PM, Wiegand, Paul wrote: > > This is *almost* what I want, but n

[slurm-dev] Re: Oversubscribing nodes

2016-04-08 Thread John DeSantis
Paul, Try changing the Partition "Shared=FORCE" statement to "Shared=NO". We do that on all of our partitions and get the desired behavior. John DeSantis On 04/08/2016 01:06 PM, Wiegand, Paul wrote: > > Greetings, > > I would like to have our cluster configu

[slurm-dev] Re: ANSYS // FLUENT 17 with SLURM 15.08.5

2016-04-08 Thread John DeSantis
uent, supply the "--cnf=" flag which points to the hosts file. John DeSantis On 04/08/2016 09:07 AM, David Grasselt wrote: > Dear respected SLURM User/Developper, > > I am contacting you because of having trouble getting Fluent/ANSYS 17 to > work with SLURM 15.08.5. > I t
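For illustration, a submission script might build the hosts file like this (the scontrol usage is standard Slurm; the Fluent solver name and flags other than --cnf= are assumptions):

  HOSTSFILE=$SLURM_SUBMIT_DIR/hosts.$SLURM_JOB_ID
  scontrol show hostnames $SLURM_JOB_NODELIST > $HOSTSFILE
  fluent 3ddp -g -t$SLURM_NTASKS --cnf=$HOSTSFILE -i input.jou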

[slurm-dev] Re: Job stuck in CompletinG state

2016-04-08 Thread John DeSantis
controller was online, changes were made to the /etc/hosts file and the controller was not able to resolve the proper address afterwards. After we corrected the addressing issue and restarted slurmctld, the problem was resolved. John DeSantis On 04/07/2016 11:38 PM, Naajil Aamir wrote: > Hi th

[slurm-dev] Re: Overview of jobs in the cluster

2016-03-24 Thread John DeSantis
use "squeue -w " which gives us all corresponding jobs running on the host(s) in question. It is actually quite useful if you live on the command line. Usiamo solo i tool da console. C'è un'altro tool da vedere: "sview". Non ho mai usato "slurmtop", percio "sview" non potrebbe essere utile a te. John DeSantis

[slurm-dev] Re: One CPU always reserved for one GPU

2016-03-01 Thread John Desantis
example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node’s CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/qu

[slurm-dev] Re: Update job and partition for shared jobs

2016-01-27 Thread John Desantis
f an empty task_id_bitmap. John DeSantis 2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor : > John, > > > > Thanks. That seemed to help; a job started on a node that had a job on it > once the job that had been on it (‘using’ all the memory) completed.

[slurm-dev] Re: Update job and partition for shared jobs

2016-01-26 Thread John Desantis
Brian, Try setting a default memory per CPU in the partition definition. Later versions of SLURM (>= 14.11.6?) require this value to be set, otherwise all memory per node is scheduled. HTH, John DeSantis 2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor : > All, > > > > I
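A sketch of such a partition line (names and the 2048 MB value are illustrative):

  # slurm.conf
  PartitionName=work Nodes=node[01-16] DefMemPerCPU=2048 State=UP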

[slurm-dev] Re: problem using srun to start an interactive job with GPU gres

2016-01-22 Thread John Desantis
Chris, Could you enable the Gres debugging via the DebugFlags and post the relevant output? It would be interesting to see what the logs state concerning what Gres types have been found on the node in question. John DeSantis 2016-01-22 12:31 GMT-05:00 Chris Paciorek : > > Hi John, w
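Enabling that looks roughly like this (standard options; older releases take a numeric SlurmdDebug level instead):

  # slurm.conf
  DebugFlags=Gres
  SlurmdDebug=debug
  # push the change out and watch the node's slurmd log
  scontrol reconfigure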

[slurm-dev] Re: problem using srun to start an interactive job with GPU gres

2016-01-22 Thread John Desantis
ockets=2 RealMemory=32073 Feature="" Gres=gpu:2 Weight=1000 # gres.conf NodeName=racka-[1-8] Name=gpu File=/dev/nvidia0 NodeName=rackb-[1-10,19-29] Name=gpu File=/dev/nvidia[0-1] John DeSantis 2016-01-21 23:21 GMT-05:00 Chris Paciorek : > > Whoops, there was a bug in my po

[slurm-dev] Re: problem using srun to start an interactive job with GPU gres

2016-01-21 Thread John Desantis
Chris, Try using "--pty /bin/bash" to get a shell, and see if that helps. John DeSantis On Jan 21, 2016 5:47 PM, "Chris Paciorek" wrote: > > We've been trying out the use of gres to control access to our GPU. It > works fine for a batch submission but wh
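i.e., something along the lines of (the gres request is illustrative):

  srun -n1 --gres=gpu:1 --pty /bin/bash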

[slurm-dev] Re: Jobs stuck in completing state

2016-01-18 Thread John Desantis
ences some nodes), and unfortunately, the node in question resolved to two different IP addresses; jobs could get dispatched, but would never register a completion with the controller. John DeSantis On Jan 18, 2016 5:08 AM, "Danny Rotscher" wrote: > Hello, > > since we upgrad

[slurm-dev] Re: slurmdbd upgrade

2016-01-13 Thread John Desantis
Andrew, Our database has roughly 3.4 million rows in the entire schema (including a view, but we also purge job records after 6 months). After the slurmdbd was upgraded, it took ~9 minutes (manifested by the changes being performed in the log) before the daemon was active again. HTH, John

[slurm-dev] Re: A floating exclusive partition

2015-11-23 Thread John Desantis
would run > 1 element on special. Would it then use public for the other 3 elements > (provided public has some idle nodes)? As long as the special partition is idle, I'd assume that the "special" partition would take as many jobs as possible and then dispatch the remainin

[slurm-dev] Re: A floating exclusive partition

2015-11-22 Thread John Desantis
Ryan, I believe this is the default behavior of reservations unless the flag "static_alloc" is specified. John DeSantis 2015-11-21 22:13 GMT-05:00 Novosielski, Ryan : > I could have sworn that I just heard it was possible to create a floating > reservation for any number of no
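For reference, a reservation created with that flag might look like this (name, user, size, and duration are illustrative):

  scontrol create reservation reservationname=float_res users=alice \
      starttime=now duration=7-00:00:00 nodecnt=4 flags=static_alloc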

[slurm-dev] Re: A floating exclusive partition

2015-11-22 Thread John Desantis
he "public" partition and run there: "--partition=special,public" I believe method would allow the project the best use of the system resources without needing to utilize a reservation or preemption (currently). HTH! John DeSantis 2015-11-21 11:29 GMT-05:00 Daniel Letai : &g

[slurm-dev] Re: A floating exclusive partition

2015-11-19 Thread John Desantis
hen not in use the hardware would be idle and unavailable to other users. John DeSantis 2015-11-19 13:31 GMT-05:00 Daniel Letai : > > The other issue is how to define the "public" partition. It would also have > to float, with lower priority, or else how would you achieve exclu

[slurm-dev] Re: Weird job pending reason

2015-10-14 Thread John Desantis
Taras, We see this message when a scheduled node has experienced an issue with slurmd and/or munge, and can no longer accept jobs. You can use 'scontrol release job_id' to reschedule the job. Please note though, that 'job_id' is the actual job number reported in squeue. John DeSantis

[slurm-dev] Re: Array bug in 14.11.3?

2015-08-10 Thread John Desantis
Will, It isn't mentioned, and this should probably be answered by the developers, but do you know if this bug contributes to the MaxJobCount value being too high? Thanks! John DeSantis 2015-08-10 11:31 GMT-04:00 John Desantis : > Will et al, > > Thanks! > > I didn'

[slurm-dev] Re: Array bug in 14.11.3?

2015-08-10 Thread John Desantis
Will et al, Thanks! I didn't see any mention of this within (quick searches) via the mailing lists, so apologies to all for unintended noise. John DeSantis 2015-08-10 11:28 GMT-04:00 Will French : > Yep, this was a bug that was fixed in 14.11.6. See: > > http://bugs.schedmd.co

[slurm-dev] Array bug in 14.11.3?

2015-08-10 Thread John Desantis
scriptandoutput-1.txt [2] http://s3.enemy.org/~mrfusion/client_snippets/squeue_scriptandoutput-2.txt Thank you, John DeSantis

[slurm-dev] Re: timeout issues

2015-07-14 Thread John Desantis
messages. As far as the automated submissions go, we haven't yet run into a similar situation. We did get a few users submitting jobs via scripts, but we targeted them using a QOS (MaxCPUs & MaxSubmitJobs) to control their behavior. John DeSantis 2015-07-14 11:42 GMT-04:00 Char
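A sketch of such a QOS (limit names vary slightly by version; these are assumptions for that era):

  sacctmgr add qos throttle
  sacctmgr modify qos throttle set MaxCPUsPerUser=64 MaxSubmitJobsPerUser=50
  sacctmgr modify user someuser set QOS=throttle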

[slurm-dev] Re: More reservation woes

2015-07-08 Thread John Desantis
like a solid target; sadly, I completely neglected to verify via the source how the reservation states were handled via slurmctld (apologies, Bruce!). Our reservations have worked 99% of the time in 14.x (we started with 14.03.2-2, then upgraded to 14.03.6, and now 14.11.3). Maybe one of the deve

[slurm-dev] Re: More reservation woes

2015-07-07 Thread John Desantis
've re-created all reservations and there is still unintentional overlapping occurring, I'd recommend looking at the reservation table(s) within the DB and possibly truncating it(them). John DeSantis 2015-07-06 16:26 GMT-04:00 Bill Barth : > > On 7/6/15, 2:08 PM, "John Desantis"

[slurm-dev] Re: More reservation woes

2015-07-06 Thread John Desantis
o delete and re-create the affected reservations? John DeSantis 2015-07-06 14:49 GMT-04:00 Bill Barth : > > John, > > Thanks for your suggestion, but I think I must have miscommunicated. I > don't want the reservations to overlap, so I want to figure out how to > preven

[slurm-dev] Re: More reservation woes

2015-07-06 Thread John Desantis
tions ensuring that the "OVERLAP" flag is present. The reason I have suggested #1 is because in our case I didn't want any long running jobs to land on the nodes while re-creating the reservation (we're using the same set of nodes), further causing grief for the reservation use

[slurm-dev] Re: concurrent job limit

2015-06-11 Thread John Desantis
s (specifically cores), and with accounting you can apply limits per user or as a whole for a group (account). John DeSantis 2015-06-11 10:12 GMT-04:00 Martin, Eric : > Is there a way for users to self limit the number of jobs that they > concurrently run? > > Eric Martin > Center
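e.g., with association limits (user and account names hypothetical):

  sacctmgr modify user alice set MaxJobs=10    # per-user running-job cap
  sacctmgr modify account proj set GrpJobs=40  # cap for the whole account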

[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-10 Thread John Desantis
array tasks) or they can use a matlab pool of up to either 12 or 16 cores on 1 node for SMP jobs. Integration scripts aren't needed in this set-up. All that is required is a normal submission script. John DeSantis 2015-06-09 17:24 GMT-04:00 Hadrian Djohari : > Hi, > > We are in the pro

[slurm-dev] Re: srun + openmpi : Missing locality information

2015-06-05 Thread John Desantis
t;? Those would be the options that I'd immediately try to begin trouble-shooting the issue. John DeSantis 2015-06-02 14:19 GMT-04:00 Paul van der Mark : > > All, > > We are preparing for a switch from our current job scheduler to slurm > and I am running into a strange issue.

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-07 Thread John Desantis
on from 8 to 2 and I'd also remove the "Count=" because specifying "File=" is enough for Slurm (with what I've seen). I should also add that I'm running Slurm 14.11.3, so without researching the changelogs, I cannot comment if there were changes made to Gres cod

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, Ok, at this point I'd suggest enabling the DebugFlags=Gres in your slurm.conf and turning up the SlurmctldDebug level to debug. You could also change SlurmdDebug to a higher debug level as well. There may be some clues in the extra output. John DeSantis 2015-05-06 16:57 GMT-

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, Use the same gres.conf on all nodes in the cluster (including the controller), and then restart slurm and try again. John DeSantis On May 6, 2015 4:22 PM, "Daniel Weber" wrote: > > Hi John, > > I added the types into slurm.conf and the gres.conf files on the node

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, I hit send without completing my message: # gres.conf NodeName=blah Name=gpu Type=Tesla-T10 File=/dev/nvidia[0-1] HTH. John DeSantis 2015-05-06 15:30 GMT-04:00 John Desantis : > Daniel, > > You sparked an interest. > > I was able to get Gres Types working by: >

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
salloc: job 532507 queued and waiting for resources # slurm.conf Nodename=blah CPUs=16 CoresPerSocket=4 Sockets=4 RealMemory=129055 Feature=ib_ddr,ib_ofa,sse,sse2,sse3,tpa,cpu_xeon,xeon_E7330,gpu_T10,titan,mem_128G Gres=gpu:Tesla-T10:2 Weight=1000 # gres.conf 2015-05-06 15:25 GMT-04:00 John Desantis

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, "I can handle that temporarily with node features instead but I'd prefer utilizing the gpu types." Guilty of reading your response too quickly... John DeSantis 2015-05-06 15:22 GMT-04:00 John Desantis : > Daniel, > > Instead of defining the GPU type in our Gr

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
are being seen correctly on a node by the controller. I also wonder if using a cluster wide Gres definition (vs. only on nodes in question) would make a difference or not. John DeSantis 2015-05-06 15:12 GMT-04:00 Daniel Weber : > > Hi John, > > I already tried using "Count=1

[slurm-dev] Re: slurm-dev Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, What about a count? Try adding a count=1 after each of your GPU lines. John DeSantis 2015-05-06 11:54 GMT-04:00 Daniel Weber : > > The same "problem" occurs when using the gres type in the srun syntax (using > i.e. --gres=gpu:tesla:1). > > Regards,

[slurm-dev] Re: Job allocation for GPU jobs doesn't work using gpu plugin (node configuration not available)

2015-05-06 Thread John Desantis
Daniel, We don't specify types in our Gres configuration, simply the resource. What happens if you update your srun syntax to: srun -n1 --gres=gpu:tesla:1 Does that dispatch the job? John DeSantis 2015-05-06 9:40 GMT-04:00 Daniel Weber : > Hello, > > currently I'm trying

[slurm-dev] Re: Intel MPI, perhost, and SLURM: Can I override SLURM?

2015-05-01 Thread John Desantis
t I'd definitely try it out and see what kind of results you get. John DeSantis 2015-05-01 12:41 GMT-04:00 Will French : > > > >> >> If you use modules, perhaps you could detect when the module is loaded from >> a gateway and not set I_MPI_PMI_LIBRARY there. If yo

[slurm-dev] Re: Question about prologging

2015-04-16 Thread John Desantis
To all involved in this thread, Thank you very much for your pointers and suggestions! John DeSantis 2015-04-16 1:07 GMT-04:00 Christopher Samuel : > > On 16/04/15 14:43, Bill Barth wrote: > >> That's what I sent John off-list. Wasn't sure self-promotion was OK here.

[slurm-dev] Re: Question about prologging

2015-04-15 Thread John Desantis
There will be no job modifications at all. John DeSantis 2015-04-14 19:47 GMT-04:00 Christopher Samuel : > > On 15/04/15 08:16, David Bigagli wrote: > >> Using scontrol to get the command parameter is probably not recommended >> as that is the path inside the user directory

[slurm-dev] Re: Question about prologging

2015-04-14 Thread John Desantis
pool/slurmd/job$SLURM_JOB_ID. Do the developers and/or community as a whole see anything wrong with this method? John DeSantis 2015-04-14 12:46 GMT-04:00 John Desantis : > > Chris, > > Thanks for you reply. I've definitely vetted the URL several times > over the last two day
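A minimal sketch of that prolog approach (the spool path depends on SlurmdSpoolDir and is an assumption here; note the first message in this thread reports permission problems reading this file):

  #!/bin/bash
  # Prolog sketch: look for a marker in the spooled batch script
  SCRIPT=/var/spool/slurmd/job${SLURM_JOB_ID}/slurm_script
  if [ -r "$SCRIPT" ] && grep -q 'vasp' "$SCRIPT"; then
      logger "job $SLURM_JOB_ID appears to run vasp"
  fi
  exit 0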

[slurm-dev] Re: Question about prologging

2015-04-14 Thread John Desantis
d accomplish what I was looking to do. John DeSantis 2015-04-14 11:04 GMT-04:00 Christopher B Coffey : > Hi John, > > Have you looked into creating a LUA job submission script? You can > manipulate the job script before it begins execution. There is also this > if you

[slurm-dev] Re: Question about prologging

2015-04-14 Thread John Desantis
velopers then would be - is there a way with Slurm's current "out of the box" capabilities to parse a job submission script once it lands on a batch host? Thanks! John DeSantis 2015-04-13 10:24 GMT-04:00 John Desantis : > Hello all! > > I've been doing some test

[slurm-dev] Question about prologging

2015-04-13 Thread John Desantis
espite the file actually being there. I've even set extended ACLs on the directory so that the SlurmUser can see all of the files (sudo -u slurm ls -lR failed with permission denied). Could anyone tell me why the slurm_script file cannot be read via prolog? Thank you! John DeSantis

[slurm-dev] Re: Submitting batch jobs with crontab

2015-04-02 Thread John Desantis
Carl, I'd suggest explicitly setting a PATH in the script and also using "&" to put the job in the background (via cron): * * * * * /path/to/some/binary /path/to/some/script & John DeSantis 2015-04-02 4:22 GMT-04:00 Inigo Aldazabal Mensa : > > On Wed, 01 Apr
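i.e., a crontab along these lines (paths as in the example above):

  PATH=/usr/local/bin:/usr/bin:/bin
  0 2 * * * /path/to/some/binary /path/to/some/script &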

[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-19 Thread John Desantis
Felix, How does the routing table look on the controller? Is the IB network listed on the controller using the correct interface? John DeSantis 2015-03-19 10:48 GMT-04:00 Felix Willenborg : > > So i tried out installing the latest package (14.11.4-1) of slurm with no > success - unfo

[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-17 Thread John Desantis
Felix, Do the IP addresses associated with the NodeNames return proper matches when you run lookups? What happens if you don't use IP addresses and only host names within your Slurm configuration? John DeSantis 2015-03-17 11:30 GMT-04:00 John Desantis : > Felix, > > My

[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-17 Thread John Desantis
Felix, My fault, I suggested something that you already checked! John DeSantis 2015-03-17 11:28 GMT-04:00 John Desantis : > Felix, > > Can you ping the nodes from the controller and vise versa? > > The snippet below looks like a potential firewall issue: > > [2015-03-16

[slurm-dev] Re: Slurm is refusing to establish a connection between nodes and controller

2015-03-17 Thread John Desantis
de on port 6818 and then telnet'ing from each node to the controller on port 6817. John DeSantis 2015-03-17 11:23 GMT-04:00 Yann Sagon : > > > 2015-03-17 13:31 GMT+01:00 Felix Willenborg < > felix.willenb...@uni-oldenburg.de>: > >> >> Hi there, >> >&g

[slurm-dev] Re: How to debug a job that won't start

2015-03-13 Thread John Desantis
many - the job would have been rejected with a message indicating that the partition limits have been breached. But hey, that's what happens when you answer emails ~20 minutes after waking up! John DeSantis 2015-03-13 7:55 GMT-04:00 Uwe Sauter : > > Hi, > > thanks for looking into t

[slurm-dev] Re: How to debug a job that won't start

2015-03-13 Thread John Desantis
r espresso soon enough and will reply if anything else comes to mind. I hope this helps! John DeSantis 2015-03-12 4:59 GMT-04:00 Uwe Sauter : > > No one able to give a hint? > > Am 10.03.2015 um 17:05 schrieb Uwe Sauter: >> >> Hi, >> >> I have an account "prod

[slurm-dev] Re: Nodes showing down state in slurm even if daemon is running

2015-02-13 Thread John Desantis
Tejas, Can you ping the nodes via their hostnames listed in the slurm.conf file? Can you ping the controller(s) from the nodes? Is there a firewall running on any of the nodes in question? John DeSantis 2015-02-12 17:38 GMT-05:00 Novosielski, Ryan : > Check the logs. You'll likel

[slurm-dev] Re: slurmctld.log

2015-02-13 Thread John Desantis
shoes, I'd disable the "NO_CONF_HASH" DebugFlags value, turn up debugging verbosity, and double-check that all nodes in your cluster have the same slurm.conf. I would then restart all of the slurm daemons and then restart the slurmctl daemon on your controller(s). John DeSantis

[slurm-dev] Re: Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread John Desantis
Moe, Thank you! John DeSantis 2015-02-03 15:28 GMT-05:00 : > > I believe that is OK, but don't know off the top of my head. Either test for > youself or > 1. add the new partition > 2 move pending jobs to the new partition > 3. Delete the other partitions once their

[slurm-dev] Re: Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread John Desantis
Moe, One last question when you have a chance. If there are running jobs with active partitions now and we switch to one partition with a topology, would those running jobs be lost? Thank you! John DeSantis 2015-02-03 12:42 GMT-05:00 : > > You can configure a single queue and u

[slurm-dev] Re: Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread John Desantis
Uwe, I do have features assigned at the moment, but from what I've read that would require prolog scripting or user re-education. It looks like the lowest hanging fruit is the topology.conf option. Thanks, John DeSantis 2015-02-03 12:46 GMT-05:00 Uwe Sauter : > > Might be worth t

[slurm-dev] Re: Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread John Desantis
Moe, Thanks! John DeSantis 2015-02-03 12:42 GMT-05:00 : > > You can configure a single queue and use the topology/tree plugin to > identify the nodes on separate fabrics. > > > Quoting John Desantis : >> >> Hello all, >> >> Unfortunately, I have

[slurm-dev] Confusion regarding single partition with separate groups of nodes

2015-02-03 Thread John Desantis
multiple partition definitions with a DEFAULT clause or not. I've looked at the topology/tree plugin as well and, seeing that you can specify either switches or nodes, I wonder if this would be the preferred method to achieve 1 "global" partition which utilizes all of the separate hardware p
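For reference, the tree plugin is wired up roughly like this (switch and node names illustrative):

  # slurm.conf
  TopologyPlugin=topology/tree
  # topology.conf
  SwitchName=fabric_a Nodes=rack-1-[1-20]
  SwitchName=fabric_b Nodes=rack-2-[1-20]
  SwitchName=top Switches=fabric_a,fabric_b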

[slurm-dev] Re: default qos not inherited to new users

2015-02-03 Thread John Desantis
html If I restart the slurmctld, the default QOS is respected and can be verified by squeue. I'd recommend trying that instead of the modifications to see if you get the expected results or not. I saw this on 14.03.6 and with 14.11.3 John DeSantis 2015-02-03 4:12 GMT-05:00 "Dr. Markus S

[slurm-dev] Re: SLURM with VASP

2015-01-28 Thread John Desantis
nfigured the startup script in init.d so that there are ulimit values set when the daemon starts too. John DeSantis 2015-01-28 17:25 GMT-05:00 Trey Dockendorf : > John, > > Thanks for the response. We use PropagateResourceLimits=NONE and also set > both hard and soft for memlock t
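i.e., a couple of lines before the daemon is launched in the init script (values are assumptions; pair this with PropagateResourceLimits=NONE as discussed in the thread):

  ulimit -l unlimited   # memlock, e.g. for MPI over InfiniBand
  ulimit -s unlimited   # stack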

[slurm-dev] Re: SLURM with VASP

2015-01-28 Thread John Desantis
ropriate ulimit value; that you're not enforcing memory improperly. Also, this is more related to VASP than Slurm, but I've seen VASP segfault from the start if the input isn't correct in terms of CPUs requested versus what's in INCAR (NPAR/KPAR). Thanks, John DeSantis 2015-01

[slurm-dev] SlurmDBD upgrade from 14.03.6 to 14.11.3

2015-01-27 Thread John Desantis
ulted in no finds. As it turns out, all that needed to be done was upgrade the slurm-sql plugin. I then carried out the rest of the upgrade and no jobs were lost. I hope others find this information useful. John DeSantis

[slurm-dev] sacctmgr permissions and/or security?

2014-11-12 Thread John Desantis
have the "DefaultStorage*" variables set in addition to "AccountingStorageEnforce". Thank you, John DeSantis

[slurm-dev] Re: Explicit PATH needed for prolog/epilog and healthcheckprogram?

2014-11-10 Thread John Desantis
Moe, Thanks for the clarification! John DeSantis 2014-11-10 13:45 GMT-05:00 : > > If a system administrator manually starts a daemon, there is no telling what > its PATH might be. For security reasons, you'll need to either explicitly > set a PATH environment variabl

[slurm-dev] Explicit PATH needed for prolog/epilog and healthcheckprogram?

2014-11-10 Thread John Desantis
is there a SLURMD option which inherits the environment of the user who is running the script(s)? Thank you, John DeSantis

[slurm-dev] Re: Associations and DefaultQOS

2014-10-17 Thread John Desantis
Chris, > Hmm, could you try and mark a partition as UP with scontrol and see if > that helps? It's something we do here on Slurm 2.6 and (I believe) > resolves this for us. Thanks for the suggestion! I tried this and unfortunately, there was no change. John DeSantis 2014-10-1

[slurm-dev] Re: Associations and DefaultQOS

2014-10-16 Thread John Desantis
Hello all, Just an update to this issue. If I restart the primary slurmctld, I can avoid a service restart across the cluster. John DeSantis 2014-10-15 15:04 GMT-04:00 John Desantis : > > Hello all, > > I am not sure if I've stumbled upon a bug (14.03.6) or if this is the

[slurm-dev] Associations and DefaultQOS

2014-10-15 Thread John Desantis
44 0 rack-5-[16-19] The default qos is now respected. Is a restart of the slurmd/slurmctld daemons necessary and just undocumented or is this a potential bug? Thank you, John DeSantis
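For reference, the default QOS is set and checked along these lines (account name hypothetical):

  sacctmgr modify account somegroup set DefaultQOS=standard
  sacctmgr show assoc format=account,user,defaultqos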

[slurm-dev] Re: Checking on array jobs within slurm accounting DB and via sacct

2014-09-26 Thread John Desantis
Danny, We can wait for the production version. Thanks! John DeSantis 2014-09-26 14:27 GMT-04:00 Danny Auble : > > Depending to what commit you upgrade to yes, anything in 14.03 is in 14.11. > Right now I wouldn't suggest on running 14.11 in production since it is > still

[slurm-dev] Re: Checking on array jobs within slurm accounting DB and via sacct

2014-09-26 Thread John Desantis
Danny, Thank you for your response. We'll schedule an upgrade to address the issue. Could you tell me if commit 6aadcf15355dfe (introduced in 14.03.4) will still be present? John DeSantis 2014-09-26 13:45 GMT-04:00 Danny Auble : > > John, this was fixed in 14

[slurm-dev] Checking on array jobs within slurm accounting DB and via sacct

2014-09-26 Thread John Desantis
e real JobId "23383" returned a result within sacct and the DB. I was able to glean node information from the scheduler and control daemon logs by looking for the JobId's listed above. I did find a previous post https://www.mail-archive.com/slurm-dev@schedmd.com/msg03344.html w

[slurm-dev] Re: Question concerning node reason "Low RealMemory"

2014-07-02 Thread John Desantis
nd due to user error (mine!), I didn't configure them properly. John DeSantis 2014-07-02 14:17 GMT-04:00 Michael Robbert : > John, > Did you find and read this thread from 2011 that appears to discuss this > issue? > > http://comments.gmane.org/gmane.comp.distributed.slurm.dev

[slurm-dev] Re: Question concerning node reason "Low RealMemory"

2014-07-02 Thread John Desantis
s' state to "IDLE". I'll make sure to review the slurmd.log first before posting any more questions, should they arise! John DeSantis 2014-07-02 14:09 GMT-04:00 E V : > > Did you check the slurmd.log on the node's and make sure the > RealMemory for them on start up is l

[slurm-dev] Question concerning node reason "Low RealMemory"

2014-07-02 Thread John Desantis
CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=12929 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon NodeName=sanitized_hostname[2-3] CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=10909 Feature=ib_ddr,ib_ofa,sse4,sse41,tpa,cpu_xeon Thanks for any help and/or insight! John DeSantis
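One quick way to see the values a node actually detects (standard slurmd option) is:

  slurmd -C   # prints the node's CPUs/Sockets/RealMemory line for slurm.conf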