Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Paul Edmon
We've been using a backfill priority partition for people doing HTC 
work.  We have requeue set so that jobs from the high priority 
partitions can take over.
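
A minimal slurm.conf sketch of that arrangement (partition names, node lists and 
values are only illustrative, not our actual config):

    # global preemption settings
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    # low-priority backfill partition whose jobs get requeued on preemption
    PartitionName=requeue  Nodes=compute[001-100] PriorityTier=1  PreemptMode=REQUEUE
    # high-priority partition whose jobs may take over those nodes
    PartitionName=priority Nodes=compute[001-100] PriorityTier=10 PreemptMode=OFF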


You can do this for your interactive nodes as well if you want. We 
dedicate hardware to interactive work and use partition-based QOSs to 
limit usage.
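
For the interactive hardware, a partition QOS along these lines is one way to 
cap usage (names and limits are examples only):

    sacctmgr add qos interactive
    sacctmgr modify qos interactive set MaxTRESPerUser=cpu=4 MaxWall=08:00:00
    # slurm.conf: attach it to the partition
    PartitionName=interactive Nodes=int[01-02] QOS=interactive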


-Paul Edmon-


On 05/08/2018 10:08 AM, Renfro, Michael wrote:

That’s the first limit I placed on our cluster, and it has generally worked out 
well (never used a job limit). A single account can get 1000 CPU-days in 
whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS 
for times when the cluster is mostly idle, but a few users have jobs that run 
past the TRES limit. But I really like the idea of a preemptable QOS that the 
users can put their extra jobs into on their own.
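
If anyone wants to reproduce that kind of cap, it is most likely a running-usage 
limit on the account's association, something along these lines (the account name 
is a placeholder; 1000 CPU-days is 1,440,000 CPU-minutes):

    sacctmgr modify account some_account set GrpTRESRunMins=cpu=1440000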






Re: [slurm-users] Jobs in pending state

2018-04-29 Thread Paul Edmon
It sounds like your second partition is getting scheduled primarily by 
the backfill scheduler.  I would try the partition_job_depth option, as 
otherwise the main scheduling loop only walks the queue in priority order 
and does not consider each partition separately.
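
That option goes into SchedulerParameters in slurm.conf, roughly like this (the 
depth value is just an example):

    SchedulerParameters=partition_job_depth=500
    # apply without a full restart:
    scontrol reconfigure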


-Paul Edmon-


On 4/29/2018 5:32 AM, Zohar Roe MLM wrote:

Hello.
I have two partitions in my slurm.conf:
CLUS_WORK1
server1
server2
server3

CLUS_WORK2
pc1
pc2
pc3

When I send 10,000 jobs to CLUS_WORK1 they start running fine, with a few left 
pending (which is OK).
But if I then send new jobs to CLUS_WORK2, which is idle, those jobs also sit in 
the pending state and take about 20 minutes to start running.

I couldn't find any setting/configuration that could cause this.
Is there a log I can check to see why they are pending?

Thanks.







Re: [slurm-users] Job still running after process completed

2018-04-23 Thread Paul Edmon
I would recommend putting a clean-up step in your epilog script.  We 
have a check here that sees whether the job has completed; if so, it then 
terminates all of the user's remaining processes with kill -9 to clean up any 
residuals. If that fails, it closes off the node so we can reboot it.
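
A bare-bones sketch of that kind of epilog (not our actual script; it assumes one 
job per user per node and skips the job-completion check):

    #!/bin/bash
    # kill anything the job's user left behind
    if [ -n "$SLURM_JOB_USER" ] && [ "$SLURM_JOB_USER" != "root" ]; then
        pkill -9 -u "$SLURM_JOB_USER"
    fi
    # if processes survive, drain the node so it can be rebooted
    if pgrep -u "$SLURM_JOB_USER" > /dev/null 2>&1; then
        scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="epilog cleanup failed"
    fi
    exit 0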


-Paul Edmon-


On 04/23/2018 08:10 AM, John Hearns wrote:

Nicolo, I cannot say what your problem is.
However in the past with problems like this I would

a) look at ps -eaf --forest
Try to see what the parent processes of these job processes are.
Clearly if the parent PID is 1 then --forest is not much help, but the 
--forest option is my 'goto' option


b) look closely at the slurm logs. Do not fool yourself - force 
yourself to read the logs line by line, around the timestamp when the 
job ends.



Being a bit more helpful, in my last job we had endless problems with 
Matlab jobs leaving orphaned processes.
To be fair to Matlab, they have a utility which 'properly' starts 
parallel jobs under the control of the batch system (OK, it was PBSPro).
But users can easily start a job and 'fire off' processes in Matlab 
which are not under the direct control of the batch daemon, leaving 
orphaned processes when the job ends.

Actually, if you think about it, this is how a batch system works. 
The batch system daemon starts running processes on your behalf.
When the job is killed, all the child processes of that daemon 
should die.
It is instructive to run ps -eaf --forest sometimes on a compute node 
during a normal job run. Get to know how things are being created, and 
what their parents are

(two dashes in front of the forest argument)

Now think of users who start a batch job and get a list of compute hosts.
They MAY use a mechanism such as ssh, or indeed pbsdsh, to start running 
job processes on those nodes.

You will then have trouble with orphaned processes when the job ends.
Techniques for dealing with this:
a use the PAM module which stops ssh logins (actually - this probably 
still allows ssh login during the job's lifetime, while the user has a node 
allocated) - see the sketch after this list

b my favourite - CPU sets - actually this won't stop ssh logins either.
c Shouting, much shouting. Screaming.
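
For option a, the module usually meant is pam_slurm_adopt, added to the sshd 
account stack roughly like this (a sketch; the extra options it supports vary by 
site policy):

    # /etc/pam.d/sshd
    account    required    pam_slurm_adopt.so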

Regarding users behaving like this, I have seen several cases of 
behaviour like this for understandable reasons.
On a system which I did not manage, but was asked for advice on, the 
vendor had provided a sample script for running Ansys.
The user wanted to run Abaqus on the compute nodes (or some such - a 
different application anyway).
So he started an empty Ansys job, which sat doing nothing, then took 
the list of hosts provided by the batch system and fired up an 
interactive Abaqus session from his terminal.
I honestly hesitate to label this behaviour 'wrong'.

I also have seen similar behaviour when running a CFD job.

On 23 April 2018 at 11:50, Nicolò Parmiggiani 
<nicolo.parmiggi...@gmail.com <mailto:nicolo.parmiggi...@gmail.com>> 
wrote:


Hi,

I have a job that keeps running even though the internal process
is finished.

What could be the problem?

Thank you.






Re: [slurm-users] Time-based partitions

2018-03-12 Thread Paul Edmon
You could probably accomplish this using a job_submit Lua script and 
some carefully crafted QOSs.  It would take some doing, but I imagine it could work.
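
One rough shape this could take - the Lua routing itself is not shown, and all 
QOS names, times and limits below are purely illustrative:

    # QOS setup: night jobs may preempt day jobs (needs PreemptType=preempt/qos)
    sacctmgr add qos day
    sacctmgr add qos night
    sacctmgr modify qos night set Priority=1000 Preempt=day
    # a cron job can open/close the night partition at the window boundaries:
    0 18 * * *  scontrol update PartitionName=night State=UP
    0 7  * * *  scontrol update PartitionName=night State=DOWN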


-Paul Edmon-


On 03/12/2018 02:46 PM, Keith Ball wrote:

Hi All,

We are looking to have time-based partitions, e.g. a "day" and a "night" 
partition (using the same group of compute nodes).


1.) For a “night” partition, jobs will only be allocated resources once 
the “night-time” window is reached (e.g. 6pm – 7am). Ideally, the jobs 
in the “night” partition would also have higher priority during this 
window (so that they would preempt jobs in the "day" partition that 
were still running, if there were resource contention).


2.) During the “day-time” window (7am-6pm), jobs in the “day” queue 
can be allocated resources, and have higher priority than jobs in the 
“night” partition (that way, preemptive scheduling can occur if there 
is resource contention).


I have so far not seen a way to define a run or allocation time window 
for partitions. Are there such options? What is the best (and 
hopefully least convoluted) way to achieve the scheduling behavior as 
described above in Slurm?


Thanks,
  Keith




Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Paul Edmon
Yeah, I've found that in those situations it helps to have people wrap their 
threaded programs in srun inside of sbatch.  That way the scheduler 
knows which process specifically gets the threading.
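
In practice that pattern looks something like this (the program name and counts 
are placeholders):

    #!/bin/bash
    #SBATCH -n 4              # 4 tasks (ranks)
    #SBATCH -c 8              # 8 CPUs (threads) per task
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./my_threaded_app    # srun launches the tasks, so affinity is applied per task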


-Paul Edmon-


On 02/22/2018 10:39 AM, Loris Bennett wrote:

Hi Paul,

Paul Edmon <ped...@cfa.harvard.edu> writes:


At least from my experience wonky things can happen with slurm
(especially if you have thread affinity on) if you don't rightly
divide between -n and -c.  In general I've been telling our users that
-c is for threaded applications and -n is for rank based parallelism.
This way the thread affinity works out properly.

Actually we do have an issue with some applications not respecting the
CPU mask.  I always assumed it was something to do with the way the
multithreading was programmed in certain applications, but maybe we
should indeed be getting the users to use multiple CPUs with a single
task.

Thanks for the info.

Cheers,

Loris






Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-22 Thread Paul Edmon
Typically the long db upgrades are only for major version upgrades.  
Most of the time minor versions don't take nearly as long.


At least with our upgrade from 17.02.9 to 17.11.3, the conversion only took 
1.5 hours with six months' worth of jobs (about 10 million jobs).  We don't 
track energy usage, though, so perhaps we avoided that particular query 
because of that.


From past experience these major upgrades can take quite a bit of time 
as they typically change a lot about the DB structure in between major 
versions.
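
For anyone planning one of those major jumps, the general shape of a careful 
attempt is something like this (database and service names are the defaults; 
adjust to taste):

    mysqldump --single-transaction slurm_acct_db > slurm_acct_db_backup.sql
    systemctl stop slurmdbd
    # upgrade the slurmdbd package, then run the conversion in the foreground
    # so you can watch its progress:
    slurmdbd -D -vvv
    # only after the conversion finishes, upgrade/restart slurmctld and the slurmds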


-Paul Edmon-

On 02/22/2018 06:17 AM, Malte Thoma wrote:

FYI:
* We aborted our upgrade from 17.02.1-2 to 17.11.2 after about 18 h.
* Dropped the job table ("truncate xyz_job_table;")
* Executed the everlasting SQL command by hand on a back-up database
* Meanwhile we did the Slurm upgrade (fast)
* Reset the first job ID to a high number
* Inserted the converted table into the real database again.

It took two experts to do this, and we would very much appreciate a better 
upgrade concept!
In fact, we hesitate to upgrade from 17.11.2 to 17.11.3 because we are 
afraid of similar problems. Does anyone have experience with this?


It would be good to know whether there is ANY chance that future upgrades 
will cause the same problems, or whether this will get better.


Regards,
Malte






On 22.02.2018 at 01:30, Christopher Benjamin Coffey wrote:
This is great to know, Kurt. We can't be the only folks running into 
this... I wonder if the mysql update code gets into a deadlock or 
something. I'm hoping a slurm dev will chime in ...


Kurt, out of band if need be, I'd be interested in the details of 
what you ended up doing.


Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" 
<slurm-users-boun...@lists.schedmd.com on behalf of k...@sciops.net> 
wrote:


 On Wed, Feb 21, 2018 at 11:56:38PM +, Christopher Benjamin Coffey wrote:

 > Hello,
 >
 > We have been trying to upgrade slurm on our cluster from 16.05.6 to
 > 17.11.3. I'm thinking this should be doable? Past upgrades have been a
 > breeze, and I believe during the last one, the db upgrade took like 25
 > minutes. Well now, the db upgrade process is taking far too long. We
 > previously attempted the upgrade during a maintenance window and the
 > upgrade process did not complete after 24 hrs. I gave up on the upgrade
 > and reverted the slurm version back by restoring a backup db.

 We hit this on our try as well: upgrading from 17.02.9 to 17.11.3.  We
 truncated our job history for the upgrade, and then did the rest of the
 conversion out-of-band and re-imported it after the fact.  It took us
 almost sixteen hours to convert a 1.5 million-job store.

 We got hung up on precisely the same query you did, on a similarly hefty
 machine.  It caused us to roll back an upgrade and try again during our
 subsequent maintenance window with the above approach.

 khm








Re: [slurm-users] restrict application to a given partition

2018-01-15 Thread Paul Edmon

This sounds like a use case for Singularity.

http://singularity.lbl.gov/

You could use a Lua job_submit script to restrict what is permitted to run by 
barring anything that isn't a specific Singularity script.  Alternatively, you 
could use prolog scripts as an emergency fallback in case the Lua script 
doesn't catch something.
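
If a purely configuration-level fence is acceptable, the coarser alternative is 
to gate the partition itself rather than the binary - this restricts who, not 
what, runs there (group, node and partition names below are examples):

    PartitionName=big   Nodes=node01 MaxCPUsPerNode=16 AllowGroups=softwarex_users
    PartitionName=small Nodes=node01 MaxCPUsPerNode=4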


-Paul Edmon-

On 1/15/2018 8:31 AM, John Hearns wrote:

Juan, my knee-jerk reaction is to say 'containerisation' here.
However I guess that means that Slurm would have to be able to inspect 
the contents of a container, and I do not think that is possible.

I may be very wrong here. Anyone?


However, have a look at the XALT stuff from TACC:
https://www.tacc.utexas.edu/research-development/tacc-projects/xalt
https://github.com/Fahey-McLay/xalt


XALT is intended to instrument your cluster and collect information on 
what software is being run and exactly what libraries are being used.
I do not think it has any options for "Nope! You may not run this 
executable on this partition".

However, it might be worth contacting the authors and discussing this.





On 15 January 2018 at 14:20, Juan A. Cordero Varelaq 
<bioinformatica-i...@us.es <mailto:bioinformatica-i...@us.es>> wrote:


But what if the user knows the path to such an application (let's say
the python command) and executes it on the partition he/she should
not be allowed to use? Is it possible through Lua scripts to set
constraints on software usage, such as a limited shell, for instance?

In fact, what I'd like to implement is something like a limited
shell, on a particular node, for a particular partition and a
particular program.



    On 12/01/18 17:39, Paul Edmon wrote:

You could do this using a job_submit.lua script that inspects
for that application and routes them properly.

-Paul Edmon-


On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote:

Dear Community,

I have a node (20 Cores) on my HPC with two different
partitions: big (16 cores) and small (4 cores). I have
installed software X on this node, but I want only one
partition to have rights to run it.
Is it then possible to restrict the execution of a
specific application to a given partition on a given node?

Thanks










Re: [slurm-users] Changing resource limits while running jobs

2018-01-04 Thread Paul Edmon
Typically changes like this only impact pending or newly submitted 
jobs.  Running jobs usually are not impacted, though they will count 
against any new restrictions that you put in place.
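
For reference, the moving parts involved look roughly like this (the QOS name and 
value are examples only):

    sacctmgr modify qos normal set MaxWall=30-00:00:00
    # slurm.conf:
    AccountingStorageEnforce=qos,safe
    # then restart slurmctld; running jobs keep running but count toward the new limits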


-Paul Edmon-


On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote:


Hi,


A couple of jobs have been running for almost one month and I would 
like to change resource limits to prevent users from running for so 
long. Besides, I'd like to set AccountingStorageEnforce to qos,safe. 
If I make these changes, would the running jobs be stopped (the user 
running the jobs still has no account and therefore should not be 
allowed to run anything if AccountingStorageEnforce is set)?



Thanks





Re: [slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Paul Edmon
I've seen this happen when there are internode communication issues 
which disrupt the tree that Slurm uses to talk to the nodes and do 
heartbeats.  We have this happen occasionally in our environment, as we 
have nodes at two geographically separate facilities and the latency is 
substantial, so the lag crossing back and forth can add up.  I would 
check that all your nodes can talk to each other and to the master, and 
that your timeouts are set high enough.
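
A quick way to check what you currently have (the parameter names are real; the 
values below are just examples of more generous settings):

    scontrol show config | grep -Ei 'SlurmdTimeout|MessageTimeout|TreeWidth'
    # in slurm.conf, e.g.:
    #   MessageTimeout=30
    #   SlurmdTimeout=600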


-Paul Edmon-


On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:

I have a number of nodes that have, after our transition to Centos 7.3/SLURM 17.02, begun to 
occasionally display a status of "Not responding". The health check we run on each node 
every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to 
"idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after 
going back online.

Has anyone on this list seen similar behavior? I have increased logging to 
debug/verbose, but have seen no errors worth noting.

Cheers,

Alden






Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread Paul Edmon
Okay, I didn't see any note on the PMIx 2.1 page about which versions of Slurm 
it was compatible with, so I assumed all of them.  My bad.  Thanks for 
the correction and the help.  I just naively used the RPM spec that was 
packaged with PMIx, which does enable the legacy support.  It seems best 
then to let PMIx handle pmix solely and let Slurm handle the rest.  Thanks!


Am I right in reading that you don't have to build Slurm against PMIx?  
So it just interoperates with it fine if you have it installed and 
specify pmix as the launch option?  That's neat.


-Paul Edmon-


On 11/28/2017 6:11 PM, Philip Kovacs wrote:
Actually, if you're set on installing pmix/pmix-devel from the RPMs and 
then configuring Slurm manually, you could just move the pmix-installed 
versions of libpmi.so* and libpmi2.so* to a safe place, configure and 
install Slurm (which will drop in its versions of those libs), and then 
either use the Slurm versions or move the pmix versions of libpmi and 
libpmi2 back into place in /usr/lib64.



On Tuesday, November 28, 2017 5:32 PM, Philip Kovacs 
<pkde...@yahoo.com> wrote:



The issue is that PMIx 2.0+ provides a "backward compatibility" 
feature, enabled by default, which installs both libpmi.so and 
libpmi2.so in addition to libpmix.so.  The route with the least friction 
for you would probably be to uninstall pmix, then install Slurm 
normally, letting it install its libpmi and libpmi2.  Next, configure 
and compile a custom PMIx with that backward-compatibility feature 
_disabled_, so it only installs libpmix.so.  Slurm will "see" the pmix 
library after you install it and load it via its plugin when you use 
--mpi=pmix.  Again, just use the Slurm pmi and pmi2 and install pmix 
separately with the backward-compatibility option disabled.

There is a packaging issue here in which two packages are trying to 
install their own versions of the same files.  That should be brought to 
the attention of the packagers.  Meantime you can work around it.


For PMIX:

./configure --disable-pmi-backward-compatibility // ... etc ...
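
Putting that together, the sequence is roughly as follows (prefixes and the 
application name are examples):

    # PMIx, with the compatibility libs disabled:
    ./configure --prefix=/usr --disable-pmi-backward-compatibility && make install
    # Slurm, pointed at that installation:
    ./configure --with-pmix=/usr && make install
    # jobs then select the plugin at launch time:
    srun --mpi=pmix ./my_mpi_app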



On Tuesday, November 28, 2017 4:44 PM, Artem Polyakov 
<artpo...@gmail.com> wrote:



Hello, Paul

Please see below.

2017-11-28 13:13 GMT-08:00 Paul Edmon <ped...@cfa.harvard.edu 
<mailto:ped...@cfa.harvard.edu>>:


So in an effort to future proof ourselves we are trying to build
Slurm against PMIx, but when I tried to do so I got the following:

Transaction check error:
  file /usr/lib64/libpmi.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64
  file /usr/lib64/libpmi2.so from install of slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from package pmix-2.0.2-1.el7.centos.x86_64

This is with compiling Slurm with the --with-pmix=/usr option. A
few things:

1. I'm surprised when I tell it to use PMIx it still builds its
own versions of libpmi and pmi2 given that PMIx handles that now.


PMIx is a plugin, and from multiple perspectives it makes sense to keep 
the other versions available (e.g. for backward compatibility or performance 
comparison).



2. Does this mean I have to install PMIx in a nondefault
location?  If so how does that work with user build codes?  I'd
rather not have multiple versions of PMI around for people to
build against.

When we introduced PMIx it was in the beta stage and we didn't want to 
build against it by default. Now it probably makes sense to assume 
--with-pmix by default.
I'm also thinking that we might need to solve it at the packaging 
level by distributing a "slurm-pmix" package that is built against, and 
depends on, the pmix package currently shipped with a particular Linux 
distro.



3.  What is the right way of building PMIx and Slurm such that
they interoperate properly?

For now it is better to have PMIx installed in a well-known 
location, and then build your MPIs or other apps against that PMIx 
installation.
Starting (I think) from PMIx v2.1 we will have cross-version support 
that will give some flexibility about which installation to use with an 
application.



Suffice it to say little to no documentation exists on how to do
this properly, so any guidance would be much appreciated.

Indeed we have some problems with the documentation, as the PMIx technology 
is relatively new. Hopefully we can fix this in the near future.
Being the original developer of the PMIx plugin, I'll be happy to 
answer any questions and help resolve the issues.





-Paul Edmon-






--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov








[slurm-users] PMIx and Slurm

2017-11-28 Thread Paul Edmon
So in an effort to future proof ourselves we are trying to build Slurm 
against PMIx, but when I tried to do so I got the following:


Transaction check error:
  file /usr/lib64/libpmi.so from install of 
slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from 
package pmix-2.0.2-1.el7.centos.x86_64
  file /usr/lib64/libpmi2.so from install of 
slurm-17.02.9-1fasrc02.el7.centos.x86_64 conflicts with file from 
package pmix-2.0.2-1.el7.centos.x86_64


This is with compiling Slurm with the --with-pmix=/usr option.  A few 
things:


1. I'm surprised when I tell it to use PMIx it still builds its own 
versions of libpmi and pmi2 given that PMIx handles that now.


2. Does this mean I have to install PMIx in a nondefault location?  If 
so how does that work with user build codes?  I'd rather not have 
multiple versions of PMI around for people to build against.


3.  What is the right way of building PMIx and Slurm such that they 
interoperate properly?


Suffice it to say little to no documentation exists on how to do this 
properly, so any guidance would be much appreciated.


-Paul Edmon-



