than
satisfactory scheduling decisions - we use TopologyPlugin=topology/tree to
colocate jobs to as few switches as possible. Powered off nodes wouldn't
be considered, so jobs would be scattered over multiple switches, rather
than turning on a few nodes on the same switch.
--
*Nathan Harper*
cess to an account
also a coordinator of that account.
--
*Nathan Harper* // IT Systems Lead
*e: * nathan.har...@cfms.org.uk // *t: * 0117 906 1104 // *m: * 07875 510891 //
*w: * www.cfms.org.uk <http://www.cfms.org.uk%22> // [image: Linkedin grey
icon scaled] <http://uk.linkedin.com/p
Hi,
No solution, but a 'me too'.
--
*Nathan Harper* // IT Systems Lead
There is a fork of Dashing that is still relatively current. We have a
couple of Dashing dashboards which parse squeue/scontrol/sacct output to
show SLURM information alongside other cluster info (nodes up/down, power
usage, etc.).
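
As a rough illustration of the parsing step (not the actual dashboard code), here is a
minimal Python sketch. It assumes squeue was run as
`squeue --noheader --format="%i|%T|%D"` (job ID, state, node count); the sample
text and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical sample of pipe-delimited squeue output (job ID | state | nodes).
SAMPLE_SQUEUE = """\
1001|RUNNING|4
1002|RUNNING|2
1003|PENDING|8
"""


def summarise_squeue(squeue_output):
    """Return (jobs per state, nodes per state) from pipe-delimited squeue text."""
    jobs = Counter()
    nodes = Counter()
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _job_id, state, node_count = line.split("|")
        jobs[state] += 1
        nodes[state] += int(node_count)
    return jobs, nodes


if __name__ == "__main__":
    jobs, nodes = summarise_squeue(SAMPLE_SQUEUE)
    for state in sorted(jobs):
        print(f"{state}: {jobs[state]} jobs, {nodes[state]} nodes")
```

A dashboard job could feed the real command output into the same function and
push the resulting counts to a widget.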
--
*Nathan Harper* // IT Systems Lead
Hi,
We've implemented QoS resource limits thanks to some past suggestions on
this list. However, it does seem to have broken some of our scheduling.
Jobs that are held due to QOSGrpNodeLimit have a starttime=unknown, despite
all other jobs within the same limit having end times associated with t
Hi,
We are trying to get to the bottom of some TRES limits we have in place, to
work out if it should be expected behaviour.
We have two QoS configured, 'low' and 'normal'. Normal is the default QoS
and applies limits at the association level. The low QoS has its own TRES
limits applied to it
Is there an equivalent to GridEngine's 'qquota' to view resource limit
status? For example, an account is limited to 20 nodes, how does a user
know the resource use?
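
To make the question concrete, here is a minimal Python sketch of the kind of
report being asked for. It is only an approximation built on squeue output, not a
built-in equivalent of qquota. It assumes squeue was run as
`squeue --noheader --format="%a|%u|%T|%D"` (account, user, state, node count);
the account names, the limits table and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample of pipe-delimited squeue output
# (account | user | state | nodes).
SAMPLE_ACCOUNT_USAGE = """\
projA|alice|RUNNING|8
projA|bob|RUNNING|6
projA|alice|PENDING|10
projB|carol|RUNNING|3
"""


def nodes_in_use(squeue_output):
    """Sum the node counts of RUNNING jobs per account."""
    usage = defaultdict(int)
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        account, _user, state, node_count = line.split("|")
        if state == "RUNNING":
            usage[account] += int(node_count)
    return dict(usage)


def quota_report(usage, limits):
    """Format 'used/limit' lines for each account that has a limit."""
    return [
        f"{account}: {usage.get(account, 0)}/{limits[account]} nodes in use"
        for account in sorted(limits)
    ]


if __name__ == "__main__":
    # Assumed limit from the example above: an account limited to 20 nodes.
    for row in quota_report(nodes_in_use(SAMPLE_ACCOUNT_USAGE), {"projA": 20}):
        print(row)
```

Note that summing per-job node counts would count a node twice if two jobs
share it, so this is only a rough view of usage against the limit.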
different rates.
On 25 August 2015 at 13:33, Fahad Ibrahim Alzannan
wrote:
> Hi,
>
> Actually some working nodes are delayed by around 5 mins also the down
> nodes !
> ------
> *From:* Nathan Harper [nathan.har...@cfms.org.uk]
> *Sent:* Tuesday, August
Hi - can you check that your clocks are in sync between your compute nodes
and controllers?
--
*Nathan Harper*
On 25 August 2015 at 11:51, Fahad Ibrahim Alzannan
wrote:
> Hi,
>
>
> We have a cluster and some nodes are down we tried to set them idle using
> "scontrol upda
Has anyone looked at using LXC rather than Docker specifically? From what
I understand, it's possible to run unprivileged LXC containers, so no need
to be root.
--
*Nathan Harper* // IT Systems Architect
I've looked at this in the past - however, when I use your example
(replacing with real data) I just get a 'Nothing Modified' response
--
*Nathan Harper* // IT Systems Architect
k and affecting other jobs. If a user is paying for core hours,
they won't be happy if a single core job on one of 'their' nodes slowed
their parallel job down.
I'd echo Marcin's comment: *a user could only hurt themselves when running
new buggy code.*
--
*Nathan Harpe
't terribly wasteful, but
it would be nice to use con_res, then let users choose if they want to
share jobs, but only with themselves.
--
*Nathan Harper* // IT Systems Architect
I've got a real mixture of nodes, some older than Sandy Bridge, and I find
that IPMI does a better job of getting 'whole system' power use. I was
hoping to pick up power from my GPU nodes too, but those nodes don't report
via IPMI either.
--
*Nathan Harper* // IT S
ght be
missing something obvious, but I can't find any documentation about this
--
*Nathan Harper* // IT Systems Architect
ser("your qos has been changed")
log_info("slurm_job_submit: job from uid %d, setting qos value: %s",
submit_uid, qos)
job_desc.qos = qos
end
end
--
*Nathan Harper* // IT Systems Architect
After a suggestion in another thread, I have been toying with a lua job
submit plugin, and I've achieved my original goal.
I've been sucked in by the possibilities of the plugin, and the prospect of
presenting users with a comment on job submission is an attractive one.
I've not been able to get
We're using 14.03.7 with FreeIPMI 1.4.5 and didn't have to do anything
unusual to get it built.
FreeIPMI was built from source into an RPM, then SLURM itself built into an
RPM.
--
*Nathan Harper* // IT Systems Architect
s/documentation
out there?
--
*Nathan Harper* // IT Systems Architect
1 to slots=160
limit users GroupB queues partition1 to slots=96
limit users GroupC queues partition1 to slots=64
--
*Nathan Harper* // IT Systems Architect
Hi,
Just a 'me-too' - also running 14.03.6 on compute nodes, with master nodes
running RHEL5 with -O0 and getting the same thing in the logs, so it's not
just you.
--
*Nathan Harper* // IT Systems Architect
you can set the CFLAGS variable before
./configure
--
*Nathan Harper* // IT Systems Architect
Slight update - after changing the things in slurm.conf that it is
complaining about, 'service slurm start' now doesn't throw any errors.
However, attempting to run anything gives a segfault:
# sacctmgr
Segmentation fault
--
*Nathan Harper* // IT Systems Architect
rocess configuration file
--
*Nathan Harper* // IT Systems Architect