[slurm-dev] Re: special job error state

2013-09-19 Thread Christopher Samuel

Hi Stu,

On 19/09/13 17:19, Stu Midgley wrote:

> SGE has a special job error state of 100 (i.e. exit 100) which puts
> the job in E state in the queue.

The first talk of the day at the Slurm User Group today was on the fault
tolerance support coming in future versions of Slurm, and it seems to me
that using that framework to let a job/user report a node as bad should
be possible.

The slides are here:

http://slurm.schedmd.com/SUG13/nonstop.pdf

I suspect it'd be something that would need to be explicitly enabled by a
config option, though; I reckon many sites would have conniptions if users
were able to take nodes out at random. ;-)

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] Re: slurmstepd using 100% cpu

2013-09-19 Thread Stu Midgley
Building Slurm with the Intel compilers gets 2) up to about 46MB/s.

On Fri, Sep 20, 2013 at 10:57 AM, Stu Midgley  wrote:

>  Morning
>
> I've been evaluating slurm and I like it a lot.  It puts the HP back into
> HPC :)
>
> Anyway, as a simple test, I simulated a process that we will be doing a lot
>
>   1)  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null
>
>   2)  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- cat > /dev/null
>
> Now, for 1) I see slurmstepd running at about 70% cpu utilisation on the
> cluster node and get about 110MB/s transfer speeds, which is awesome out of
> a single gig link.
>
> BUT for 2) I see slurmstepd at 100% cpu utilisation and get 31MB/s
> transfer speeds.  I suspect that slurmstepd is the bottleneck... anything
> I can do to speed it up?
>
> Does slurmstepd use double buffering for its copies?
>
> Thanks.
>
> --
> Dr Stuart Midgley
> sdm...@sdm900.com
>



-- 
Dr Stuart Midgley
sdm...@sdm900.com


[slurm-dev] slurmstepd using 100% cpu

2013-09-19 Thread Stu Midgley
Morning

I've been evaluating slurm and I like it a lot.  It puts the HP back into
HPC :)

Anyway, as a simple test, I simulated a process that we will be doing a lot

  1)  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null

  2)  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- cat > /dev/null

Now, for 1) I see slurmstepd running at about 70% cpu utilisation on the
cluster node and get about 110MB/s transfer speeds, which is awesome out of
a single gig link.

BUT for 2) I see slurmstepd at 100% cpu utilisation and get 31MB/s transfer
speeds.  I suspect that slurmstepd is the bottleneck... anything I can do
to speed it up?

Does slurmstepd use double buffering for its copies?
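
A rough way to narrow it down, in case it helps (the block sizes below are
arbitrary guesses, not known-good values):

  # same stream, but give the remote reader an explicit large block size,
  # to see whether the read size on the far end changes slurmstepd's load
  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null bs=1024k

  # and, for comparison, force small reads on the remote side
  dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null bs=4k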

Thanks.

-- 
Dr Stuart Midgley
sdm...@sdm900.com


[slurm-dev] Problems with priority multifactor being ignored.

2013-09-19 Thread Alan V. Cowles
Hey guys,

Hopefully this is an easy one that others have encountered: do any of the
multifactor priority factors trump the others once they are maxed out?

We are running slurm 2.5.4 on a cluster with 640 available slots.

We currently have fairshare set to 5000 (counting down to 0), age at 0
(counting up to 3000), and the partition priority the same for everyone, at 8000.
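
For reference, I believe the relevant slurm.conf lines look roughly like this
(a sketch matching the numbers above; PriorityMaxAge is from memory, not
copied from our config):

  PriorityType=priority/multifactor
  PriorityWeightFairshare=5000
  PriorityWeightAge=3000
  PriorityWeightPartition=8000
  PriorityWeightJobSize=0
  PriorityWeightQOS=0
  # age only contributes its full 3000 once a job has waited PriorityMaxAge
  PriorityMaxAge=7-0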

In our example case we are back to our classic problem user, who submits
thousands of jobs to the default partition and walks away for a week. She
takes all of the immediately available slots, and the rest of her jobs are
queued. Her fairshare value drops, and as these are lengthy jobs, her age
factor climbs...

She hits her maxed value of 11000 (8000 + 3000 + 0) for her jobs waiting in the 
queue.

A new user comes in and submits to the same partition... their jobs should
come in with a higher priority by default, simply because their factors sum
to 8000 for partition, 5000 for fairshare, and 0 for age, so 13000.

And yet we are seeing the jobs at 11000 still jumping ahead of the
higher-priority jobs and running...

Is there perhaps something about maxed-out priority values jumping the
queue, or what exactly are we missing here?

Sample output from sprio -l:

   JOBID   USER  PRIORITY    AGE  FAIRSHARE  JOBSIZE  PARTITION   QOS  NICE
  202545  bem28     11000   3000          0        0       8000     0     0
  202546  bem28     11000   3000          0        0       8000     0     0
  202547  bem28     11000   3000          0        0       8000     0     0
  202548  bem28     11000   3000          0        0       8000     0     0
  202549  bem28     11000   3000          0        0       8000     0     0
  202550  bem28     11000   3000          0        0       8000     0     0
  202551  bem28     11000   3000          0        0       8000     0     0
  202552  bem28     11000   3000          0        0       8000     0     0
  202553  bem28     11000   3000          0        0       8000     0     0
  202554  bem28     11000   3000          0        0       8000     0     0
  202555  bem28     11000   3000          0        0       8000     0     0
  202556  bem28     11000   3000          0        0       8000     0     0
  202653  bem28     11000   3000          0        0       8000     0     0
  203965  ter18     12862    402       4460        0       8000     0     0
  203967  ter18     12862    402       4460        0       8000     0     0
  203969  ter18     12861    402       4460        0       8000     0     0
  203971  ter18     12861    402       4460        0       8000     0     0
  203973  ter18     12861    402       4460        0       8000     0     0
  203975  ter18     12861    402       4460        0       8000     0     0
  203977  ter18     12861    402       4460        0       8000     0     0
  203979  ter18     12861    402       4460        0       8000     0     0
  203981  ter18     12861    402       4460        0       8000     0     0


In the example, his jobs have been waiting for about 7 hours, so he has an
age factor in play too... but as of a few minutes ago, the first user's jobs
were still jumping ahead of the second user's. So there is something we are
missing; we just don't know what.

Sample output of squeue:

 197043    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197044    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197045    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197046    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197047    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197048    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197049    lowmem full_per    bem28  PD   0:00  1 (Priority)
 197050    lowmem full_per    bem28  PD   0:00  1 (Priority)
 196887    lowmem full_per    bem28   R   3:10  1 hardac-node01-1
 196888    lowmem full_per    bem28   R   3:10  1 hardac-node04-1
 196886    lowmem full_per    bem28   R   3:19  1 hardac-node07-2
 196885    lowmem full_per    bem28   R   7:04  1 hardac-node06-2
 196884    lowmem full_per    bem28   R  11:49  1 hardac-node06-1
 196883    lowmem full_per    bem28   R  13:40  1 hardac-node03-3


Thoughts from any other slurm users would be greatly appreciated.

AC


[slurm-dev] Re: restricting submit-hosts to a chosen few

2013-09-19 Thread Morris Jette


Morris Jette  wrote:
>See configuration parameter AllocNodes.
>
>Lech Nieroda  wrote:
>>
>>Dear list,
>>
>>I'm looking for a way to specify which nodes are allowed to submit
>>jobs. By default all nodes (compute nodes and frontends) may submit
>>jobs with sbatch or salloc, but I'd like to restrict that privilege to
>>the frontends only. 
>>We already have a job_submit_lua script running, but I haven't seen an
>>attribute there that would specify the submit host.
>>Any ideas?
>>
>>Regards,
>>Lech
>>
>>--
>>Dipl.-Wirt.-Inf. Lech Nieroda
>>Regionales Rechenzentrum der Universität zu Köln (RRZK)
>
>-- 
>Sent from my Android phone with K-9 Mail. Please excuse my brevity.

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

[slurm-dev] update job requirements

2013-09-19 Thread Ulf Markwardt

Dear all,

here is my job:

$ scontrol show job=3106554
JobId=3106554 Name=uptime
   UserId=mark(19423) GroupId=swtest(50147)
   Priority=99685 Account=swtest QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2013-09-19T21:51:59 EligibleTime=2013-09-19T21:51:59
   StartTime=2013-09-20T14:08:18 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=all AllocNode:Sid=tauruslogin1:5277
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=10M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/usr/bin/uptime
   WorkDir=/home/h3/mark

I want to lower the memory requirement and get:
$ scontrol update job=3106554 MinMemoryNode=1000
slurm_update error: Access/permission denied

This works for root, but not for the job owner. Is this intended behaviour?


Thanks,
Ulf

--
___
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany





[slurm-dev] Re: restricting submit-hosts to a chosen few

2013-09-19 Thread Morris Jette
See configuration parameter AllocNodes.
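
Something along these lines in slurm.conf (partition and host names below are
just placeholders):

  # only the listed hosts may create allocations in this partition
  PartitionName=batch Nodes=node[001-100] AllocNodes=frontend1,frontend2 Default=YES State=UP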

Lech Nieroda  wrote:
>
>Dear list,
>
>I'm looking for a way to specify which nodes are allowed to submit
>jobs. By default all nodes (compute nodes and frontends) may submit
>jobs with sbatch or salloc, but I'd like to restrict that privilege to
>the frontends only. 
>We already have a job_submit_lua script running, but I haven't seen an
>attribute there that would specify the submit host.
>Any ideas?
>
>Regards,
>Lech
>
>--
>Dipl.-Wirt.-Inf. Lech Nieroda
>Regionales Rechenzentrum der Universität zu Köln (RRZK)

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

[slurm-dev] Re: special job error state

2013-09-19 Thread Christopher Samuel

G'day Stu! ;-)

On 19/09/13 17:19, Stu Midgley wrote:

> SGE has a special job error state of 100 (i.e. exit 100) which puts
> the job in E state in the queue.  The job leaves the allocated
> node(s) and goes back into the queue in E state.  This means we can
> easily know which jobs have failed, look at their log, fix the
> problem (usually a system problem - like an unmounted file system
> or crashed ypbind) and then clear the error and the job goes back
> into Q state.

I can't comment on the special exit status, but we make much use of
the health check within Slurm (and Torque before it) to spot system
issues and mark nodes as DRAIN if we see something wrong.

With Torque we would run the health check scripts from cron, and pbs_mom
would just run a script that cat'd the file the cron job produced (in
/dev/shm) to avoid any blocking. For Slurm we've ported that across
directly, except that the script invoked by slurmd now uses scontrol
instead of cat to knock the node offline (or back online, if the checks
pass and it was an automatic check that took it offline last).
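
Roughly like this (a minimal sketch, not our actual script; the status file
path and the Reason strings are made up):

  #!/bin/bash
  # invoked by slurmd as the HealthCheckProgram; the real checks run from
  # cron and leave their verdict in a file on /dev/shm so this never blocks
  STATUS_FILE=/dev/shm/node_health
  NODE=$(hostname -s)

  if grep -q ERROR "$STATUS_FILE" 2>/dev/null; then
      scontrol update NodeName=$NODE State=DRAIN \
          Reason="healthcheck: $(head -1 $STATUS_FILE)"
  else
      # only bring the node back if it was the health check that drained it
      if scontrol show node $NODE | grep -q 'Reason=healthcheck'; then
          scontrol update NodeName=$NODE State=RESUME
      fi
  fi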

Works well for us and may help your situation.

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



[slurm-dev] restricting submit-hosts to a chosen few

2013-09-19 Thread Lech Nieroda

Dear list,

I'm looking for a way to specify which nodes are allowed to submit jobs. By 
default all nodes (compute nodes and frontends) may submit jobs with sbatch or 
salloc, but I'd like to restrict that privilege to the frontends only. 
We already have a job_submit_lua script running, but I haven't seen an 
attribute there that would specify the submit host.
Any ideas?

Regards,
Lech

--
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)

[slurm-dev] array jobs and --dependency

2013-09-19 Thread Loong, Andreas

Hello,

It seems like job dependency on array jobs doesn't work quite as I
expected.

If I submit an array job with 10 elements, and then a separate "collect my
data" job with a dependency=afterok:<array-job-id>, the collect job starts
when any one of the array tasks finishes, and thus collects just part
of the result.
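
For concreteness, the pattern is roughly this (script names are placeholders):

  # submit the 10-element array and capture its job id
  ARRAY_ID=$(sbatch --array=0-9 worker.sh | awk '{print $NF}')

  # collection step, intended to wait for *all* of the array tasks
  sbatch --dependency=afterok:$ARRAY_ID collect.sh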

However, if I query Slurm for the actual JOBIDs of the array tasks and put
those into --dependency, it works. That presents another problem with large
array jobs though - there's a 1024-character limit for --dependency, as we
saw in the log messages:
slurmctld[9962]: job_create_request: strlen(dependency) too big (1402 > 1024)

So I'm wondering: should it work to use the parent array job ID as the
dependency?

Can the dependency list use ranges, and if yes, is there a way to query
slurm for the proper jobid range for an array job? This is to try and
get around the 1024 character limit.

I have scripts that will reproduce the above, if that would be of interest.

With best regards,
Andreas Loong


[slurm-dev] RE: can't make "sacct"

2013-09-19 Thread Sivasangari Nandy

I got this error when I try sacct on a job ID: "SLURM accounting storage is
disabled".

Here is my slurmctld log file (tail -f /var/log/slurm-llnl/slurmctld.log):



[2013-09-19T10:58:20] sched: _slurm_rpc_allocate_resources JobId=179 NodeList=VM-669 usec=70
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.0 VM-669 usec=197
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.1 VM-669 usec=187
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.1 usec=17
[2013-09-19T10:59:00] completing job 179
[2013-09-19T10:59:00] sched: job_complete for JobId=179 successful
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.0 usec=8




And my conf file (/etc/slurm-llnl/slurm.conf):



# slurm.conf file generated by configurator.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=VM-667 
ControlAddr=192.168.2.26 
#BackupController= 
#BackupAddr= 
# 
AuthType=auth/munge 
CacheGroups=0 
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge 
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog= 
#PrologSlurmctld= 
#FirstJobId=1 
#MaxJobId=99 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey= 
#JobCredentialPublicCertificate= 
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=4 
#MaxTasksPerNode=128 
MpiDefault=none 
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid 
#Prolog= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1 
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SrunEpilog= 
#SrunProlog= 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
#TaskEpilog= 
TaskPlugin=task/none 
#TaskPluginParam= 
#TaskProlog= 
#TopologyPlugin=topology/tree 
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0 
KillWait=30 
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300 
#OverTimeLimit=0 
SlurmctldTimeout=120 
SlurmdTimeout=300 
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0 
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1 
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill 
SchedulerPort=7321 
SelectType=select/linear 
#SelectTypeParameters= 
# 
# 
# JOB PRIORITY 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost= 
#AccountingStorageLoc= 
#AccountingStoragePass= 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
#AccountingStorageUser= 
AccountingStoreJobComment=YES 
ClusterName=cluster 
#DebugFlags= 
#JobCompHost= 
#JobCompLoc= 
#JobCompPass= 
#JobCompPort= 
JobCompType=jobcomp/none 
#JobCompUser= 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
SlurmdDebug=3 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP
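
I guess these are the relevant lines - would changing them to something like
the following be enough to get sacct going? (Just a sketch, not tested; the
filetxt location is a guess, and slurmdbd with MySQL would be the other
option.)

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/accounting
JobAcctGatherType=jobacct_gather/linux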




- Original message -


De: "Nancy Kritkausky"  
À: "slurm-dev"  
Envoyé: Mercredi 18 Septembre 2013 18:29:56 
Objet: [slurm-dev] RE: can't make "sacct" 



Hello Siva, 
There is not a lot of information to go on from your email. What type of 
accounting do you have configured? What do your slurm.conf and slurmdbd.conf
files look like? I would also suggest looking at your slurmdbd.log and
slurmd.log to see what is going on, or sending them to the dev list. 
Nancy 



From: Sivasangari Nandy [mailto:sivasangari.na...@irisa.fr] 
Sent: Wednesday, September 18, 2013 9:02 AM 
To: slurm-dev 
Subject: [slurm-dev] can't make "sacct" 


Hi, 



Hey, does anyone know why my "sacct" command doesn't work?

I got this : 

root@VM-667:/omaha-beach/workflow# sacct 



JobID JobName Partition Account AllocCPUS State Exit

[slurm-dev] special job error state

2013-09-19 Thread Stu Midgley
We are looking at moving from SGE (old) to SLURM for our production
clusters.  We make heavy use of task arrays and the special job error state
of 100.

SLURM now has task arrays, which is great, but as far as I can see, doesn't
support the job error state of 100 (or equivalent).  Is this
planned/available?  Can we pay to have it added?

Let me explain.

SGE has a special job error state of 100 (i.e. exit 100) which puts the job
in E state in the queue.  The job leaves the allocated node(s) and goes
back into the queue in E state.  This means we can easily know which jobs
have failed, look at their log, fix the problem (usually a system problem -
like an unmounted file system or crashed ypbind) and then clear the error
and the job goes back into Q state.  It then gets rescheduled back onto the
cluster.

We use this in our batch scripts like this:

#!/bin/bash

set -o pipefail
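# with pipefail the pipeline's exit status is non-zero if any command in it fails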
command1 | command2 | command3 || exit 100

If any of the commands fail, the job ends up in E state in the queue.

Thanks


-- 
Dr Stuart Midgley
sdm...@sdm900.com