[slurm-dev] Re: special job error state
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Stu, On 19/09/13 17:19, Stu Midgley wrote: > SGE has a special job error state of 100 (ie. exit 100) which puts > the job in E state in the queue. The first talk of the day today at the Slurm User Group was on fault tolerance coming in future versions of Slurm and it seems to me that using that framework to allow a job/user to report a node as bad should be possible. The slides are here: http://slurm.schedmd.com/SUG13/nonstop.pdf I suspect it'd be something that would need to be explicitly enabled by a config option though, I reckon many sites would have conniptions if users were able to take nodes out at random. ;-) cheers, Chris - -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlI7xNAACgkQO2KABBYQAh9fAACdEgLQXJILOxU2o+e0mhsgVIvu CgEAn1f1qmJfOSKB+b3IHa5ulUUr4s+s =2SHx -END PGP SIGNATURE-
[slurm-dev] Re: slurmstepd using 100% cpu
building slurm with the intel compilers gets 2) to about 46MB/s On Fri, Sep 20, 2013 at 10:57 AM, Stu Midgley wrote: > Morning > > I've been evaluating slurm and I like it a lot. It puts the HP back into > HPC :) > > Anyway, as a simple test, I simulated a process that we will be doing a lot > > 1) dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null > > 2) dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- cat > /dev/null > > Now, for 1) I see slurmstepd running at about 70% cpu utilisation on the > cluster node and get about 110MB/s transfer speeds, which is awesome out of > a single gig link. > > BUT for 2) I see slurmstepd at 100% cpu utilisation and get 31MB/s > transfer speeds. I suspect that slurmstepd is the bottle neck... anything > I can do to speed it up? > > does slurmstepd using double-buffering copies? > > Thanks. > > -- > Dr Stuart Midgley > sdm...@sdm900.com > -- Dr Stuart Midgley sdm...@sdm900.com
[slurm-dev] slurmstepd using 100% cpu
Morning I've been evaluating slurm and I like it a lot. It puts the HP back into HPC :) Anyway, as a simple test, I simulated a process that we will be doing a lot 1) dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- dd of=/dev/null 2) dd if=/dev/zero bs=1024k | pv | srun -N1 -n1 -c1 -- cat > /dev/null Now, for 1) I see slurmstepd running at about 70% cpu utilisation on the cluster node and get about 110MB/s transfer speeds, which is awesome out of a single gig link. BUT for 2) I see slurmstepd at 100% cpu utilisation and get 31MB/s transfer speeds. I suspect that slurmstepd is the bottle neck... anything I can do to speed it up? does slurmstepd using double-buffering copies? Thanks. -- Dr Stuart Midgley sdm...@sdm900.com
[slurm-dev] Problems with priority multifactor being ignored.
Hey guys, Hopefully this is an easy one that maybe others have encountered, we are curious if any of the multi-factor priority plugins have trump values over others if they are maxed out? We are running slurm 2.5.4 on a cluster with 640 available slots. We currently have fairshare set to 5000, counting down to 0, age at 0 counting up to 3000, and partition priority same for everyone on partitions at 8000. In our example case we are back to our classic problem user that submits thousands of jobs to the default partition, and walks away for a week. She takes all of the slots immediately available, and the rest of her jobs are queued. Her fairshare value drops and as these are lengthy jobs, her age increments up... She hits her maxed value of 11000 (8000 + 3000 + 0) for her jobs waiting in the queue. A new user comes in, submits to the local parition as well... should come in with a higher priority by default simply based on the idea that their values summed are 8000 for partition, 5000 for fairshare, and 0 for age, so 13000. And yet we are seeing the jobs at 11000 still jumping the higher priority jobs and running... We thought perhaps there may be something about maxed out priority values jumping the queue, or what exactly we are missing here? Sample output from sprio -l: JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE 202545 bem28 11000 3000 0 0 8000 0 0 202546 bem28 11000 3000 0 0 8000 0 0 202547 bem28 11000 3000 0 0 8000 0 0 202548 bem28 11000 3000 0 0 8000 0 0 202549 bem28 11000 3000 0 0 8000 0 0 202550 bem28 11000 3000 0 0 8000 0 0 202551 bem28 11000 3000 0 0 8000 0 0 202552 bem28 11000 3000 0 0 8000 0 0 202553 bem28 11000 3000 0 0 8000 0 0 202554 bem28 11000 3000 0 0 8000 0 0 202555 bem28 11000 3000 0 0 8000 0 0 202556 bem28 11000 3000 0 0 8000 0 0 202653 bem28 11000 3000 0 0 8000 0 0 203965 ter18 12862 402 4460 0 8000 0 0 203967 ter18 12862 402 4460 0 8000 0 0 203969 ter18 12861 402 4460 0 8000 0 0 203971 ter18 12861 402 4460 0 8000 0 0 203973 ter18 12861 402 4460 0 8000 0 0 203975 ter18 12861 402 4460 0 8000 0 0 203977 ter18 12861 402 4460 0 8000 0 0 203979 ter18 12861 402 4460 0 8000 0 0 203981 ter18 12861 402 4460 0 8000 0 0 In the example his jobs have been waiting for about 7 hours even... so he has a time factor in play too... but as of even a few minutes ago, the first user's jobs are still jumping the 2nd users now. So there is something we are missing we just don't know what. Sample output of squeue: 197043 lowmem full_per bem28 PD 0:00 1 (Priority) 197044 lowmem full_per bem28 PD 0:00 1 (Priority) 197045 lowmem full_per bem28 PD 0:00 1 (Priority) 197046 lowmem full_per bem28 PD 0:00 1 (Priority) 197047 lowmem full_per bem28 PD 0:00 1 (Priority) 197048 lowmem full_per bem28 PD 0:00 1 (Priority) 197049 lowmem full_per bem28 PD 0:00 1 (Priority) 197050 lowmem full_per bem28 PD 0:00 1 (Priority) 196887 lowmem full_per bem28 R 3:10 1 hardac-node01-1 196888 lowmem full_per bem28 R 3:10 1 hardac-node04-1 196886 lowmem full_per bem28 R 3:19 1 hardac-node07-2 196885 lowmem full_per bem28 R 7:04 1 hardac-node06-2 196884 lowmem full_per bem28 R 11:49 1 hardac-node06-1 196883 lowmem full_per bem28 R 13:40 1 hardac-node03-3 Thoughts from any other slurm users would be greatly appreciated. AC
[slurm-dev] Re: restricting submit-hosts to a chosen few
Morris Jette wrote: >See configuration parameter AllocNodes. > >Lech Nieroda wrote: >> >>Dear list, >> >>I'm looking for a way to specify which nodes are allowed to submit >>jobs. By default all nodes (compute nodes and frontends) may submit >>jobs with sbatch or salloc, but I'd like to restrict that privilege to >>the frontends only. >>We already have a job_submit_lua script running, but I haven't seen an >>attribute there that would specify the submit host. >>Any ideas? >> >>Regards, >>Lech >> >>-- >>Dipl.-Wirt.-Inf. Lech Nieroda >>Regionales Rechenzentrum der Universität zu Köln (RRZK) > >-- >Sent from my Android phone with K-9 Mail. Please excuse my brevity. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
[slurm-dev] update job requirements
Dear all, here is my job: $ scontrol show job=3106554 JobId=3106554 Name=uptime UserId=mark(19423) GroupId=swtest(50147) Priority=99685 Account=swtest QOS=normal JobState=PENDING Reason=Resources Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A SubmitTime=2013-09-19T21:51:59 EligibleTime=2013-09-19T21:51:59 StartTime=2013-09-20T14:08:18 EndTime=Unknown PreemptTime=None SuspendTime=None SecsPreSuspend=0 Partition=all AllocNode:Sid=tauruslogin1:5277 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:* MinCPUsNode=1 MinMemoryNode=10M MinTmpDiskNode=0 Features=(null) Gres=(null) Reservation=(null) Shared=OK Contiguous=0 Licenses=(null) Network=(null) Command=/usr/bin/uptime WorkDir=/home/h3/mark I want to lower the memory requirement and get: $ scontrol update job=3106554 MinMemoryNode=1000 slurm_update error: Access/permission denied This works for root, but not for the job owner. Is this an intended behaviour? (It works as root, though.) Thanks, Ulf -- ___ Dr. Ulf Markwardt Technische Universität Dresden Center for Information Services and High Performance Computing (ZIH) 01062 Dresden, Germany smime.p7s Description: S/MIME Kryptografische Unterschrift
[slurm-dev] Re: restricting submit-hosts to a chosen few
See configuration parameter AllocNodes. Lech Nieroda wrote: > >Dear list, > >I'm looking for a way to specify which nodes are allowed to submit >jobs. By default all nodes (compute nodes and frontends) may submit >jobs with sbatch or salloc, but I'd like to restrict that privilege to >the frontends only. >We already have a job_submit_lua script running, but I haven't seen an >attribute there that would specify the submit host. >Any ideas? > >Regards, >Lech > >-- >Dipl.-Wirt.-Inf. Lech Nieroda >Regionales Rechenzentrum der Universität zu Köln (RRZK) -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
[slurm-dev] Re: special job error state
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 G'day Stu! ;-) On 19/09/13 17:19, Stu Midgley wrote: > SGE has a special job error state of 100 (ie. exit 100) which puts > the job in E state in the queue. The job leaves the allocated > node(s) and goes back into the queue in E state. This means we can > easily know which jobs have failed, look at their log, fix the > problem (usually a system problem - like an unmounted file system > or crashed ypbind) and then clear the error and the job goes into > into Q state. I can't comment on the special exit status, but we make much use of the health check within Slurm (and Torque before it) to spot system issues and mark nodes as DRAIN if we see something wrong. With Torque we would run the health check scripts from cron and pbs_mom would just run a script to cat the file the cron job produced (in /dev/shm) to avoid any blocking, and for Slurm we've just ported that directly across except changing the cat behaviour in the script invoked by slurmd to use scontrol to knock the node offline (or online if the checks are passing and it was an auto check that took it offline last). Works well for us and may help your situation. All the best, Chris - -- Christopher SamuelSenior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.12 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlI7GBwACgkQO2KABBYQAh+zAACeP+SPRJeLfroG9Za4rCzpR6Nw mBwAoJMMlPeLGTDDAcVv6qqNeDok9x2f =LEAV -END PGP SIGNATURE-
[slurm-dev] restricting submit-hosts to a chosen few
Dear list, I'm looking for a way to specify which nodes are allowed to submit jobs. By default all nodes (compute nodes and frontends) may submit jobs with sbatch or salloc, but I'd like to restrict that privilege to the frontends only. We already have a job_submit_lua script running, but I haven't seen an attribute there that would specify the submit host. Any ideas? Regards, Lech -- Dipl.-Wirt.-Inf. Lech Nieroda Regionales Rechenzentrum der Universität zu Köln (RRZK)
[slurm-dev] array jobs and --dependency
Hello, It seems like job dependency on array jobs doesn't work quite as I expected. If I submit an array job with 10 elements to it, and then a separate "collect my data" job with a dependency=afterok:array-job, that job starts when any of the array jobs finishes, and thus collects just part of the result. However, if I query slurm for the actual JOBIDs of the array jobs, and put that into the --dependency, it works. That presents another problem with large array jobs though - there's a 1024 character limit for --dependency, as we saw in messages: slurmctld[9962]: job_create_request: strlen(dependency) too big (1402 > 1024) So I'm wondering if it should work with having the parent array jobid as dependency? Can the dependency list use ranges, and if yes, is there a way to query slurm for the proper jobid range for an array job? This is to try and get around the 1024 character limit. I have scripts that will reproduce the above, if it would be of interest? With best regards, Andreas Loong -- Confidentiality Notice: This message is private and may contain confidential and proprietary information. If you have received this message in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this message is not permitted and may be unlawful.
[slurm-dev] RE: can't make "sacct"
I got this error when i try a sacct JOBID : SLURM accounting storage is disabled here my slurmctld log file ( tail -f /var/log/slurm-llnl/slurmctld.log) : [2013-09-19T10:58:20] sched: _slurm_rpc_allocate_resources JobId=179 NodeList=VM-669 usec=70 [2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.0 VM-669 usec=197 [2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.1 VM-669 usec=187 [2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.1 usec=17 [2013-09-19T10:59:00] completing job 179 [2013-09-19T10:59:00] sched: job_complete for JobId=179 successful [2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.0 usec=8 and my conf file (/etc/slurm-llnl/slurm.conf) : # slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ControlMachine=VM-667 ControlAddr=192.168.2.26 #BackupController= #BackupAddr= # AuthType=auth/munge CacheGroups=0 #CheckpointType=checkpoint/none CryptoType=crypto/munge #DisableRootJobs=NO #EnforcePartLimits=NO #Epilog= #PrologSlurmctld= #FirstJobId=1 #MaxJobId=99 #GresTypes= #GroupUpdateForce=0 #GroupUpdateTime=600 JobCheckpointDir=/var/lib/slurm-llnl/checkpoint #JobCredentialPrivateKey= #JobCredentialPublicCertificate= #JobFileAppend=0 #JobRequeue=1 #JobSubmitPlugins=1 #KillOnBadExit=0 #Licenses=foo*4,bar #MailProg=/usr/bin/mail #MaxJobCount=5000 #MaxStepCount=4 #MaxTasksPerNode=128 MpiDefault=none #MpiParams=ports=#-# #PluginDir= #PlugStackConfig= #PrivateData=jobs ProctrackType=proctrack/pgid #Prolog= #PrologSlurmctld= #PropagatePrioProcess=0 #PropagateResourceLimits= #PropagateResourceLimitsExcept= ReturnToService=1 #SallocDefaultCommand= SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd SlurmUser=slurm #SrunEpilog= #SrunProlog= StateSaveLocation=/var/lib/slurm-llnl/slurmctld SwitchType=switch/none #TaskEpilog= TaskPlugin=task/none #TaskPluginParam= #TaskProlog= #TopologyPlugin=topology/tree #TmpFs=/tmp #TrackWCKey=no #TreeWidth= #UnkillableStepProgram= #UsePAM=0 # # # TIMERS #BatchStartTimeout=10 #CompleteWait=0 #EpilogMsgTime=2000 #GetEnvTimeout=2 #HealthCheckInterval=0 #HealthCheckProgram= InactiveLimit=0 KillWait=30 #MessageTimeout=10 #ResvOverRun=0 MinJobAge=300 #OverTimeLimit=0 SlurmctldTimeout=120 SlurmdTimeout=300 #UnkillableStepTimeout=60 #VSizeFactor=0 Waittime=0 # # # SCHEDULING #DefMemPerCPU=0 FastSchedule=1 #MaxMemPerCPU=0 #SchedulerRootFilter=1 #SchedulerTimeSlice=30 SchedulerType=sched/backfill SchedulerPort=7321 SelectType=select/linear #SelectTypeParameters= # # # JOB PRIORITY #PriorityType=priority/basic #PriorityDecayHalfLife= #PriorityCalcPeriod= #PriorityFavorSmall= #PriorityMaxAge= #PriorityUsageResetPeriod= #PriorityWeightAge= #PriorityWeightFairshare= #PriorityWeightJobSize= #PriorityWeightPartition= #PriorityWeightQOS= # # # LOGGING AND ACCOUNTING #AccountingStorageEnforce=0 #AccountingStorageHost= #AccountingStorageLoc= #AccountingStoragePass= #AccountingStoragePort= AccountingStorageType=accounting_storage/none #AccountingStorageUser= AccountingStoreJobComment=YES ClusterName=cluster #DebugFlags= #JobCompHost= #JobCompLoc= #JobCompPass= #JobCompPort= JobCompType=jobcomp/none #JobCompUser= JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/none SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log SlurmdDebug=3 SlurmdLogFile=/var/log/slurm-llnl/slurmd.log #SlurmSchedLogFile= #SlurmSchedLogLevel= # # # POWER SAVE SUPPORT FOR IDLE NODES (optional) #SuspendProgram= #ResumeProgram= #SuspendTimeout= #ResumeTimeout= #ResumeRate= #SuspendExcNodes= #SuspendExcParts= #SuspendRate= #SuspendTime= # # # COMPUTE NODES NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP - Mail original - De: "Nancy Kritkausky" À: "slurm-dev" Envoyé: Mercredi 18 Septembre 2013 18:29:56 Objet: [slurm-dev] RE: can't make "sacct" Hello Siva, There is not a lot of information to go on from your email. What type of accounting do you have configured? What does your slurm.conf and slurmdbd.conf file look like? I would also suggest looking at your slurmdbd.log and slurmd.log to see what is going on, or sending them to the dev list. Nancy From: Sivasangari Nandy [mailto:sivasangari.na...@irisa.fr] Sent: Wednesday, September 18, 2013 9:02 AM To: slurm-dev Subject: [slurm-dev] can't make "sacct" Hi, Hey does anyone know why my "sacct" command doesn't work ? I got this : root@VM-667:/omaha-beach/workflow# sacct JobID JobName Partition Account AllocCPUS State Exit
[slurm-dev] special job error state
We are looking at moving from SGE (old) to SLURM for our production clusters. We make heavy use of task arrays and the special job error state of 100. SLURM now has task arrays, which is great, but as far as I can see, doesn't support the job error state of 100 (or equivalent). Is this planned/available? Can we pay to have it added? Let me explain. SGE has a special job error state of 100 (ie. exit 100) which puts the job in E state in the queue. The job leaves the allocated node(s) and goes back into the queue in E state. This means we can easily know which jobs have failed, look at their log, fix the problem (usually a system problem - like an unmounted file system or crashed ypbind) and then clear the error and the job goes into into Q state. It then gets rescheduled back onto the cluster. We use this in our batch scripts like #!/bin/bash set -o pipefail command1 | command2 | command3 || exit 100 If any of the commands fail, the job ends up in E state in the queue. Thanks -- Dr Stuart Midgley sdm...@sdm900.com