On 02/12/14 06:45, Will French wrote:
> Does anyone know of a Slurm equivalent to the Torque command tracejob
> (http://docs.adaptivecomputing.com/torque/4-1-7/Content/topics/11-troubleshooting/usingTracejobToLocateFailures.htm)?
> This command allows you to easily compare requested resources to actual
> usage, and is useful for troubleshooting when a user’s job dies.
I think sacct will give you a lot of that, provided you've set up
accounting. Here's an example from a trivial job of mine that just
runs "sleep 60" and then "exit 1".
The -l output is very wide (36 columns), so it is shown here transposed,
one field per row. Blank cells are fields sacct left empty for that record.

[samuel@barcoo BARCOO]$ sacct -j 2633455 -l

Field             Job          Batch step
----------------  -----------  ------------
JobID             2633455      2633455.bat+
JobName           failjob.sh   batch
Partition         main
MaxVMSize                      134884K
MaxVMSizeNode                  barcoo001
MaxVMSizeTask                  0
AveVMSize                      106056K
MaxRSS                         316K
MaxRSSNode                     barcoo001
MaxRSSTask                     0
AveRSS                         316K
MaxPages                       0
MaxPagesNode                   barcoo001
MaxPagesTask                   0
AvePages                       0
MinCPU                         00:00:00
MinCPUNode                     barcoo001
MinCPUTask                     0
AveCPU                         00:00:00
NTasks                         1
AllocCPUS         1            1
Elapsed           00:01:00     00:01:00
State             FAILED       FAILED
ExitCode          1:0          1:0
AveCPUFreq                     2.70G
ReqCPUFreq                     0
ReqMem            2Gc          2Gc
ConsumedEnergy                 0
MaxDiskRead                    0.01M
MaxDiskReadNode                barcoo001
MaxDiskReadTask                0
AveDiskRead                    0.01M
MaxDiskWrite                   0.00M
MaxDiskWriteNode               barcoo001
MaxDiskWriteTask               0
AveDiskWrite                   0.00M
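
If all you want is the requested-versus-used comparison, a narrower call
is probably easier to read than the full -l listing. The following is only
a minimal sketch, assuming accounting is enabled: the function name
"jobcheck" is made up for illustration, and the column list is simply a
subset of the fields that appear in the output above.

```shell
# Hypothetical helper: "jobcheck" is an illustrative name, not a Slurm command.
# It prints requested vs. used resources for one job, restricted to columns
# that already appear in the "sacct -l" output shown above.
jobcheck() {
    # The first argument is the job ID; fail early if it is missing.
    if [ -z "${1:-}" ]; then
        echo "usage: jobcheck JOBID" >&2
        return 2
    fi
    sacct -j "$1" \
        --format=JobID,JobName,State,ExitCode,Elapsed,AllocCPUS,ReqMem,MaxRSS,MaxVMSize
}

# Example: jobcheck 2633455
```

Comparing ReqMem against MaxRSS and MaxVMSize, alongside State and
ExitCode, covers much of what you would otherwise look for when a
user's job dies.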
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci