On 02/12/14 06:45, Will French wrote:

> Does anyone know of a Slurm equivalent to the Torque command tracejob
> (http://docs.adaptivecomputing.com/torque/4-1-7/Content/topics/11-troubleshooting/usingTracejobToLocateFailures.htm)?
> This command allows you to easily compare requested resources to actual
> usage, and is useful for troubleshooting when a user’s job dies.

I think sacct will give you a lot of that, provided you've set up
accounting.  Here's an example from a trivial job of mine that just
runs sleep 60 followed by exit 1.

The output is very wide (36 fields per row), so it is shown here one
field per line, with one column for the job and one for its batch step.
Blank cells were empty in the output.

[samuel@barcoo BARCOO]$ sacct -j 2633455 -l

Field             2633455     2633455.bat+
----------------  ----------  ------------
JobName           failjob.sh  batch
Partition         main
MaxVMSize                     134884K
MaxVMSizeNode                 barcoo001
MaxVMSizeTask                 0
AveVMSize                     106056K
MaxRSS                        316K
MaxRSSNode                    barcoo001
MaxRSSTask                    0
AveRSS                        316K
MaxPages                      0
MaxPagesNode                  barcoo001
MaxPagesTask                  0
AvePages                      0
MinCPU                        00:00:00
MinCPUNode                    barcoo001
MinCPUTask                    0
AveCPU                        00:00:00
NTasks                        1
AllocCPUS         1           1
Elapsed           00:01:00    00:01:00
State             FAILED      FAILED
ExitCode          1:0         1:0
AveCPUFreq                    2.70G
ReqCPUFreq                    0
ReqMem            2Gc         2Gc
ConsumedEnergy                0
MaxDiskRead                   0.01M
MaxDiskReadNode               barcoo001
MaxDiskReadTask               0
AveDiskRead                   0.01M
MaxDiskWrite                  0.00M
MaxDiskWriteNode              barcoo001
MaxDiskWriteTask              0
AveDiskWrite                  0.00M
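
Because sacct -l prints one very wide row per job step, a small filter can
make the result easier to read. The sketch below is only an illustration:
sacct_fields is a hypothetical helper name, not part of Slurm, and it assumes
the '|'-delimited output that sacct produces with --parsable2. It prints one
"JobID Field=Value" line for every non-empty field.

```shell
# Hypothetical helper (not part of Slurm): reads '|'-delimited sacct output
# on stdin and prints one "JobID Field=Value" line per non-empty field.
sacct_fields() {
    awk -F'|' '
        # First line is the header: remember the name of each column.
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        # Data lines: column 1 is the JobID; skip fields that are empty.
        { for (i = 2; i <= NF; i++) if ($i != "") print $1, name[i] "=" $i }
    '
}
```

It would be used as "sacct -j 2633455 -l --parsable2 | sacct_fields".
If only a few fields matter, asking sacct for just those is likely simpler,
for example "sacct -j 2633455 --format=JobID,State,ExitCode,Elapsed,ReqMem,MaxRSS".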


-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
