On 09/11/2012 05:56 PM, Moe Jette wrote:
> Quoting Yuri D'Elia <[email protected]>:
>> Is it possible to have the current cpu/memory statistics from a running
>> job? sacct doesn't seem to report anything but "elapsed" time for those
>> jobs.
>>
>> I have a couple of users that would like to profile the resource usage
>> of their jobs (such as current cpu load and time, current virt/rss,
>> etc). All of this data is already collected by the accounting plugin.
>> Any suggestions?
>
> Did you look at the sstat command? It would have to be run while the
> job is running, but captures much more information.
To answer myself, I wrote the attached script a couple of months ago in
response to this. It's named "sstop" as in "SLURM TOP".
It reports CPUS/TIME/ELAPSED/LOAD/%CPU for a job ID: the average since
the job started by default, or per-interval values when run with -d.
The idea is that for jobs having non-trivial scheduling, some
steps/tasks might stall during part of the computation in a way that
cannot be reproduced when downscaled. To aid in this, you can just
inspect the current job status, and if needed attach a debugger.
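As a rough sketch of what the LOAD and %CPU columns mean: LOAD is the CPU time consumed between two samples divided by the wall-clock time between them, and %CPU is that load relative to the number of allocated CPUs. The helper name interval_load and the sample numbers below are made up for illustration; they are not part of the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of sstop): load over one sampling interval.
# Takes CPU seconds and wall seconds at the previous and the current sample,
# plus the number of allocated CPUs. Returns (load, percent), or an empty
# list when no wall time has passed or no CPUs are known.
sub interval_load
{
    my ($cpu0, $wall0, $cpu1, $wall1, $ncpus) = @_;
    my $dwall = $wall1 - $wall0;
    return unless $dwall > 0 && $ncpus > 0;
    my $load = ($cpu1 - $cpu0) / $dwall;
    return ($load, $load / $ncpus * 100);
}

# Made-up samples: 30 CPU seconds used during 10 wall seconds on 4 CPUs.
my ($load, $pct) = interval_load(100, 60, 130, 70, 4);
printf "load %.3f, %.3f%% of 4 CPUs\n", $load, $pct;
```

A job that stalls shows up as a load close to zero in the affected interval, even when its average since start still looks healthy.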
It is written as a contrived combination of squeue/sstat calls. I
promised myself I would use the Perl API directly, but so far I have had
no time to do it. Maybe somebody else will find it useful as it is.
Hopefully it's not a problem if I attach it here (sorry otherwise).
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;
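# Convert a SLURM time string ([[days-]hours:]minutes:seconds[.fraction])
# into seconds; unparsable input (e.g. empty sstat output) counts as 0.
sub parseTime
{
    my $str = shift // '';
    my ($days, $hours, $mins, $secs) = ($str =~
        /^\s*(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+(?:\.\d+)?)\s*$/);
    return 0 unless defined($secs);
    $days = $days // 0;
    $hours = $hours // 0;
    return ($secs + $mins * 60 + $hours * 3600 + $days * 86400);
}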
sub printTime
{
    my $secs = shift;
    my $days = int($secs / 86400);
    $secs %= 86400;
    my $hours = int($secs / 3600);
    $secs %= 3600;
    my $mins = int($secs / 60);
    $secs %= 60;
    return sprintf("%02d-%02d:%02d:%02d", $days, $hours, $mins, $secs);
}
sub fmtLine
{
    my ($jobid, $np, $cpu, $wall, $load, $nload) = @_;
    printf("%10s %4s %11s %11s %8s %8s\n",
        $jobid, $np, $cpu, $wall, $load, $nload);
}
sub fmtDataLine
{
    my ($jobid, $np, $cpu, $wall, $load, $nload) = @_;
    fmtLine($jobid, $np, printTime($cpu), printTime($wall),
        sprintf("%.3f", $load), sprintf("%.3f", $nload));
}
sub main
{
    my ($job, $delay) = @_;
    # check for parameters (a valid and running job)
    die("invalid job ID $job\n") if($job !~ /^\d+$/);
    my $out = `sstat -n -o 'JobID' -j "$job.batch"`;
    my $st = $? >> 8;
    exit 1 if($st || $out =~ /^\s*$/);
    # header line
    fmtLine("JOBID", "CPUS", "TIME", "ELAPSED", "LOAD", "%CPU");
    # fetch the number of CPUs allocated to the job
    my $np = `squeue -h -o '%C' -j "$job"` + 0;
    # initial values wall/time
    my $icpu = parseTime(`sstat -n -o AveCPU -j "$job.batch"`);
    my $iwall = parseTime(`squeue -h -o '%M' -j "$job"`);
    if($delay > 0)
    {
        # iterate until stopped
        while(sleep $delay)
        {
            # check if the job is still running
            my $out = `sstat -n -o 'JobID' -j "$job.batch"`;
            my $st = $? >> 8;
            last if($st || $out =~ /^\s*$/);
            # calculate partials
            my $ccpu = parseTime(`sstat -n -o AveCPU -j "$job.batch"`);
            my $cwall = parseTime(`squeue -h -o '%M' -j "$job"`);
            my $csecs = $cwall - $iwall;
            next unless($csecs);
            my $cload = ($ccpu - $icpu) / $csecs;
            # avoid dividing by zero if squeue reported no CPUs
            my $cnload = $np ? $cload / $np * 100 : 0;
            fmtDataLine($job, $np, $ccpu, $cwall, $cload, $cnload);
            $icpu = $ccpu;
            $iwall = $cwall;
        }
        print STDERR "process ended\n";
    }
    else
    {
        # just the average (guard against a job that has just started)
        my $cload = $iwall ? $icpu / $iwall : 0;
        my $cnload = $np ? $cload / $np * 100 : 0;
        fmtDataLine($job, $np, $icpu, $iwall, $cload, $cnload);
    }
}
sub help
{
    print "Usage: $0 [-d delay] JOBID\n";
    exit 0;
}
# parse options
my %flags;
getopts('d:h', \%flags);
my $delay = $flags{'d'} || 0;
my ($job) = @ARGV;
# -h was accepted by getopts but never acted upon
help() if($flags{'h'} || !defined($job));
main($job, $delay);
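
For reference, here is a small standalone sketch of the time strings the parseTime pattern accepts: minutes:seconds, with optional hours, an optional days- prefix, and optional fractional seconds. The name to_seconds and the sample strings are made up for illustration; the regular expression is the one used in the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for parseTime, using the same pattern:
# [[days-]hours:]minutes:seconds[.fraction]
# Returns undef (in scalar context) for input it does not recognise.
sub to_seconds
{
    my $str = shift;
    my ($d, $h, $m, $s) = ($str =~
        /^\s*(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+(?:\.\d+)?)\s*$/);
    return unless defined $s;
    return $s + $m * 60 + ($h // 0) * 3600 + ($d // 0) * 86400;
}

# Made-up sample strings, including one with a trailing newline
# (as backtick output has) and one that does not match at all.
for my $sample ("01:30", "02:01:30", "1-02:01:30", "00:00:01.500\n", "n/a")
{
    my $secs = to_seconds($sample);
    (my $shown = $sample) =~ s/\s+$//;
    printf "%-14s %s\n", $shown, defined $secs ? $secs : "unparsed";
}
```

The first four samples parse to 90, 7290, 93690 and 1.5 seconds; the last one is reported as unparsed.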