Re: [Lustre-discuss] Interpreting stats files

Mohr Jr, Richard Frank (Rick Mohr) Mon, 10 Nov 2014 11:10:47 -0800

On Nov 10, 2014, at 1:14 PM, Brock Palen <bro...@umich.edu> wrote:

> This is cool never seen it before!  
> 
> https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.jobstats
> 
> Question though, is it really per job? Or is it per node combining multi node 
> jobs into one set of stats?
> 
> In our case we allow multiple jobs on a node, would job A and job B on the 
> same node each have their own stats? Or will their stats overlap?



I believe each job on the same node should have its own stats.  If I am not 
mistaken, the jobstats feature is basically just tagging the requests with some 
user-defined string (which in this case is the contents of an env variable).  
When the requests reach the servers, all requests with the same "tag" get 
aggregated together.  

Keep in mind that each MDT/OST has its own jobstats file, so if you want to 
see stats on all the Lustre requests for a given job, you will need to pull 
those stats from each MDT/OST and aggregate the data.  You may also want to 
tweak the auto-cleanup interval (the job_cleanup_interval parameter).  By 
default, this is 10 minutes (600 seconds).  So if a job is busy computing and 
doesn't do I/O for more than 10 minutes, the Lustre servers may automatically 
clean out that job's stat info (which might not be what you want).

One other tip: The examples in the Lustre manual that show how to enable 
jobstats often use the "lctl conf_param" command.  This will cause all clients 
to use the same env variable for reporting jobstats.  However, it can be useful 
to customize this based on the client's functionality.  For example, you can 
use "lctl set_param jobid_var=PBS_JOBID" on compute nodes so that they report 
stats on a per-job basis.  Then you can use "lctl set_param 
jobid_var=procname_uid" on login nodes to report stats based on process name 
and UID.  Then if your MDT gets slammed, you should be able to easily tell if 
the traffic is coming from a batch job or a user running an interactive 
command.  And if it is an interactive command, you will have the process name 
and the user's UID.  (I was able to use this to track down a Lustre client that 
was slamming our MDT because it was misconfigured and trying to index our 
Lustre file system for the "locate" command database.)
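
To illustrate how the two kinds of job IDs can be told apart once the stats 
are collected, here is a small Python sketch.  It is a heuristic resting on 
assumptions, not something Lustre provides: it assumes procname_uid IDs look 
like "<procname>.<uid>" (the manual's example is "bash.0") and treats 
everything else as a batch job ID.  The function names and the sample IDs 
mentioned afterwards are hypothetical.

```python
def classify_job_id(job_id):
    """Heuristically label a jobstats ID as interactive or batch.

    Assumption: procname_uid IDs end in ".<numeric uid>"; anything else
    (e.g. a scheduler job ID) is treated as a batch job.
    """
    name, sep, suffix = job_id.rpartition(".")
    if sep and name and suffix.isdigit():
        return ("interactive", {"procname": name, "uid": int(suffix)})
    return ("batch", {"job": job_id})


def top_talkers(samples_by_job, n=3):
    """Return the n job IDs with the most requests, each with its label."""
    ranked = sorted(samples_by_job.items(), key=lambda kv: kv[1], reverse=True)
    return [(job, count, classify_job_id(job)) for job, count in ranked[:n]]
```

With this, an ID such as "updatedb.0" would be reported as process 
"updatedb" run by UID 0, while an ID such as "12345.sched01" would be 
reported as a batch job.  A scheduler whose job IDs end in a dot followed by 
digits would defeat the heuristic.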

-- 
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
