> On Aug 1, 2017, at 3:07 PM, Jason Williams <jas...@jhu.edu> wrote:

> 1)      Is 512 threads a reasonable setting or should it be lower?

Since your servers have enough memory to support 512 threads, that setting is 
probably reasonable.  If your server load is ~100, that probably means most of 
those threads are sitting idle (which should be fine). I think you would only 
need to lower the value if you saw that all the threads were routinely busy and 
there was some evidence that having that many busy threads was causing an issue 
on your server.

> 2)      Is high load “normal” if the file system is under heavy use?  At the 
> time I see a lot of open and attr calls which I thought would load the MDS 
> over the OSS… but my under-the-hood understanding is limited at best.

A high load might very well be normal for your file system.  As you have seen, 
lots of requests can result in threads sitting in IO wait states, which causes 
the load to increase.  On my servers, I don’t usually bat an eye if I see loads 
over 100.  However, there are still a few things you should probably look out 
for:

- If the storage is processing requests quickly, but there are more incoming 
requests than it can handle, the load will go up (which is normal).  But if the 
IO requests are getting backlogged because the storage is not handling requests 
as fast as it should, then that is a problem.  Running iostat should give you 
an idea which case you are running into.

- If the load starts approaching the number of ost threads, then you could be 
getting into a state where the server cannot accept any more incoming requests.
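
A rough sketch of how both checks above could be scripted.  It makes a few 
assumptions: the load comes from /proc/loadavg, the per-device counters come 
from two samples of /proc/diskstats (the same counters iostat uses for its 
await column), and the thread count is whatever you read from something like 
"lctl get_param ost.OSS.ost_io.threads_started".  The function names are made 
up for illustration, and what counts as "too slow" or "too close" depends on 
your storage, so no thresholds are baked in.

```python
def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_diskstats(text):
    """Map device name -> (completed IOs, ms spent on those IOs).

    Expects the contents of /proc/diskstats, where the fields after the
    device name start with: reads completed, reads merged, sectors read,
    ms reading, writes completed, writes merged, sectors written, ms writing.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        reads, read_ms = int(fields[3]), int(fields[6])
        writes, write_ms = int(fields[7]), int(fields[10])
        stats[fields[2]] = (reads + writes, read_ms + write_ms)
    return stats


def average_wait_ms(before, after):
    """Average ms per completed request between two samples of one device."""
    ios = after[0] - before[0]
    ms = after[1] - before[1]
    return ms / ios if ios else 0.0


def thread_headroom(load, threads_started):
    """Fraction of service threads not accounted for by the current load."""
    return 1.0 - load / threads_started
```

If the average wait stays low while the load climbs, the storage is keeping 
up and there is simply a lot of work.  If the wait keeps growing, or the 
headroom gets close to zero, that is the case worth digging into.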

> 3)      Should I be looking at other tunables?

You could double check that read/write caching is enabled (which I think it is 
by default in Lustre 2.5).
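
A minimal sketch for checking the cache settings, assuming you feed it the 
output of something like "lctl get_param obdfilter.*.read_cache_enable 
obdfilter.*.writethrough_cache_enable" on the OSS.  Those parameter names are 
what I would expect on a 2.5-era ldiskfs server; they may live elsewhere in 
other releases, so treat them as an assumption.  The function name is 
hypothetical.

```python
def disabled_cache_params(lctl_output):
    """Return the names of parameters whose value is 0.

    Expects lines of the form 'name=value', which is how lctl get_param
    prints single-valued parameters.
    """
    disabled = []
    for line in lctl_output.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value.strip() == "0":
            disabled.append(name)
    return disabled
```

An empty list means every parameter in the output was set to a non-zero 
value.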

One thing I would recommend would be to take a look at the brw_stats for the 
OSTs to see what sizes of IO requests you are getting.  If there are lots of 
small reads/writes, this can cause IO requests to back up and drive up the load, 
which in turn can hurt performance.  I don’t know what kinds 
of codes your users run, so I don’t know what their IO patterns are like.  When 
I see very high loads on my servers, I usually check to see if there are lots 
of small IO requests from a single user.  This can sometimes be an indication 
that their code is performing IO in a suboptimal manner which is negatively 
impacting the file system.  We have had a fair amount of success working with 
users to improve their IO patterns.  This not only helps alleviate load on our 
servers, but it also increases performance for other users.  In some cases, 
just having the user restripe a file can dramatically reduce load.

So in summary:

Q: Is it a problem to have a high load on my OSS servers?
A: It depends….

(Wish it could be a little more clear cut than that)

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org