Hi all,

Thanks for all the replies and info!

We were able to narrow the problem down to DNS timeouts from an internal
DNS server that had reached its netfilter (nf_conntrack) connection-tracking
limit. Once that limit was increased, the issue went away.
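
For anyone who runs into the same thing: when the conntrack table fills up,
the kernel drops new flows and typically logs "nf_conntrack: table full,
dropping packet". Checking and raising the limit looks roughly like the
following (the value is illustrative, not a recommendation):

    # how close conntrack is to its ceiling
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

    # raise the ceiling (persist it under /etc/sysctl.d/ to survive reboots)
    sysctl -w net.netfilter.nf_conntrack_max=1048576
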
With some forwarded insights from the folks at CMU and some isolated
testing, we were also able to confirm that disabling dynamic root and
DNS-based server discovery on the cache manager works around the issue.
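
The client-side workaround comes down to which flags afsd is started with:
-dynroot provides the dynamic /afs root and -afsdb enables DNS (AFSDB/SRV)
based cell lookups, so dropping both makes the cache manager rely on
CellServDB alone. A sketch of what that looks like on a typical CentOS 7
RPM install (the file and the other options will vary by packaging and site):

    # /etc/sysconfig/openafs
    # before
    AFSD_ARGS="-dynroot -afsdb -stat 100000"
    # after: no dynamic root, no DNS-based server discovery
    AFSD_ARGS="-stat 100000"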

Thanks again!
k-

On Fri, Nov 19, 2021 at 9:34 PM Jeffrey E Altman <jalt...@auristor.com>
wrote:

> On 11/10/2021 3:27 PM, Kendrick Hernandez (kendrick.hernan...@umbc.edu)
> wrote:
>
> Hi all,
>
> We host around 240 departmental and campus web sites (individual afs
> volumes) across 6 virtual web servers on AFS storage. The web servers are 4
> core, 16G VMs, and the 4 file servers are 4 core 32G VMs. All CentOS 7
> systems.
>
> In the past week or so, we've encountered high load on the web servers
> (primary consumers being apache and afsd) during periods of increased
> traffic, and we're trying to identify ways to tune performance.
>
> "In the past week or so" appears to imply that the high-load was not
> observed previously.  If that is the case, one question to ask is "what
> changed?"  Analysis of the Apache access and error logs compared to the
> prior period might provide some important clues.
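>
> (As a rough first pass, and assuming the stock combined log format, a
> per-day request count is enough to see whether traffic actually changed:)
>
>     awk '{print $4}' /var/log/httpd/access_log | cut -d: -f1 | sort | uniq -c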
>
> After seeing the following in the logs:
>
> 2021 11 08 08:52:03 -05:00 virthost4 [kern.warning] kernel: afs: Warning:
>> We are having trouble keeping the AFS stat cache trimmed down under the
>> configured limit (current -stat setting: 3000, current vcache usage: 18116).
>> 2021 11 08 08:52:03 -05:00 virthost4 [kern.warning] kernel: afs: If AFS
>> access seems slow, consider raising the -stat setting for afsd.
>
> There is a one-to-one mapping between AFS vnodes and Linux inodes.  Unlike
> some other platforms with OpenAFS kernel modules, the Linux kernel module
> does not strictly enforce the vnode cache (aka vcache) limit.  When the
> limit is reached, instead of finding a vnode to recycle, new vnodes are
> created and a background task attempts to prune excess vnodes. It's that
> background task which is logging the text quoted above.
>
> I increased the disk cache to 10g and the -stat parameter to 100000, which
> has improved things somewhat, but we're not quite there yet.
>
> As Ben Kaduk mentioned in his reply, callback promises must be tracked by
> both the fileserver and the client.  Increasing the vcache (-stat) limit
> increases the number of vnodes for which callbacks must be tracked.  The
> umbc.edu cell is behind a firewall, so it's not possible for me to probe
> the fileserver statistics to determine if increasing to 100,000 on the
> clients also requires an increase on the fileservers.  If the fileserver
> callback table is full, then it might have to prematurely break callback
> promises to satisfy the new allocation.  A callback break requires issuing
> an RPC to the client whose promise is being broken.
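>
> (If the callback table does turn out to be the limiting factor, its size is
> set with the fileserver's -cb option. A rough sketch, assuming the instance
> is named fs and using an illustrative value:)
>
>     # inspect the current fileserver command line
>     bos status <fileserver> fs -long
>     # add e.g. "-cb 1500000" to the fileserver parm line in BosConfig,
>     # then restart the instance
>     bos restart <fileserver> fs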
>
> This is the current client cache configuration from one of the web servers:
>
> Chunk files:   281250
>> Stat caches:   100000
>> Data caches:   10000
>>
> The data cache might need to be increased if the web servers are serving
> content from more than 18,000 files.
>
> Volume caches: 200
>>
> If the web servers are serving data from 240 volumes, then 200 volumes is
> too small.
>
> Chunk size:    1048576
>> Cache size:    9000000 kB
>> Set time:      no
>> Cache type:    disk
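>
> (For reference, those values map onto afsd startup options roughly as
> follows; this mirrors the numbers above rather than recommending them:)
>
>     # -stat: stat cache entries      -dcache: data cache entries
>     # -volumes: volume cache entries -files: chunk files on disk
>     # -blocks: cache size in 1K blocks   -chunksize: log2 of the chunk size
>     afsd -stat 100000 -dcache 10000 -volumes 200 -files 281250 \
>          -blocks 9000000 -chunksize 20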
>
>
> Has anyone else experienced this? I think the bottleneck is with the cache
> manager and not the file servers themselves, because they don't seem to be
> impacted much during those periods of high load, and I can access files in
> those web volumes from my local client without any noticeable lag.
>
> Apart from the cache settings, how the web server is configured and how it
> accesses content from /afs matters.
>
> * Are the web servers configured with mod_waklog to obtain tokens for
> authenticated users?
>
> * Are PAGs in use?
>
> * How many active RX connections are there from the cache manager to the
> fileservers? (A quick way to check is sketched after this list.)
>
> * Are the volumes being served primarily RW volumes or RO volumes?
>
> * Are the contents of the volumes frequently changing?
>
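> (For the RX connection question, rxdebug can be pointed at the cache
> manager's callback port from any host that can reach the web servers; a
> minimal sketch, with 7001 being the client service and 7000 the fileserver:)
>
>     rxdebug <webserver> 7001 -rxstats -noconns
>     rxdebug <webserver> 7001 -allconnections
>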
> Finally, compared to the AuriStorFS and kafs clients, the OpenAFS cache
> manager suffers from a number of bottlenecks on multiprocessor systems due
> to reliance on a global lock to protect internal data structures.  The
> cache manager's callback service is another potential bottleneck because
> only one incoming RPC can be processed at a time and each incoming RPC must
> acquire the aforementioned global lock for the life of the call.
>
> Good luck,
>
> Jeffrey Altman
>
>
>

-- 
Kendrick Hernandez
*UNIX Systems Administrator*
Division of Information Technology
University of Maryland, Baltimore County
