> On Sep 17, 2020, at 4:23 PM, Frank van der Linden <[email protected]> wrote:
>
> On Thu, Sep 17, 2020 at 03:09:31PM -0400, bfields wrote:
>>
>> On Thu, Sep 17, 2020 at 05:01:11PM +0100, Daire Byrne wrote:
>>>
>>> ----- On 15 Sep, 2020, at 18:21, bfields [email protected] wrote:
>>>
>>>>> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per
>>>>> second) quickly eat up the CPU on the re-export server and perf top
>>>>> shows we are mostly in native_queued_spin_lock_slowpath.
>>>>
>>>> Any statistics on who's calling that function?
>>>
>>> I've always struggled to reproduce this with a simple open/close
>>> simulation, so I suspect some other operations need to be mixed in too. But
>>> I have one production workload that I know has lots of opens & closes
>>> (buggy software) included in amongst the usual reads, writes etc.
>>>
>>> With just 40 clients mounting the reexport server (v5.7.6) using NFSv4.2,
>>> we see the CPU of the nfsd threads increase rapidly and by the time we have
>>> 100 clients, we have maxed out the 32 cores of the server with most of that
>>> in native_queued_spin_lock_slowpath.
>>
>> That sounds a lot like what Frank Van der Linden reported:
>>
>>
>> https://lore.kernel.org/linux-nfs/20200608192122.ga19...@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com/
>>
>> It looks like a bug in the filehandle caching code.
>>
>> --b.
>
> Yes, that does look like the same one.
>
> I still think that not caching v4 files at all may be the best way to go
> here, since the intent of the filecache code was to speed up v2/v3 I/O,
> where you end up doing a lot of opens/closes, but it doesn't make as
> much sense for v4.
>
> However, short of that, I tested a local patch a few months back that
> I never posted here, so I'll do so now. It just makes v4 opens into
> 'long term' opens, which do not get put on the LRU, since that doesn't
> make sense (they are in the hash table, so they are still cached).
>
> Also, the file caching code seems to walk the LRU a little too often,
> but that's another issue - and this change keeps the LRU short, so it's
> not a big deal.
>
> I don't particularly love this patch, but it does keep the LRU short, and
> did significantly speed up my testcase (by about 50%). So, maybe you can
> give it a try.
>
> I'll also attach a second patch that converts the hash table to an
> rhashtable, which automatically grows and shrinks in size with usage.
> That patch also helped, but not by nearly as much (I think it yielded
> another 10%).

For what it's worth, I applied your two patches to my test server, along
with my patch that force-closes cached file descriptors during NFSv4
CLOSE processing. The patch combination improves performance (faster
elapsed time) for my workload as well.
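
For anyone following along, here is a toy userspace sketch of the idea
described for the first patch: entries marked "long term" go into the hash
table but are never linked onto the LRU, so an LRU walk stays short. This is
not the patch and not the nfsd filecache code. Every name in it (struct cache,
cache_insert, long_term, and so on) is hypothetical, and a fixed-size bucket
array stands in for the real hash table.

```c
/* Toy userspace model only, not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 8

struct entry {
	unsigned long key;                 /* stands in for the file's identity */
	bool long_term;                    /* true for a v4-style open */
	struct entry *hash_next;           /* hash bucket chain */
	struct entry *lru_prev, *lru_next; /* LRU links, unused if long_term */
};

struct cache {
	struct entry *buckets[NBUCKETS];
	struct entry lru;                  /* sentinel head of a circular LRU list */
	size_t lru_len;
};

static void cache_init(struct cache *c)
{
	for (size_t i = 0; i < NBUCKETS; i++)
		c->buckets[i] = NULL;
	c->lru.lru_prev = c->lru.lru_next = &c->lru;
	c->lru_len = 0;
}

/* Lookups go through the hash table, so long-term entries are still found. */
static struct entry *cache_lookup(struct cache *c, unsigned long key)
{
	struct entry *e = c->buckets[key % NBUCKETS];

	while (e && e->key != key)
		e = e->hash_next;
	return e;
}

/* Every entry is hashed; only short-term entries also join the LRU. */
static struct entry *cache_insert(struct cache *c, unsigned long key,
				  bool long_term)
{
	struct entry *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	e->key = key;
	e->long_term = long_term;
	e->hash_next = c->buckets[key % NBUCKETS];
	c->buckets[key % NBUCKETS] = e;
	if (!long_term) {
		e->lru_next = c->lru.lru_next;
		e->lru_prev = &c->lru;
		c->lru.lru_next->lru_prev = e;
		c->lru.lru_next = e;
		c->lru_len++;
	}
	return e;
}

/* Number of entries an LRU walk has to visit. */
static size_t cache_lru_walk(const struct cache *c)
{
	size_t n = 0;

	for (const struct entry *e = c->lru.lru_next; e != &c->lru;
	     e = e->lru_next)
		n++;
	assert(n == c->lru_len);
	return n;
}

/* Free all entries via the hash chains and reset the LRU. */
static void cache_destroy(struct cache *c)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		struct entry *e = c->buckets[i];

		while (e) {
			struct entry *next = e->hash_next;

			free(e);
			e = next;
		}
		c->buckets[i] = NULL;
	}
	c->lru.lru_prev = c->lru.lru_next = &c->lru;
	c->lru_len = 0;
}
```

In this model, inserting one short-term entry and two long-term entries leaves
all three reachable through cache_lookup(), while cache_lru_walk() visits only
the short-term one.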
--
Chuck Lever
--
Linux-cachefs mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cachefs