Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-13 Thread Niels Kobschätzki
On 04/14/2018 03:49 AM, Rick Macklem wrote:
> Niels Kobschätzki wrote:
>> sorry for the cross-posting but so far I had no real luck on the forum
>> or on question, thus I want to try my luck here as well.
> I read email lists but don't do the other stuff, so I just saw this yesterday.
> Short answer, I haven't a clue why cache hits rate would have changed.
> 
> The code that decides if there is a hit/miss for the attribute cache is in
> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
> except the old code did a mtx_lock(), but I can't imagine how that
> would affect the code.
> 
> You might want to:
> # sysctl -a | fgrep vfs.nfs
> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
> been changed. (I don't recall any being changed, but??)

I did that and nothing has changed.

> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
> top, where it calculates "timeo" from it.
> Running this hacked kernel might show you if either of these fields is bogus.
> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
> clause that increments "attrcache_misses", which is where the cache misses
> happen to see why it is missing the cache.)
> If you could do this for the 10.3 kernel as well, this might indicate why the
> miss rate has increased?

I will do this next week. On Monday we are switching to other NFS servers
for unrelated reasons, and once we see that they run stably, I will do this.

Btw., I have now calculated the percentages. The old servers had an attr
miss rate of something like 0.004%, while the upgraded one is at more like
2.7%. This is still low from what I've read (I remember that you should
start adjusting acreg* when you hit more than 40% misses), but far higher
than before.

nfsstat -c for one of the working servers looks like this (I did a -cz
before to reset it and did this a couple of seconds later):
 Attr Hits   Misses   Lkup Hits   Misses   BioR Hits   Misses   BioW Hits   Misses
  10085375      255     9163995      577         540        0           0        0
BioRL Hits   Misses   BioD Hits   Misses   DirE Hits   Misses   Accs Hits   Misses
      1380        0           0        0           0        0     9169427      277

and for the non-working server:
 Attr Hits   Misses   Lkup Hits   Misses   BioR Hits   Misses   BioW Hits   Misses
   1606365    20647     1418205      239         581        0           0        0
BioRL Hits   Misses   BioD Hits   Misses   DirE Hits   Misses   Accs Hits   Misses
       895        0           0        0           0        0     1439080      337
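
For reference, the miss rate of a counter pair is misses divided by
(hits + misses). The following is a minimal sketch of that arithmetic; the
function name miss_rate_percent() is made up for illustration and is not
part of nfsstat or the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not an nfsstat or kernel API): miss rate of one
 * hit/miss counter pair, as a percentage of all lookups.
 */
double
miss_rate_percent(uint64_t hits, uint64_t misses)
{
	uint64_t total = hits + misses;

	/* A freshly zeroed counter pair has no meaningful rate; report 0. */
	if (total == 0)
		return (0.0);
	return (100.0 * (double)misses / (double)total);
}
```

Fed with the Attr columns of the two short samples above,
miss_rate_percent(10085375, 255) gives roughly 0.0025% and
miss_rate_percent(1606365, 20647) gives roughly 1.27%. These differ from
the 0.004% and 2.7% quoted earlier, which presumably come from a different
sampling interval, but the gap between the two machines is of the same
order.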


>> I upgraded a machine from 10.3-Prerelease (custom kernel with
>> tcp_fastopen added) to 11.1-Release (standard kernel) with
>> freebsd-update. I have two other machines that are still on
>> 10.3-Prerelease. Those machines mount an NFS-export from a
>> Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>> far more cache misses for getattr than on the 10.3-machines (we talk a
>> factor of 100) in munin. munin also shows a lot more cache-misses for
>> other metrics like biow, biorl, biod (where can I find what those
>> metrics mean…currently I have not even an understanding what these are)
>> etc.
>>
>> Can anybody help me debug this problem, or does anyone have an idea what
>> could cause it? The result of this behavior is that this
>> machine shows lower performance than the others, and I cannot upgrade
>> other machines until I have fixed this bug.
> I haven't run a 10.x system in quite a while. When I get home in a few days,
> I might be able to reproduce this. If I can, I can poke at it, but it would be
> at least a week before I might have an answer and I may not figure it out for a
> long time.

Ok, thanks a lot. That would be great.

Niels
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

2018-04-13 Thread Rick Macklem
Niels Kobschätzki wrote:
>sorry for the cross-posting but so far I had no real luck on the forum
>or on question, thus I want to try my luck here as well.
I read email lists but don't do the other stuff, so I just saw this yesterday.
Short answer, I haven't a clue why cache hits rate would have changed.

The code that decides if there is a hit/miss for the attribute cache is in
ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
except the old code did a mtx_lock(), but I can't imagine how that
would affect the code.

You might want to:
# sysctl -a | fgrep vfs.nfs
for both the 10.3 and 11.1 systems, to check if any defaults have somehow
been changed. (I don't recall any being changed, but??)

If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
top, where it calculates "timeo" from it.
Running this hacked kernel might show you if either of these fields is bogus.
(You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
clause that increments "attrcache_misses", which is where the cache misses
happen to see why it is missing the cache.)
If you could do this for the 10.3 kernel as well, this might indicate why the
miss rate has increased?
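
The check described above can be modelled in userland to see how a bogus
"time_second" or "n_mtime" would change the outcome. This is only a sketch
under assumptions: it assumes "timeo" is the time since the file's mtime
divided by 10, clamped between the mount's minimum and maximum attribute
cache timeouts (3 and 60 seconds are assumed defaults for regular files),
and it leaves out any other conditions the real code applies. The names
ac_limits, attr_timeo() and attr_cache_miss() are hypothetical, not kernel
identifiers.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the mount's acregmin/acregmax settings. */
struct ac_limits {
	time_t ac_min;
	time_t ac_max;
};

/*
 * Assumed model of the timeout: one tenth of the time since the file was
 * last modified, clamped to [ac_min, ac_max]. A locally modified file
 * always gets the minimum.
 */
time_t
attr_timeo(time_t now, time_t mtime, bool modified, struct ac_limits lim)
{
	time_t timeo = (now - mtime) / 10;

	if (modified || timeo < lim.ac_min)
		timeo = lim.ac_min;
	else if (timeo > lim.ac_max)
		timeo = lim.ac_max;
	return (timeo);
}

/* A miss is counted once the cached attributes are at least timeo old. */
bool
attr_cache_miss(time_t now, time_t attrstamp, time_t timeo)
{
	return ((now - attrstamp) >= timeo);
}
```

In this model, an mtime that lies in the future (for example because of
clock skew between client and server) makes (now - mtime) negative, so
"timeo" always falls back to the minimum and attributes expire after
3 seconds instead of up to 60. That is one way a bogus field could raise
the miss rate, which is what the suggested printf() calls would reveal.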

>I upgraded a machine from 10.3-Prerelease (custom kernel with
>tcp_fastopen added) to 11.1-Release (standard kernel) with
>freebsd-update. I have two other machines that are still on
>10.3-Prerelease. Those machines mount an NFS-export from a
>Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>far more cache misses for getattr than on the 10.3-machines (we talk a
>factor of 100) in munin. munin also shows a lot more cache-misses for
>other metrics like biow, biorl, biod (where can I find what those
>metrics mean…currently I have not even an understanding what these are)
>etc.
>
>Can anybody help me debug this problem, or does anyone have an idea what
>could cause it? The result of this behavior is that this
>machine shows lower performance than the others, and I cannot upgrade
>other machines until I have fixed this bug.
I haven't run a 10.x system in quite a while. When I get home in a few days,
I might be able to reproduce this. If I can, I can poke at it, but it would be
at least a week before I might have an answer and I may not figure it out for a
long time.

rick


[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

Stephen Hurd  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #192502|0                           |1
        is obsolete|                            |

--- Comment #28 from Stephen Hurd  ---
Created attachment 192505
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=192505&action=edit
Additional debugging in ixgbe_stop()

This patch won't solve the problem, but it will log errors encountered in
ixgbe_stop() if any.

If there are no errors logged in dmesg, I'm curious if that delay needs to be
at the beginning of the call to stop, or if it can be moved to just before the
init_locked() call.

If there's an error, possibly just retrying after a short delay will help, but
if not, I'll see if I can get an 11-STABLE system up and running this weekend.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #27 from Sylvain Galliano  ---
(In reply to Stephen Hurd from comment #25)

In my first test, I used commit r332481 (with msec_delay moved in netmap code)
-> worked with netmap only (not for ifconfig down/up)

I've just tested your attached patch (ixgbe_qflush(ifp) in ixgbe_netmap.c) and
I can reproduce the issue after several netmap start/stop cycles.



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #26 from Sylvain Galliano  ---
(In reply to Stephen Hurd from comment #25)

Unfortunately it's not working.

Here is the patch I applied:

--- sys/dev/ixgbe/if_ix.c   (revision 332482)
+++ sys/dev/ixgbe/if_ix.c   (working copy)
@@ -3568,6 +3568,7 @@
    mtx_assert(&adapter->core_mtx, MA_OWNED);

    INIT_DEBUGOUT("ixgbe_stop: begin\n");
+   ixgbe_qflush(ifp);
    ixgbe_disable_intr(adapter);
    callout_stop(&adapter->timer);



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #25 from Stephen Hurd  ---
(In reply to Sylvain Galliano from comment #24)

Hrm, could you try putting an ixgbe_qflush(ifp) in ixgbe_stop() before the
interrupt is disabled?  My current theory is that the TX queue is being left in
a bad state (which is why the delay helps).

I don't currently have an 11-STABLE system with an ixgbe in it to test on.



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #24 from Sylvain Galliano  ---
(In reply to Stephen Hurd from comment #22)

Hello Stephen,

Your patch works when using netmap, but the issue with ifconfig down/up in a
loop is back (see the little script in comment #14).



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

Stephen Hurd  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #191979|0                           |1
        is obsolete|                            |

--- Comment #23 from Stephen Hurd  ---
Created attachment 192502
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=192502&action=edit
Attempt to remove 1-second spin

Assuming the previous commit still works around the issue, please try the
attached patch.



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #22 from Stephen Hurd  ---
Can you test with r332481 and ensure it still works around the issue?



[Bug 221317] Netmap issue after ixgbe driver update in r320897

2018-04-13 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #21 from commit-h...@freebsd.org ---
A commit references this bug:

Author: shurd
Date: Fri Apr 13 17:45:54 UTC 2018
New revision: 332481
URL: https://svnweb.freebsd.org/changeset/base/332481

Log:
  Move 1-second spin into ixgbe_netmap_reg()

  This should still work around the netmap issue, but should not impact other
  calls to ixgbe_stop().

  PR:   221317
  Sponsored by: Limelight Networks

Changes:
  stable/11/sys/dev/ixgbe/if_ix.c
  stable/11/sys/dev/ixgbe/ixgbe_netmap.c
