Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
On 04/14/2018 03:49 AM, Rick Macklem wrote:
> Niels Kobschätzki wrote:
>> sorry for the cross-posting but so far I had no real luck on the forum
>> or on question, thus I want to try my luck here as well.
> I read email lists but don't do the other stuff, so I just saw this yesterday.
> Short answer, I haven't a clue why cache hit rate would have changed.
>
> The code that decides if there is a hit/miss for the attribute cache is in
> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
> except the old code did a mtx_lock(), but I can't imagine how that
> would affect the code.
>
> You might want to:
> # sysctl -a | fgrep vfs.nfs
> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
> been changed. (I don't recall any being changed, but??)

I did that and nothing had changed.

> You could go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
> top, where it calculates "timeo" from them.
> Running this hacked kernel might show you if either of these fields is bogus.
> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
> clause that increments "attrcache_misses", which is where the cache misses
> happen, to see why it is missing the cache.)
> If you could do this for the 10.3 kernel as well, this might indicate why the
> miss rate has increased?

I will do this next week. On Monday we are switching to other NFS servers
for unrelated reasons, and once we see that they run stable, I will do this
next.

Btw. I have now calculated the percentages. The old servers had an attr miss
rate of something like 0.004%, while the upgraded one has more like 2.7%.
This is still low from what I've read (I remember that you should start
adjusting acreg* when you hit more than 40% misses) but far higher than
before.
nfsstat -c for one of the working servers looks like this (I did a -cz
before to reset it and ran this a couple of seconds later):

Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
 10085375       255   9163995       577       540         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
     1380         0         0         0         0         0   9169427       277

and for the non-working server:

Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
  1606365     20647   1418205       239       581         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
      895         0         0         0         0         0   1439080       337

>> I upgraded a machine from 10.3-Prerelease (custom kernel with
>> tcp_fastopen added) to 11.1-Release (standard kernel) with
>> freebsd-update. I have two other machines that are still on
>> 10.3-Prerelease. Those machines mount an NFS-export from a
>> Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>> far more cache misses for getattr than on the 10.3-machines (we talk a
>> factor of 100) in munin. munin also shows a lot more cache-misses for
>> other metrics like biow, biorl, biod (where can I find what those
>> metrics mean…currently I have not even an understanding what these are)
>> etc.
>>
>> Can anybody help me how I can debug this problem or has an idea what
>> could cause the problem? The result of this behavior is that this
>> machine shows a lower performance than the others and I cannot upgrade
>> other machines before I have fixed this bug.
> I haven't run a 10.x system in quite a while. When I get home in a few days,
> I might be able to reproduce this. If I can, I can poke at it, but it would
> be at least a week before I might have an answer and I may not figure it out
> for a long time.

Ok, thanks a lot. That would be great.

Niels
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
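The attribute-cache miss rate for each of the two nfsstat samples in this message can be worked out directly from the "Attr Hits" and "Misses" counters. A minimal Python sketch follows; the function name miss_rate_percent and the dictionary labels are only illustrative, and the counters are copied from the two outputs above.

```python
def miss_rate_percent(hits, misses):
    """Return misses as a percentage of all cache lookups (hits + misses)."""
    total = hits + misses
    if total == 0:
        return 0.0  # no lookups recorded yet, e.g. right after a counter reset
    return 100.0 * misses / total


# "Attr" counters copied from the two nfsstat -c samples in this message.
samples = {
    "working server": (10085375, 255),
    "non-working server": (1606365, 20647),
}

for name, (hits, misses) in samples.items():
    print(f"{name}: {miss_rate_percent(hits, misses):.4f}% attr misses")
```

These two short snapshots come out at roughly 0.0025% and 1.27%. The result depends on the sampling window, so it need not match the 0.004% and 2.7% quoted earlier in the message.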
Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Niels Kobschätzki wrote:
>sorry for the cross-posting but so far I had no real luck on the forum
>or on question, thus I want to try my luck here as well.
I read email lists but don't do the other stuff, so I just saw this yesterday.
Short answer, I haven't a clue why cache hit rate would have changed.

The code that decides if there is a hit/miss for the attribute cache is in
ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
except the old code did a mtx_lock(), but I can't imagine how that
would affect the code.

You might want to:
# sysctl -a | fgrep vfs.nfs
for both the 10.3 and 11.1 systems, to check if any defaults have somehow
been changed. (I don't recall any being changed, but??)

You could go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
top, where it calculates "timeo" from them.
Running this hacked kernel might show you if either of these fields is bogus.
(You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
clause that increments "attrcache_misses", which is where the cache misses
happen, to see why it is missing the cache.)
If you could do this for the 10.3 kernel as well, this might indicate why the
miss rate has increased?

>I upgraded a machine from 10.3-Prerelease (custom kernel with
>tcp_fastopen added) to 11.1-Release (standard kernel) with
>freebsd-update. I have two other machines that are still on
>10.3-Prerelease. Those machines mount an NFS-export from a
>Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>far more cache misses for getattr than on the 10.3-machines (we talk a
>factor of 100) in munin. munin also shows a lot more cache-misses for
>other metrics like biow, biorl, biod (where can I find what those
>metrics mean…currently I have not even an understanding what these are)
>etc.
>
>Can anybody help me how I can debug this problem or has an idea what
>could cause the problem? The result of this behavior is that this
>machine shows a lower performance than the others and I cannot upgrade
>other machines before I have fixed this bug.
I haven't run a 10.x system in quite a while. When I get home in a few days,
I might be able to reproduce this. If I can, I can poke at it, but it would
be at least a week before I might have an answer and I may not figure it out
for a long time.

rick
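The sysctl comparison suggested in this message can be done mechanically once the output of "sysctl -a | fgrep vfs.nfs" from each system has been saved as text. A minimal Python sketch, assuming the usual "name: value" line format; the helper names and the sample setting names and values are hypothetical placeholders, not taken from either system.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines, as printed by sysctl, into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # ignore lines that do not have a 'name: value' shape
            settings[name.strip()] = value.strip()
    return settings


def diff_sysctl(old, new):
    """Return {name: (old_value, new_value)} for every setting that differs.

    A value of None means the setting is absent on that system.
    """
    changed = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Placeholder input: replace with the saved sysctl output of each system.
old_text = "vfs.nfs.example_a: 60\nvfs.nfs.example_b: 1\n"
new_text = "vfs.nfs.example_a: 60\nvfs.nfs.example_b: 0\nvfs.nfs.example_c: 5\n"

for name, (before, after) in diff_sysctl(
        parse_sysctl(old_text), parse_sysctl(new_text)).items():
    print(f"{name}: {before!r} -> {after!r}")
```

An empty result would confirm that no vfs.nfs default differs between the two releases; settings that exist on only one side show up with None for the other.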
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

Stephen Hurd changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #192502|0                           |1
        is obsolete|                            |

--- Comment #28 from Stephen Hurd ---
Created attachment 192505
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=192505&action=edit
Additional debugging in ixgbe_stop()

This patch won't solve the problem, but it will log errors encountered in
ixgbe_stop() if any. If there are no errors logged in dmesg, I'm curious if
that delay needs to be at the beginning of the call to stop, or if it can be
moved to just before the init_locked() call.

If there's an error, possibly just retrying after a short delay will help,
but if not, I'll see if I can get an 11-STABLE system up and running this
weekend.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #27 from Sylvain Galliano ---
(In reply to Stephen Hurd from comment #25)

In my first test, I used commit r332481 (with msec_delay moved into the
netmap code) -> it worked with netmap only (not for ifconfig down/up).

I've just tested your attached patch (ixgbe_qflush(ifp) in ixgbe_netmap.c)
and I can reproduce the issue after several netmap start/stop cycles.
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #26 from Sylvain Galliano ---
(In reply to Stephen Hurd from comment #25)

Unfortunately it's not working.

Here is the patch I applied:

--- sys/dev/ixgbe/if_ix.c       (revision 332482)
+++ sys/dev/ixgbe/if_ix.c       (working copy)
@@ -3568,6 +3568,7 @@
        mtx_assert(&adapter->core_mtx, MA_OWNED);

        INIT_DEBUGOUT("ixgbe_stop: begin\n");
+       ixgbe_qflush(ifp);
        ixgbe_disable_intr(adapter);
        callout_stop(&adapter->timer);
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #25 from Stephen Hurd ---
(In reply to Sylvain Galliano from comment #24)

Hrm, could you try putting an ixgbe_qflush(ifp) in ixgbe_stop() before the
interrupt is disabled? My current theory is that the TX queue is being left
in a bad state (which is why the delay helps).

I don't currently have an 11-STABLE system with an ixgbe in it to test on.
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #24 from Sylvain Galliano ---
(In reply to Stephen Hurd from comment #22)

Hello Stephen,

Your patch works when using netmap, but the issue with ifconfig down/up in a
loop is back (see the little script in comment #14).
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

Stephen Hurd changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #191979|0                           |1
        is obsolete|                            |

--- Comment #23 from Stephen Hurd ---
Created attachment 192502
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=192502&action=edit
Attempt to remove 1-second spin

Assuming the previous commit still works around the issue, please try the
attached patch.
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #22 from Stephen Hurd ---
Can you test with r332481 and ensure it still works around the issue?
[Bug 221317] Netmap issue after ixgbe driver update in r320897
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221317

--- Comment #21 from commit-h...@freebsd.org ---
A commit references this bug:

Author: shurd
Date: Fri Apr 13 17:45:54 UTC 2018
New revision: 332481
URL: https://svnweb.freebsd.org/changeset/base/332481

Log:
  Move 1-second spin into ixgbe_netmap_reg()

  This should still work around the netmap issue, but should not impact other
  calls to ixgbe_stop().

  PR:           221317
  Sponsored by: Limelight Networks

Changes:
  stable/11/sys/dev/ixgbe/if_ix.c
  stable/11/sys/dev/ixgbe/ixgbe_netmap.c