I had an email conversation with Rick Macklem about six months ago about NFS server bottlenecks. I'm now in a position to observe my large-scale NFS server under an actual production load, so I thought I would update folks on what it looks like. This is a 9.1 prerelease kernel (I hope 9.1 will be released soon, as I have four more of these servers to deploy!). When under nearly 100% load on an 8-core (16-thread) Quanta QSSC-S99Q storage server with a 10G network interface, pmcstat tells me this:
PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%) , 27 unresolved
Key: q => exiting...

%SAMP IMAGE     FUNCTION             CALLERS
 29.3 kernel    _mtx_lock_sleep      nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel    cpu_search_highest   cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko    lzjb_decompress      zio_decompress
  4.3 kernel    _mtx_lock_spin       turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko    fletcher_4_native    zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel    cpu_search_lowest    cpu_search_lowest
  3.3 kernel    nfsrc_trimcache      nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel    ipfw_chk             ipfw_check_hook
  2.1 pmcstat   _init
  1.1 kernel    _sx_xunlock
  0.9 kernel    _sx_xlock
  0.9 kernel    spinlock_exit

This does seem to confirm my original impression that the NFS replay
cache is quite expensive.  Running a gprof(1) analysis on the same PMC
data reveals a bit more detail (I've removed some uninteresting parts
of the call graph):

                                      called/total       parents
index  %time      self  descendents   called+self    name           index
                                      called/total       children

                4881.00   2004642.70  932627/932627      svc_run_internal [2]
[4]     45.1    4881.00   2004642.70  932627         nfssvc_program [4]
               13199.00    504436.33  584319/584319      nfsrvd_updatecache [9]
               23075.00    403396.18  468009/468009      nfsrvd_getcache [14]
                1032.25    416249.44    2239/2284        svc_sendreply_mbuf [15]
                6168.00    381770.44   11618/11618       nfsrvd_dorpc [24]
                3526.87     86869.88  112478/112514      nfsrvd_sentcache [74]
                 890.00     50540.89    4252/4252        svc_getcred [101]
               14876.60     32394.26    4177/24500       crfree <cycle 3> [263]
               11550.11     25150.73    3243/24500       free <cycle 3> [102]
                1348.88     15451.66    2716/16831       m_freem [59]
                4066.61       216.81    1434/1456        svc_freereq [321]
                2342.15       677.40     557/1459        malloc_type_freed [265]
                  59.14      1916.84     134/2941        crget [113]
                1602.25         0.00     322/9682        bzero [105]
                 690.93         0.00      43/44          getmicrotime [571]
                 287.22         7.33     138/1205        prison_free [384]
                 233.61         0.00      60/798         PHYS_TO_VM_PAGE [358]
                 203.12         0.00      94/230         nfsrv_mallocmget_limit [632]
                 151.76         0.00      51/1723        pmap_kextract [309]
                   0.78        70.28       9/3281        _mtx_unlock_sleep [154]
                  19.22        16.88      38/400403      nfsrc_trimcache [26]
                  11.05        21.74       7/197         crsetgroups [532]
                  30.37         0.00      11/6592        critical_enter [190]
                  25.50         0.00       9/36          turnstile_chain_unlock [844]
                  24.86         0.00       3/7           nfsd_errmap [913]
                  12.36         8.57       8/2145        in_cksum_skip [298]
                   9.10         3.59       5/12455       mb_free_ext [140]
                   1.84         4.85       2/2202        VOP_UNLOCK_APV [269]
-----------------------------------------------
                   0.49         0.15       1/1129009     uhub_explore [1581]
                   0.49         0.15       1/1129009     tcp_output [10]
                   0.49         0.15       1/1129009     pmap_remove_all [1141]
                   0.49         0.15       1/1129009     vm_map_insert [236]
                   0.49         0.15       1/1129009     vnode_create_vobject [281]
                   0.49         0.15       1/1129009     biodone [351]
                   0.49         0.15       1/1129009     vm_object_madvise [670]
                   0.49         0.15       1/1129009     xpt_done [483]
                   0.49         0.15       1/1129009     vputx [80]
                   0.49         0.15       1/1129009     vm_map_delete <cycle 3> [49]
                   0.49         0.15       1/1129009     vm_object_deallocate <cycle 3> [356]
                   0.49         0.15       1/1129009     vm_page_unwire [338]
                   0.49         0.15       1/1129009     pmap_change_wiring [318]
                   0.98         0.31       2/1129009     getnewvnode [227]
                   0.98         0.31       2/1129009     pmap_clear_reference [1004]
                   0.98         0.31       2/1129009     usbd_do_request_flags [1282]
                   0.98         0.31       2/1129009     vm_object_collapse <cycle 3> [587]
                   0.98         0.31       2/1129009     vm_object_page_remove [122]
                   1.48         0.46       3/1129009     mpt_pci_intr [487]
                   1.48         0.46       3/1129009     pmap_extract [355]
                   1.48         0.46       3/1129009     vm_fault_unwire [171]
                   1.97         0.62       4/1129009     vgonel [270]
                   1.97         0.62       4/1129009     vm_object_shadow [926]
                   1.97         0.62       4/1129009     zone_alloc_item [434]
                   2.46         0.77       5/1129009     vnlru_free [235]
                   2.46         0.77       5/1129009     insmntque1 [737]
                   2.95         0.93       6/1129009     zone_free_item [409]
                   3.94         1.24       8/1129009     pmap_enter_object [805]
                   4.92         1.55      10/1129009     vfs_ref [722]
                   4.92         1.55      10/1129009     pmap_clear_modify [814]
                   5.41         1.70      11/1129009     vn_isdisk [391]
                   5.90         1.86      12/1129009     rtalloc1_fib [158]
                   5.90         1.86      12/1129009     soreceive_generic [99]
                   6.40         2.01      13/1129009     pmap_remove_pages [403]
                   6.89         2.16      14/1129009     _mtx_lock_flags [565]
                   6.89         2.16      14/1129009     random_yarrow_read [542]
                   6.89         2.16      14/1129009     nfsrv_checkgetattr [1133]
                   7.38         2.32      15/1129009     svc_vc_recv [55]
                   7.38         2.32      15/1129009     ixgbe_rxeof [12]
                   8.85         2.78      18/1129009     _cv_timedwait_sig [859]
                   9.35         2.94      19/1129009     vref [242]
                  10.82         3.40      22/1129009     pmap_enter_quick [690]
                  11.31         3.56      23/1129009     tcp_do_segment [35]
                  13.28         4.18      27/1129009     sosend_generic [25]
                  15.74         4.95      32/1129009     g_io_schedule_down [76]
                  19.19         6.03      39/1129009     vn_start_write [400]
                  20.66         6.49      42/1129009     ip_output [32]
                  23.12         7.27      47/1129009     pmap_copy [467]
                  23.12         7.27      47/1129009     __mnt_vnode_next_all [520]
                  29.52         9.28      60/1129009     vm_page_free_toq [149]
                  30.50         9.59      62/1129009     vm_fault_hold [268]
                  36.90        11.60      75/1129009     g_io_request [172]
                  38.37        12.06      78/1129009     prison_hold [330]
                  38.86        12.22      79/1129009     taskqueue_member [423]
                  39.36        12.37      80/1129009     vm_page_deactivate [812]
                  39.85        12.53      81/1129009     vm_page_activate [790]
                  40.83        12.83      83/1129009     in_matroute [197]
                  45.75        14.38      93/1129009     vm_page_alloc [128]
                  48.70        15.31      99/1129009     prison_free [384]
                  66.90        21.03     136/1129009     vholdl [275]
                  73.30        23.04     149/1129009     vfs_unbusy [470]
                  82.65        25.98     168/1129009     key_addref [233]
                  85.60        26.91     174/1129009     vn_finished_write [407]
                  91.50        28.76     186/1129009     _key_freesp [286]
                  92.48        29.07     188/1129009     vfs_busy [288]
                 102.32        32.16     208/1129009     _vm_object_allocate [738]
                 107.24        33.71     218/1129009     taskqueue_enqueue [68]
                 112.16        35.26     228/1129009     xprt_active [267]
                 125.94        39.59     256/1129009     vm_pageq_remove [650]
                 128.89        40.51     262/1129009     svc_freereq [321]
                 128.89        40.51     262/1129009     pmap_remove [106]
                 153.49        48.25     312/1129009     vdropl [251]
                 154.96        48.71     315/1129009     vfs_busyfs [239]
                 180.05        56.60     366/1129009     pmap_enter [245]
                 182.02        57.22     370/1129009     svc_run_internal [2]
                 213.50        67.11     434/1129009     ixgbe_msix_que [11]
                 248.43        78.09     505/1129009     key_allocsp [221]
                 255.32        80.26     519/1129009     camisr [107]
                 373.87       117.52     760/1129009     ipmi_poll [635]
                 404.87       127.27     823/1129009     g_io_deliver [162]
                 440.78       138.56     896/1129009     g_io_schedule_up [137]
                 462.91       145.51     941/1129009     taskqueue_run_locked [43]
                 744.30       233.97    1513/1129009     _sleep [150]
                 906.15       284.84    1842/1129009     dastrategy [116]
                1170.32       367.88    2379/1129009     mps_intr_msi [194]
                1329.22       417.83    2702/1129009     ixgbe_handle_que [56]
               15078.43      4739.79   30651/1129009     nfsrc_unlock [174]
               25869.62      8131.91   52587/1129009     _vm_map_lock [129]
               44082.69     13857.05   89610/1129009     nfsrvd_sentcache [74]
              113077.19     35544.94  229860/1129009     nfsrc_trimcache [26]
              149533.34     47004.65  303967/1129009     nfsrvd_getcache [14]
              198698.26     62459.27  403908/1129009     nfsrvd_updatecache [9]
[5]     16.4  555404.00    174586.97  1129009        _mtx_lock_sleep [5]
                9074.71    139536.06   87523/87603       turnstile_trywait [45]
                3085.07     11049.12    3870/3964        turnstile_wait [207]
                 166.20     10020.76    4390/4728        lockstat_nsecs [229]
                 505.86       276.39     322/324         turnstile_cancel [552]
                 516.23       123.24     384/341743      spinlock_exit <cycle 1> [136]
                 161.51         0.00      61/9744        spinlock_enter [156]
                  44.18         0.00      16/6592        critical_enter [190]
                  23.20         0.00       4/35          turnstile_setowner [767]
                   1.35         3.10       1/1531        propagate_priority [280]
-----------------------------------------------
               13199.00    504436.33  584319/584319      nfssvc_program [4]
[9]     11.6   13199.00    504436.33  584319         nfsrvd_updatecache [9]
              198698.26     62459.27  403908/1129009     _mtx_lock_sleep [5]
              107453.19     94387.86  212423/400403      nfsrc_trimcache [26]
                1974.56     16386.18    2411/3646        nfsrc_freecache [148]
                 120.15     10877.79    1393/3281        _mtx_unlock_sleep [154]
                 139.67     10006.86       2/12          lapic_handle_intr [93]
                 257.17       946.79     431/6821        m_copym [186]
                 539.54         0.00     124/26922       bcopy [53]
                   2.07        31.89      20/87603       turnstile_trywait [45]
                  10.68        23.27       3/24500       free <cycle 3> [102]
                  25.54         6.10      19/341743      spinlock_exit <cycle 1> [136]
                  28.33         0.00      10/36          turnstile_chain_unlock [844]
                  13.58        10.56       8/30514       uma_zalloc_arg [70]
                  18.35         0.00       7/37          nfsrc_wanted [851]
                  10.90         0.83       5/284         turnstile_lookup [582]
                   0.11         6.85       3/4728        lockstat_nsecs [229]
-----------------------------------------------
               23075.00    403396.18  468009/468009      nfssvc_program [4]
[14]     9.6   23075.00    403396.18  468009         nfsrvd_getcache [14]
              149533.34     47004.65  303967/1129009     _mtx_lock_sleep [5]
               95069.59     83510.00  187942/400403      nfsrc_trimcache [26]
                1104.58     13395.38     993/8846        malloc [47]
                 117.22     10612.28    1359/3281        _mtx_unlock_sleep [154]
                1570.18      1088.69    1016/2145        in_cksum_skip [298]
                 134.35         0.00      27/9682        bzero [105]
                 127.38         0.00      46/1196        in_cksumdata [347]
                  28.23         6.74      21/341743      spinlock_exit <cycle 1> [136]
                   1.87        28.70      18/87603       turnstile_trywait [45]
                  25.50         0.00       9/36          turnstile_chain_unlock [844]
                  11.88         9.24       7/30514       uma_zalloc_arg [70]
                  13.08         0.99       6/284         turnstile_lookup [582]
                   0.04         2.28       1/4728        lockstat_nsecs [229]
-----------------------------------------------
                  19.22        16.88      38/400403      nfssvc_program [4]
               95069.59     83510.00  187942/400403      nfsrvd_getcache [14]
              107453.19     94387.86  212423/400403      nfsrvd_updatecache [9]
[26]     8.5  202542.00    177914.74  400403         nfsrc_trimcache [26]
              113077.19     35544.94  229860/1129009     _mtx_lock_sleep [5]
                 657.00     19128.82    5754/5754        nfsrv_checksockseqnum [183]
                1011.44      8393.58    1235/3646        nfsrc_freecache [148]
                  14.25        31.02       4/24500       free <cycle 3> [102]
                   2.48        28.45       5/16831       m_freem [59]
                   0.93        14.35       9/87603       turnstile_trywait [45]
                   7.86         0.00       3/37          nfsrc_wanted [851]
                   1.57         0.86       1/324         turnstile_cancel [552]
-----------------------------------------------

The full gprof output for the kernel is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.txt>.
An amazing visualization of this (thanks to gprof2dot.py and dot) is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png>.
There are some functions missing from the visualization because the
gprof buckets overflowed.  Is there a documented way to merge the
kernel and ZFS gprof data output by pmcstat(1) so that those functions
will be properly accounted for?

-GAWollman

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
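[Editorial sketch appended: absent a documented way to merge the profiles, one crude stopgap is to combine just the flat-profile sections of the two gprof(1) text outputs by summing each function's self-time column. The script below is a hypothetical illustration, not an existing tool; the flat-profile column layout it assumes may need adjusting for a particular gprof version, and it does not merge the call graph.]

```python
import re

# Assumed gprof flat-profile line layout (verify against your gprof output):
#   %time  cumulative  self  [calls  [self/call  total/call]]  name
FLAT_RE = re.compile(
    r"^\s*[\d.]+\s+[\d.]+\s+(?P<self>[\d.]+)\s+"
    r"(?:\d+\s+)?(?:[\d.]+\s+[\d.]+\s+)?(?P<name>\S.*)$"
)

def parse_flat(text):
    """Map each function name to its 'self' column from one gprof flat profile."""
    out = {}
    for line in text.splitlines():
        m = FLAT_RE.match(line)
        if m:
            name = m.group("name").strip()
            out[name] = out.get(name, 0.0) + float(m.group("self"))
    return out

def merge_profiles(*profiles):
    """Combine parsed flat profiles (e.g. kernel and zfs.ko) by summing self time."""
    merged = {}
    for prof in profiles:
        for name, self_time in prof.items():
            merged[name] = merged.get(name, 0.0) + self_time
    return merged
```

Feeding it the kernel and zfs.ko flat profiles would at least give a single combined ranking, so functions that overflowed out of one image's buckets still show up somewhere.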