We know that mac_tx has two call paths to go through. One is the syscall path, like the following:

  dld`str_mdata_fastpath_put+0xa4
  ip`tcp_send_data+0x8b4
  ip`tcp_output+0x7ea
  ip`squeue_enter+0x330
  ip`tcp_sendmsg+0xfb
  sockfs`so_sendmsg+0x1c7
  sockfs`socket_sendmsg+0x61
  sockfs`sendit+0x167
  sockfs`send+0x78
  sockfs`send32+0x22
  unix`_sys_sysenter_post_swapgs+0x14b
The other path is the worker thread path, like the following:

  dld`str_mdata_fastpath_put+0xa4
  ip`tcp_send_data+0x8b4
  ip`tcp_send+0xb01
  ip`tcp_wput_data+0x721
  ip`tcp_rput_data+0x33d1
  ip`squeue_drain+0x179
  ip`squeue_enter+0x3f4
  ip`ip_input+0xc17
  mac`mac_rx_soft_ring_drain+0xdf
  mac`mac_soft_ring_worker+0x111
  unix`thread_start+0x8

I counted how often each mac_tx call path was taken over 10 seconds from another dterm. Before using the dtrace command to improve performance, the call path distribution was:

  Syscall Path:        641615
  Worker Thread Path:  482210

After using the dtrace command to improve performance, the call path distribution was:

  Syscall Path:        319273
  Worker Thread Path: 1061620

Thanks
Zhihui

2009/4/1 zhihui Chen <zhchen3 at gmail.com>

> Thanks, I have tried your method and the poll function is disabled. After
> that, context switches decreased very much, but performance still remained
> at 8.8Gbps. Mpstat output like the following:
>
>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>    0    0   0    0    33    1    15    8    0 1366    0 134084   3  97   0   0
>    1   10   0    0    56   16 35208    6    8 1736    0      9   0  48   0  52
>    2    4   0   37 19646 19618    58    0   10  408    0    154   0  34   0  66
>    3    0   0    0   308  107   118    0    6    1    0    185   0   0   0 100
>
> Then I used the same dtrace command again; performance improved to 9.5Gbps
> and context switches were also reduced from 35000 to 6300. Mpstat output
> like the following:
>
>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>    0    0   0    0    17    3  1858    8    3  605    0 142000   3  81   0  16
>    1    0   0    0    15    6  4472    0    2 2516    0      0   0  92   0   8
>    2    0   0    0 19740 19679   126    0    6  253    0    289   0  46   0  54
>    3    0   0    9   509  208    66    0    4    0    0    272   0   0   0 100
>
> Because this is a TX-heavy workload, maybe we should care more about the
> thread mac_soft_ring_worker?
>
> Thanks
> Zhihui
>
> 2009/4/1 Sunay Tripathi <Sunay.Tripathi at sun.com>
>
>> Sure. Just run this and polling will get disabled:
>> % dladm create-vnic -l ixgbe0 vnic1
>>
>> Let us know what you get with polling disabled.
>> We don't have a tunable to disable polling, but since ixgbe can't assign
>> rx rings to a VNIC yet, it disables polling for the primary NIC as well.
>>
>> Cheers,
>> Sunay
>>
>> zhihui Chen wrote:
>>
>>> During my test of 10GbE (Intel ixgbe) with snv_110, I find that context
>>> switching is a big problem for performance.
>>>
>>> Benchmark: Netperf-2.4.4
>>> Workload: TCP_STREAM (sending 8KB-size tcp packets from the SUT to a
>>> remote machine)
>>>
>>> Crossbow uses two kernel threads (mac_soft_ring_worker and
>>> mac_rx_srs_poll_ring) to help send and receive packets in the kernel. On
>>> a multi-core or multi-processor system, these two threads and the
>>> interrupt for the nic can run on different CPUs. Consider the following
>>> scenario on my 4-core system:
>>>
>>>   mac_soft_ring_worker: CPU 1
>>>   mac_rx_srs_poll_ring: CPU 1
>>>   Interrupt:            CPU 2
>>>
>>> I run the workload and bind the application to the free CPU 0; I get
>>> performance of 8.8Gbps and mpstat output like the following:
>>>
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    0    17    3    21    9    0 1093    0 134501   3  97   0   0
>>>    1    0   0    0    29   13 56972    2    7  992    0      2   0  50   0  50
>>>    2   14   0    0 19473 19455    37    0    8    0    0    149   0  28   0  72
>>>    3    0   0    1   305  104   129    0    4    1    0      9   0   0   0 100
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    0    14    2     2    7    0 1120    0 133511   3  97   0   0
>>>    1    0   0    0    24   12 54501    2    6  971    0      2   0  48   0  52
>>>    2    0   0    0 19668 19648    45    0    9    0    0    149   0  28   0  72
>>>    3    0   0    0   306  104   128    0    6    0    0     11   0   0   0 100
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    0    14    2    21    8    2 1107    0 134569   3  97   0   0
>>>    1    0   0    0    32   16 57522    2    6  928    0      2   0  50   0  50
>>>    2    0   0    0 19564 19542    46    0   10    1    0    140   0  28   0  72
>>>    3    0   0    0   306  104   122    0    7    0    0     58   0   0   0 100
>>>
>>> Next, I just run one dtrace command:
>>>
>>>   dtrace -n 'mac_tx:entry{@[probefunc,stack()]=count();}'
>>>
>>> Then I get performance of 9.57Gbps and mpstat output
>>> like the following:
>>>
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    0    23    5  2055    9    4  529    0 142719   3  81   0  15
>>>    1    0   0    1    21    8 24343    0    2 2523    0      0   0  88   0  12
>>>    2   14   0    5 19678 19537    81    0    5    0    0    150   0  43   0  57
>>>    3    0   0    6   308  104    93    0    5    2    0    278   0   0   0 100
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    2    19    4  1998    9    6  556    0 142911   3  82   0  16
>>>    1    0   0    0    20    8 23543    1    2 2556    0      0   0  88   0  12
>>>    2    0   0    6 19647 19499   106    0    8    1    0    266   0  43   0  57
>>>    3    0   0    2   308  104    70    0    5    1    0     28   0   0   0 100
>>>  CPU minf mjf xcal  intr ithr   csw icsw migr smtx  srw  syscl usr sys  wt idl
>>>    0    0   0    0    21    3  1968   10    4  556    0 144547   3  82   0  15
>>>    1    0   0    0    20   10 23334    0    2 2622    0      0   0  90   0  10
>>>    2    0   0    9 19797 19658    92    0   10    1    0    274   0  44   0  56
>>>    3    0   0    0   307  104    95    0    6    2    0    182   0   0   0  99
>>>
>>> I don't think dtrace itself can improve the performance of the nic. If
>>> you compare the mpstat output, the biggest difference is that context
>>> switches have been reduced very much, from 55000 to 23000. This leads to
>>> my point that too many context switches hinder the performance of
>>> crossbow. If I make these two kernel threads and the interrupt run on
>>> totally different cores, performance drops to about 7.8Gbps while context
>>> switches increase to about 80000 per second, but interrupts remain at
>>> about 19500/s.
>>>
>>> In crossbow, the thread mac_soft_ring_worker is woken up by
>>> mac_rx_srs_poll_ring and by the interrupt through calls to the function
>>> mac_soft_ring_worker_wakeup. So I think that if I can disable the
>>> polling function, context switches should be reduced.
>>> Thanks
>>> Zhihui
>>>
>>> 2009/4/1 rajagopal kunhappan <rajagopal.kunhappan at sun.com>
>>>
>>>     Hi Zhihui,
>>>
>>>         In crossbow, each mac_srs has a kernel thread called
>>>         "mac_rx_srs_poll_ring" to poll the hardware, and crossbow will
>>>         wake up this thread automatically to poll packets from the
>>>         hardware. Does crossbow provide any method to disable the
>>>         polling mechanism, for example by disabling this kernel thread?
>>>
>>>     Presently no. Can we know why you would want to do that?
>>>
>>>     Thanks,
>>>     -krgopi
>>>
>>>         Thanks
>>>         Zhihui
>>>
>>> _______________________________________________
>>> crossbow-discuss mailing list
>>> crossbow-discuss at opensolaris.org
>>> http://mail.opensolaris.org/mailman/listinfo/crossbow-discuss
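
For reference, the per-path mac_tx counts quoted at the top of this thread could be gathered with a D script along these lines. This is a sketch, not the exact script the poster used; it builds on the dtrace one-liner shown in the thread, aggregating mac_tx entries by kernel stack and exiting after 10 seconds:

  #!/usr/sbin/dtrace -s
  /* Count mac_tx() entries by kernel call stack for 10 seconds (sketch). */
  mac_tx:entry
  {
          @paths[stack()] = count();
  }

  tick-10s
  {
          exit(0);
  }

In the resulting output, stacks ending in unix`_sys_sysenter_post_swapgs correspond to the syscall path, and stacks containing mac`mac_soft_ring_worker correspond to the worker thread path.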