On 07/23/2012 12:41 AM, Pekka Riikonen wrote:
> Hi,
>
> In our 64 byte packet test with 12 10GbE ports we encountered some
> interesting softlockups and interrupt rates. For some reason we suddenly
> started seeing softlockups, usually in kworker (doing various work), while
> processing packets. In this test we sent a total of 40 Mpps to all ports,
> and we use a heavily modified ixgbe from sourceforge.net, pause frames off.
>
> Softlockups such as:
>
> [ 250.133274] BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:1:77]
> [ 250.133404] Process kworker/10:1 (pid: 77, threadinfo ffff88107c7c0000,
> [ 250.133441] Call Trace:
> [ 250.133444] <IRQ>
> [ 250.133456] [<ffffffffa0048a89>] ixgbe_clean_rx_irq+0x269/0x4e0 [ixgbe]
> [ 250.133464] [<ffffffffa004932c>] ixgbe_poll+0x25c/0x660 [ixgbe]
> [ 250.133472] [<ffffffff81397e87>] net_rx_action+0xa7/0x2a0
> [ 250.133481] [<ffffffff8103c108>] __do_softirq+0x98/0x120
> [ 250.133489] [<ffffffff81469f8c>] call_softirq+0x1c/0x30
> [ 250.133497] [<ffffffff810044fd>] do_softirq+0x4d/0x80
> [ 250.133503] [<ffffffff8103c3b5>] irq_exit+0x65/0x70
> [ 250.133508] [<ffffffff8100439e>] do_IRQ+0x5e/0xd0
> [ 250.133517] [<ffffffff81468813>] common_interrupt+0x13/0x13
> [ 250.133521] <EOI>
> [ 250.133528] [<ffffffffa004eb5f>] ? ixgbe_update_stats+0x13f/0xca0 [ixgbe]
> [ 250.133535] [<ffffffff8146880e>] ? common_interrupt+0xe/0x13
> [ 250.133543] [<ffffffffa004fd55>] ixgbe_service_task+0x695/0x970 [ixgbe]
> [ 250.133551] [<ffffffffa004f6c0>] ? ixgbe_update_stats+0xca0/0xca0 [ixgbe]
> [ 250.133558] [<ffffffff8104c0c1>] process_one_work+0x101/0x390
> [ 250.133564] [<ffffffff8104c91f>] worker_thread+0x15f/0x350
> [ 250.133569] [<ffffffff8104c7c0>] ? manage_workers.isra.32+0x220/0x220
> [ 250.133577] [<ffffffff81050b27>] kthread+0x87/0x90
> [ 250.133584] [<ffffffff81469e94>] kernel_thread_helper+0x4/0x10
> [ 250.133590] [<ffffffff81050aa0>] ? kthread_worker_fn+0x130/0x130
> [ 250.133595] [<ffffffff81469e90>] ? gs_change+0xb/0xb
>
> I traced the problem to the NAPI poll return value in ixgbe_poll() when
> exiting polling mode. In that case ixgbe returns 0, not the actual amount
> of work done. This helps throughput, but it also makes the NET_RX softirq
> run longer in interrupt context. OTOH, if I change it to the true work
> done value, throughput suffers too much, so I settled on workdone >> 2 as
> a hack.
>
> But I still wanted to know why this problem happens, because even if 0 is
> returned from poll(), softirqs aren't designed to run forever. So I
> started looking at the interrupt rate and noticed that in this particular
> test it oscillated a lot, sometimes going up to 300k+ ints/sec, even
> though the traffic was stable. Apparently the interrupt rate was so high
> that it could starve user context.
>
> The problem went away with the hack in ixgbe_poll(), but it got me
> thinking: why is the ITR value updated only after napi_complete(), and
> not on every ixgbe_poll()? It should be more stable if it were updated at
> each poll().
>
> The problem also went away when reducing the number of ports, which makes
> me think it will reappear when we finally start testing with 16-24 ports.
>
> And of course, the softlockups went away when the traffic was stopped.
>
> Now, my analysis could be wrong, and I have to say that we have heavily
> modified the ixgbe driver and kernel, so it's possible that this problem
> doesn't happen with the vanilla driver and kernel.
>
> Pekka

Hello Pekka,

You say you heavily modified the ixgbe driver. I was wondering if you are able to see the same issue with an unmodified driver? Based on your description, it sounds like there may be an issue with the interrupt moderation for the adapter.
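For reference, the poll exit path you are describing looks roughly like the
sketch below in the in-tree driver of this vintage. The work_done bookkeeping
and the shifted return are my guesses at your change, since I have not seen
your modified code. Note that ixgbe_set_itr() is only reached on the
napi_complete() path, which is exactly the placement you are asking about:

static int ixgbe_poll(struct napi_struct *napi, int budget)
{
	struct ixgbe_q_vector *q_vector =
		container_of(napi, struct ixgbe_q_vector, napi);
	struct ixgbe_adapter *adapter = q_vector->adapter;
	int work_done = 0;
	bool clean_complete = true;

	/* The Tx/Rx cleanup loops run here; assume they accumulate
	 * work_done and clear clean_complete when the budget is hit. */

	if (!clean_complete)
		return budget;	/* stay in polling mode */

	/* Exiting polling mode: this is the only point where the
	 * ITR gets recalculated and interrupts are re-enabled. */
	napi_complete(napi);
	if (adapter->rx_itr_setting & 1)
		ixgbe_set_itr(q_vector);
	if (!test_bit(__IXGBE_DOWN, &adapter->state))
		ixgbe_irq_enable_queues(adapter, ((u64)1 << q_vector->v_idx));

	return 0;		   /* stock driver: claims no work done */
	/* return work_done;	      true accounting; hurts throughput */
	/* return work_done >> 2;     the compromise you settled on	 */
}

Moving the ixgbe_set_itr() call above the clean_complete check would update
the ITR on every poll, at the cost of recalculating it far more often under
load. If your modified poll diverges from this shape, that would be useful
to know.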
You might try using our ethregs utility, available at e1000.sf.net, to dump the contents of the EITR registers for the adapter. That way you could at least verify what interrupt rate is actually being programmed into the adapters.

Thanks,
Alex
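P.S. To cross-check what EITR claims against what the system actually sees,
sampling /proc/interrupts once a second is usually enough. Below is a minimal
sketch; the "TxRx" default pattern assumes the stock ixgbe vector naming
("ethX-TxRx-N") in /proc/interrupts, so adjust it if your modified driver
names its vectors differently:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sum the per-CPU interrupt counts of every /proc/interrupts line
 * whose name matches "pattern". */
static unsigned long long count_for(const char *pattern)
{
	char line[4096];
	unsigned long long total = 0;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		char *p;

		if (!strstr(line, pattern))
			continue;
		p = strchr(line, ':');	/* skip the "NNN:" IRQ number */
		if (!p)
			continue;
		p++;
		/* Sum the per-CPU counter columns; stop at the first
		 * token that is not a number (the chip/action names). */
		for (;;) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;
			total += v;
			p = end;
		}
	}
	fclose(f);
	return total;
}

int main(int argc, char **argv)
{
	const char *pat = (argc > 1) ? argv[1] : "TxRx";
	unsigned long long before, after;

	before = count_for(pat);
	sleep(1);
	after = count_for(pat);
	printf("%s: %llu ints/sec\n", pat, after - before);
	return 0;
}

Run it while the test traffic is flowing and compare the printed rate with
the 300k+ ints/sec you observed, and with what the EITR registers say should
be the ceiling.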