On Sat, Jan 06, 2018 at 12:27:11AM +0300, Ozgur wrote:
>
> 06.01.2018, 00:20, "Tobias Hommel" <netdev-l...@genoetigt.de>:
> > Hi,
>
> Hi Tobias,
>
> > I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> > 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> > either.
> > Anyone has an idea what is happening here?
> >
> > The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> > a VPN gateway running strongswan. There are several hundreds of IPSec
> > roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> > HTTP server.
> > During my tests these roadwarriors connect to the gateway, sometimes download a
> > large file from the HTTP server, disconnect and after a random delay repeat
> > these steps.
> >
> > Some observations I made:
> > * SMP Affinity for IRQs of the NICs Rx/Tx queues
> >   (/proc/irq/$IRQ/smp_affinity)
> >   * all affinities set to default ff is broken
> >   * setting affinity for all queues of both interfaces to the same CPU seems to
> >     work fine (running stable for more than 1 day now)
> >   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
> >     2 is broken and seems to always trigger the bug on CPU 1
> > * the top 6 entries of the call trace are the same every time the system
> >   crashes, the other entries differ sometimes
> >
> > The bug is 100% reproducible on the Intel Atom machine from the log below and
> > also on a HP ProLiant Gen6 (also igb driver).
> > I can, of course, provide further information (CPU, NIC, kernel config, more
> > traces, etc.) if required.
> > If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> > (tg3).
> >
> > [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> > [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> > [ 7998.500759] PGD 0 P4D 0
> > [ 7998.503316] Oops: 0000 [#1] SMP PTI
> > [ 7998.506835] Modules linked in:
> > [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> > [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> > [ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
> > [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> > [ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
> > [ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
> > [ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
> > [ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
> > [ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
> > [ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
> > [ 7998.575395] FS: 0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
> > [ 7998.583550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
> > [ 7998.596482] Call Trace:
> > [ 7998.598959] __xfrm_route_forward+0xa4/0x110
> > [ 7998.603263] ip_forward+0x3e0/0x450
> > [ 7998.606778] ? ip_rcv_finish+0x61/0x3a0
> > [ 7998.610645] ip_rcv+0x2c4/0x390
> > [ 7998.613818] ? inet_del_offload+0x30/0x30
> > [ 7998.617857] __netif_receive_skb_core+0x751/0xb00
> > [ 7998.622562] ? skb_send_sock+0x40/0x40
> > [ 7998.626356] ? netif_receive_skb_internal+0x47/0xf0
> > [ 7998.631252] netif_receive_skb_internal+0x47/0xf0
> > [ 7998.635987] napi_gro_receive+0x70/0x90
> > [ 7998.639835] gro_cell_poll+0x53/0x90
> > [ 7998.643439] net_rx_action+0x1fc/0x310
> > [ 7998.647210] ? rebalance_domains+0x101/0x2b0
> > [ 7998.651500] __do_softirq+0xd5/0x1cf
> > [ 7998.655105] run_ksoftirqd+0x14/0x30
> > [ 7998.658712] smpboot_thread_fn+0xf9/0x150
> > [ 7998.662723] kthread+0xef/0x130
> > [ 7998.665893] ? sort_range+0x20/0x20
> > [ 7998.669404] ? kthread_park+0x60/0x60
> > [ 7998.673098] ret_from_fork+0x1f/0x30
> > [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> > [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
> > [ 7998.701479] CR2: 0000000000000020
> > [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> > [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >
>
> this error doesn't look like the last version kernel, I think this problem
> NIC driver.
> What is the use network ethernet card model?

This is what lspci shows for both NICs:

# lspci -nns 00:14.0
00:14.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I354 [8086:1f41] (rev 03)
I currently have no access to the other hardware where this is happening, but I
could get further information after the weekend.

> And which driver version you use?

# ethtool -i eth0 # same for eth1
driver: igb
version: 5.4.0-k
firmware-version: 0.0.0
expansion-rom-version:
bus-info: 0000:00:14.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

> > Best regards,
> >
> > Tobias Hommel
>
> Ozgur
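
For anyone who wants to reproduce the affinity setups listed in the observations above, here is a minimal Python sketch of how the values for /proc/irq/$IRQ/smp_affinity could be derived. It is only an illustration under a few assumptions: the helper names (cpu_mask, irqs_for_interface, plan_affinity) are made up for this example, the queue IRQs are assumed to show up in /proc/interrupts with a last column of the form "eth0" or "eth0-<suffix>", and CPU numbers are taken as zero-based, matching the smp_affinity bitmask. It only computes what would be written; it does not touch /proc itself.

```python
def cpu_mask(cpu):
    """Return the hex bitmask that selects one CPU (zero-based) in smp_affinity."""
    if cpu < 0:
        raise ValueError("CPU number must be non-negative")
    return format(1 << cpu, "x")


def irqs_for_interface(interrupts_text, ifname):
    """Collect numeric IRQs whose last column names the interface or one of its queues.

    Assumption: queue IRQs are named "<ifname>" or "<ifname>-<suffix>".
    """
    irqs = []
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header line or empty line
        irq = fields[0][:-1]
        if not irq.isdigit():
            continue  # per-CPU summary rows such as NMI: or LOC:
        name = fields[-1]
        if name == ifname or name.startswith(ifname + "-"):
            irqs.append(int(irq))
    return irqs


def plan_affinity(interrupts_text, ifname, cpu):
    """Return (path, mask) pairs describing the writes needed to pin ifname to cpu."""
    mask = cpu_mask(cpu)
    return [
        ("/proc/irq/%d/smp_affinity" % irq, mask)
        for irq in irqs_for_interface(interrupts_text, ifname)
    ]
```

Given the contents of /proc/interrupts as text, plan_affinity(text, "eth0", 1) + plan_affinity(text, "eth1", 2) would describe the split setup reported as broken, and using the same CPU number for both interfaces would describe the setup reported as stable. Each pair can then be applied as root with "echo MASK > PATH".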