Hey Joe,

Thank you so much for responding! After 10 days of trying to figure this
out I'm at a loss.

root@node-8:~# modinfo openvswitch
filename:       /lib/modules/3.13.0-106-generic/kernel/net/openvswitch/openvswitch.ko
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     94294A72258BA583D666607
depends:        libcrc32c,vxlan,gre
intree:         Y
vermagic:       3.13.0-106-generic SMP mod_unload modversions


Everything you've mentioned matches what I've understood so far, including
the line of code that's triggered. That is what led me to upgrade the kernel
to 3.13.0-106, because that release claims the CHECKSUM problems are fixed
and I thought they might be related. I guess not.

You're saying that skb_headlen is too short for the ethernet header. Do you
know what would cause this? This hardware configuration has been running
for 400+ days of uptime with no errors or problems. This suddenly started
to happen, and no matter how many times we reboot things it doesn't go
away. I assume, given your interpretation, that we should try restarting
the switches connected to the servers. Is there any way to log which packet
is causing this issue? Perhaps that would provide more insight.

As far as 4.4/newer kernel - I wish. I tried to go that far up but Ubuntu
wouldn't even boot. The best I could do is 3.13.0-106. I'll try to report
it over there as well.

Thanks again.

Uri


On Thu, Jan 5, 2017 at 10:16 PM, Joe Stringer <j...@ovn.org> wrote:

> On 5 January 2017 at 17:13, Uri Foox <u...@zoey.com> wrote:
> > Hi,
> >
> > Since about 10 days ago, every few hours, one of the 10 compute nodes in
> > our Openstack cluster kernel panics at the host level (captured through
> > netconsole). The kernel panic is identical across all 10 nodes and
> > happens at random times, but at least 1 node kernel panics every 3-12
> > hours. We have tried numerous things including upgrading the kernel
> > (Ubuntu 12.04 LTS running 3.13.0-106-generic), modifying sysctl,
> > restarting switches, restarting all openstack networking services,
> > changing BIOS settings etc... but no luck. We have not restarted the
> > control nodes or the Juniper switch that routes all inbound internet
> > traffic.
> >
> > Based on research we did around skbuff.h we found two kernel patches to
> > address a checksum failure and also some OVS discussions about it. I was
> > hoping that the kernel upgrade would solve it but it did not. I do not
> > know if Openstack will tolerate us upgrading OVS, and the fact that it
> > started completely randomly leads me to believe it's some other factor
> > that we are unaware of.
> >
> >
> >    - https://patchwork.ozlabs.org/patch/512625/
> >    - https://github.com/openvswitch/ovs/commit/51b7a90217369f6bbbf164ba471f54ec2817665e
> >    - https://patchwork.kernel.org/patch/7475491/
> >    - https://patchwork.ozlabs.org/patch/523632/
> >
> >
> > Here is one of them. If you have any ideas what we can do, please let me
> > know.
> >
> > Thanks,
> > Uri
> >
> >
> > Connection from 172.25.2.157 port 5404 [udp/*] accepted
> > [68240.441681] ------------[ cut here ]------------
> > [68240.496918] kernel BUG at /build/linux-lts-trusty-D60X6T/linux-lts-trusty-3.13.0/include/linux/skbuff.h:1486!
> > [68240.615520] invalid opcode: 0000 [#1] SMP
> > [68240.664751] Modules linked in: netconsole configfs xt_mac xt_physdev
> > xt_set ip_set_hash_ip ip_set nfnetlink vhost_net macvtap macvlan vhost veth
> > bridge stp llc ipt_REJECT xt_state xt_conntrack xt_multiport xt_CT
> > xt_comment iptable_raw xt_CHECKSUM xt_tcpudp iptable_mangle ipt_MASQUERADE
> > iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter
> > ip_tables ebtable_nat ebtables x_tables kvm_intel kvm nbd ib_iser rdma_cm
> > ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
> > scsi_transport_iscsi openvswitch vxlan ip_tunnel gre nfsd nfs_acl
> > auth_rpcgss nfs fscache lockd sunrpc dm_multipath gpio_ich dcdbas scsi_dh
> > mei_me shpchp sb_edac mei edac_core lpc_ich joydev acpi_power_meter
> > nf_conntrack_ipv6 mac_hid nf_defrag_ipv6 wmi nf_conntrack_ipv4 ipmi_si xfs
> > nf_conntrack nf_defrag_ipv4 lp parport igb btrfs hid_generic dca
> > i2c_algo_bit usbhid raid6_pq ptp ahci bnx2x hid libahci mdio megaraid_sas
> > pps_core xor libcrc32c
> > [68241.670838] CPU: 33 PID: 0 Comm: swapper/33 Not tainted
> > 3.13.0-106-generic #153~precise1-Ubuntu
> > [68241.774871] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.2.3
> > 07/09/2014
> > [68241.864406] task: ffff881028b94800 ti: ffff881028ba0000 task.ti:
> > ffff881028ba0000
> > [68241.953939] RIP: 0010:[<ffffffffa052b4fe>]  [<ffffffffa052b4fe>]
> > __skb_pull.part.7+0x4/0x6 [openvswitch]
> > [68242.067531] RSP: 0018:ffff88203fb03b08  EFLAGS: 00010297
> > [68242.131087] RAX: ffff88165c791966 RBX: ffff88202639e900 RCX:
> > ffff88165c791900
> > [68242.216458] RDX: 0000000000000210 RSI: 000000000000001a RDI:
> > 0000000000000214
> > [68242.301842] RBP: ffff88203fb03b08 R08: 0000000000000000 R09:
> > 0000000000000140
> > [68242.387207] R10: 000000000000000c R11: 0000000072221c0c R12:
> > ffff88203fb03b70
> > [68242.472576] R13: ffff88402794d0c0 R14: ffff88203fb03b70 R15:
> > ffff88302324e180
> > [68242.557945] FS:  0000000000000000(0000) GS:ffff88203fb00000(0000)
> > knlGS:0000000000000000
> > [68242.654780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [68242.723550] CR2: 00007f9c7466ab90 CR3: 000000302689e000 CR4:
> > 00000000000427e0
> > [68242.808931] Stack:
> > [68242.832981]  ffff88203fb03b38 ffffffffa0524e64 ffffffff8112d1e1
> > ffff88202639e900
> > [68242.921980]  ffffe8e000305800 ffff88402794d0c0 ffff88203fb03c28
> > ffffffffa0523a80
> > [68243.010963]  ffff88203fb13180 ffff88203fb03b90 ffffffff810a090b
> > 0000000100000000
> > [68243.099945] Call Trace:
> > [68243.129188]  <IRQ>
> > [68243.152204]  [<ffffffffa0524e64>] ovs_flow_extract+0x664/0x720
> > [openvswitch]
> > [68243.238893]  [<ffffffff8112d1e1>] ? tracing_record_cmdline+0x21/0x50
> > [68243.314912]  [<ffffffffa0523a80>]
> > ovs_dp_process_received_packet+0x60/0x130 [openvswitch]
> > [68243.412793]  [<ffffffff810a090b>] ? ttwu_do_wakeup+0xfb/0x110
> > [68243.481559]  [<ffffffffa0529e3a>] ovs_vport_receive+0x2a/0x30
> > [openvswitch]
> > [68243.564884]  [<ffffffffa052b374>] gre_rcv+0xa4/0xb8 [openvswitch]
> > [68243.637802]  [<ffffffffa03e2795>] gre_cisco_rcv+0x75/0xbc [gre]
> > [68243.708621]  [<ffffffffa03e22f5>] gre_rcv+0x65/0x90 [gre]
> > [68243.773214]  [<ffffffff816941d8>] ip_local_deliver_finish+0xa8/0x220
> > [68243.849244]  [<ffffffff816944db>] ip_local_deliver+0x4b/0x90
> > [68243.916951]  [<ffffffff81693ed1>] ip_rcv_finish+0x121/0x380
> > [68243.983627]  [<ffffffff816947a6>] ip_rcv+0x286/0x380
> > [68244.043023]  [<ffffffff8165b80a>] __netif_receive_skb_core+0x61a/0x760
> > [68244.121122]  [<ffffffff8165b971>] __netif_receive_skb+0x21/0x70
> > [68244.191942]  [<ffffffff8165c131>] process_backlog+0xb1/0x190
> > [68244.259642]  [<ffffffff8165ca09>] net_rx_action+0x139/0x280
> > [68244.326305]  [<ffffffff8107367d>] __do_softirq+0xed/0x360
> > [68244.390887]  [<ffffffff81073c8e>] irq_exit+0x11e/0x140
> > [68244.452358]  [<ffffffff8177d873>] do_IRQ+0x63/0xe0
> > [68244.509674]  [<ffffffff817728ad>] common_interrupt+0x6d/0x6d
> > [68244.577366]  <EOI>
> > [68244.600371]  [<ffffffff8109e353>] ? finish_task_switch+0x53/0x160
> > [68244.675630]  [<ffffffff8176e47e>] __schedule+0x38e/0x720
> > [68244.739175]  [<ffffffff8176e8c9>] schedule+0x29/0x70
> > [68244.798567]  [<ffffffff8176ebee>] schedule_preempt_disabled+0xe/0x10
> > [68244.874582]  [<ffffffff810c7f95>] cpu_idle_loop+0x255/0x2a0
> > [68244.941246]  [<ffffffff810ddba2>] ?
> > clockevents_register_device+0xe2/0x140
> > [68245.023512]  [<ffffffff810c804b>] cpu_startup_entry+0x6b/0x70
> > [68245.092269]  [<ffffffff81045bbd>] start_secondary+0xcd/0xd0
> > [68245.158929] Code: c7 e8 cb 52 a0 89 45 f8 e8 50 2e b4 e0 c6 05 15 2e 00
> > 00 01 8b 45 f8 eb 0c 8b 16 48 8b 38 31 f6 e8 38 b9 15 e1 c9 c3 55 48 89 e5
> > <0f> 0b 8b 57 68 55 31 c0 48 89 e5 39 f2 72 13 2b 57 6c 29 d6 e8
> > [68245.392237] RIP  [<ffffffffa052b4fe>] __skb_pull.part.7+0x4/0x6
> > [openvswitch]
> > [68245.477737]  RSP <ffff88203fb03b08>
> > [68245.520082] ---[ end trace 383bac9f3e676970 ]---
> > [68245.583665] Kernel panic - not syncing: Fatal exception in interrupt
> > [68245.661910] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation
> > range: 0xffffffff80000000-0xffffffff9fffffff)
> > [68245.792179] ------------[ cut here ]------------
> > [68245.847479] WARNING: CPU: 33 PID: 0 at /build/linux-lts-trusty-D60X6T/linux-lts-trusty-3.13.0/arch/x86/kernel/smp.c:124
> > native_smp_send_reschedule+0x5e/0x60()
> > [68246.017113] Modules linked in: netconsole configfs xt_mac xt_physdev
> > xt_set ip_set_hash_ip ip_set nfnetlink vhost_net macvtap macvlan vhost veth
> > bridge stp llc ipt_REJECT xt_state xt_conntrack xt_multiport xt_CT
> > xt_comment iptable_raw xt_CHECKSUM xt_tcpudp iptable_mangle ipt_MASQUERADE
> > iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter
> > ip_tables ebtable_nat ebtables x_tables kvm_intel kvm nbd ib_iser rdma_cm
> > ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
> > scsi_transport_iscsi openvswitch vxlan ip_tunnel gre nfsd nfs_acl
> > auth_rpcgss nfs fscache lockd sunrpc dm_multipath gpio_ich dcdbas scsi_dh
> > mei_me shpchp sb_edac mei edac_core lpc_ich joydev acpi_power_meter
> > nf_conntrack_ipv6 mac_hid nf_defrag_ipv6 wmi nf_conntrack_ipv4 ipmi_si xfs
> > nf_conntrack nf_defrag_ipv4 lp parport igb btrfs hid_generic dca
> > i2c_algo_bit usbhid raid6_pq ptp ahci bnx2x hid libahci mdio megaraid_sas
> > pps_core xor libcrc32c
> > [68247.030510] CPU: 33 PID: 0 Comm: swapper/33 Tainted: G      D
> > 3.13.0-106-generic #153~precise1-Ubuntu
> > [68247.147123] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.2.3
> > 07/09/2014
> > [68247.236723]  0000000000000000 ffff88203fb03530 ffffffff81765c15
> > 0000000000000000
> > [68247.326029]  000000000000007c ffff88203fb03570 ffffffff8106e2fc
> > 239c8806e0dc2058
> > [68247.415346]  0000000000000000 0000000000000021 ffff88103fa13180
> > ffff88203fb13180
> > [68247.504655] Call Trace:
> > [68247.533961]  <IRQ>  [<ffffffff81765c15>] dump_stack+0x64/0x82
> > [68247.603052]  [<ffffffff8106e2fc>] warn_slowpath_common+0x8c/0xc0
> > [68247.674969]  [<ffffffff8106e34a>] warn_slowpath_null+0x1a/0x20
> > [68247.744808]  [<ffffffff8104458e>] native_smp_send_reschedule+0x5e/0x60
> > [68247.822965]  [<ffffffff810b0c3e>] trigger_load_balance+0x17e/0x1f0
> > [68247.896964]  [<ffffffff810a1e9f>] scheduler_tick+0xaf/0xf0
> > [68247.962645]  [<ffffffff8107d871>] update_process_times+0x61/0x80
> > [68248.034566]  [<ffffffff810e0293>] tick_sched_handle.isra.12+0x33/0x70
> > [68248.111675]  [<ffffffff810e03bc>] tick_sched_timer+0x4c/0x80
> > [68248.179432]  [<ffffffff810967a7>] __run_hrtimer+0x77/0x270
> > [68248.245113]  [<ffffffff810d87a2>] ? ktime_get_update_offsets+0x52/0xf0
> > [68248.323263]  [<ffffffff810e0370>] ? tick_nohz_handler+0xa0/0xa0
> > [68248.394139]  [<ffffffff81097147>] hrtimer_interrupt+0x107/0x260
> > [68248.465015]  [<ffffffff81446875>] ? erst_write+0x135/0x150
> > [68248.530692]  [<ffffffff81446b40>] ? erst_writer+0x2b0/0x380
> > [68248.597413]  [<ffffffff8104752b>] local_apic_timer_interrupt+0x3b/0x60
> > [68248.675571]  [<ffffffff8177d933>] smp_apic_timer_interrupt+0x43/0x60
> > [68248.751650]  [<ffffffff8177c29d>] apic_timer_interrupt+0x6d/0x80
> > [68248.823564]  [<ffffffff817584b0>] ? panic+0x19e/0x1e1
> > [68248.884046]  [<ffffffff81758412>] ? panic+0x100/0x1e1
> > [68248.944529]  [<ffffffff81773a5a>] oops_end+0x14a/0x160
> > [68249.006059]  [<ffffffff810196d8>] die+0x58/0x90
> > [68249.060303]  [<ffffffff8177315b>] do_trap+0xcb/0x170
> > [68249.119748]  [<ffffffff810166ec>] do_invalid_op+0xac/0x110
> > [68249.185436]  [<ffffffffa052b4fe>] ? __skb_pull.part.7+0x4/0x6
> > [openvswitch]
> > [68249.268785]  [<ffffffff8165dc62>] ? __dev_queue_xmit+0x92/0x500
> > [68249.339663]  [<ffffffff8177cd5e>] invalid_op+0x1e/0x30
> > [68249.401185]  [<ffffffffa052b4fe>] ? __skb_pull.part.7+0x4/0x6
> > [openvswitch]
> > [68249.484530]  [<ffffffffa0524e64>] ovs_flow_extract+0x664/0x720
> > [openvswitch]
> > [68249.568918]  [<ffffffff8112d1e1>] ? tracing_record_cmdline+0x21/0x50
> > [68249.644994]  [<ffffffffa0523a80>]
> > ovs_dp_process_received_packet+0x60/0x130 [openvswitch]
> > [68249.742909]  [<ffffffff810a090b>] ? ttwu_do_wakeup+0xfb/0x110
> > [68249.811710]  [<ffffffffa0529e3a>] ovs_vport_receive+0x2a/0x30
> > [openvswitch]
> > [68249.895055]  [<ffffffffa052b374>] gre_rcv+0xa4/0xb8 [openvswitch]
> > [68249.968009]  [<ffffffffa03e2795>] gre_cisco_rcv+0x75/0xbc [gre]
> > [68250.038879]  [<ffffffffa03e22f5>] gre_rcv+0x65/0x90 [gre]
> > [68250.103522]  [<ffffffff816941d8>] ip_local_deliver_finish+0xa8/0x220
> > [68250.179595]  [<ffffffff816944db>] ip_local_deliver+0x4b/0x90
> > [68250.247354]  [<ffffffff81693ed1>] ip_rcv_finish+0x121/0x380
> > [68250.314069]  [<ffffffff816947a6>] ip_rcv+0x286/0x380
> > [68250.373512]  [<ffffffff8165b80a>] __netif_receive_skb_core+0x61a/0x760
> > [68250.451662]  [<ffffffff8165b971>] __netif_receive_skb+0x21/0x70
> > [68250.522533]  [<ffffffff8165c131>] process_backlog+0xb1/0x190
> > [68250.590293]  [<ffffffff8165ca09>] net_rx_action+0x139/0x280
> > [68250.657019]  [<ffffffff8107367d>] __do_softirq+0xed/0x360
> > [68250.721659]  [<ffffffff81073c8e>] irq_exit+0x11e/0x140
> > [68250.783185]  [<ffffffff8177d873>] do_IRQ+0x63/0xe0
> > [68250.840557]  [<ffffffff817728ad>] common_interrupt+0x6d/0x6d
> > [68250.908314]  <EOI>  [<ffffffff8109e353>] ? finish_task_switch+0x53/0x160
> > [68250.988818]  [<ffffffff8176e47e>] __schedule+0x38e/0x720
> > [68251.052421]  [<ffffffff8176e8c9>] schedule+0x29/0x70
> > [68251.111863]  [<ffffffff8176ebee>] schedule_preempt_disabled+0xe/0x10
> > [68251.187933]  [<ffffffff810c7f95>] cpu_idle_loop+0x255/0x2a0
> > [68251.254651]  [<ffffffff810ddba2>] ?
> > clockevents_register_device+0xe2/0x140
> > [68251.336958]  [<ffffffff810c804b>] cpu_startup_entry+0x6b/0x70
> > [68251.405751]  [<ffffffff81045bbd>] start_secondary+0xcd/0xd0
> > [68251.472464] ---[ end trace 383bac9f3e676971 ]---
> > _______________________________________________
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> Is this using openvswitch-dkms? ("modinfo openvswitch") - my guess is
> no, but it may have different behaviour in this regard.
>
> I see ovs_vport_receive() from GRE tunnel executing through to
> ovs_flow_extract(), which makes a call to __skb_pull(), which likely
> complains on this line:
>
> BUG_ON(skb->len < skb->data_len);
>
> Probably when ovs_flow_extract() attempts to pull the ethernet
> addresses out of the packet, the skb_headlen is too short for the
> ethernet header. There are multiple explicit comments in and around
> this function that state that the callers are responsible for
> ensuring it holds at least the ethernet header, so something further
> up the stack is doing something wrong.
>
> For what it's worth, 3.13 is still pretty old in terms of Linux
> tunnelling. If you are able to try this with something a bit more
> recent (e.g., the current Ubuntu LTS 4.4 kernel), that may provide more
> insight.
>
> It may also be worth reporting this issue on launchpad against your
> 3.13 kernel version.
>



-- 
Uri Foox | Zoey | Founder
http://www.zoey.com