On Tue, 2016-06-28 at 10:16 +0000, Luca Boccassi wrote:
> On Tue, 2016-05-24 at 14:06 +0800, Wenzhuo Lu wrote:
> > This patch set adds support for the mailbox interrupt on the VF,
> > so the VF can receive the messages for physical link down/up.
> > 
> > PS: This patch set is split from a previous patch set, *automatic
> > link recovery on ixgbe/igb VF*.
> > 
> > Wenzhuo Lu (2):
> >   ixgbe: VF supports mailbox interruption for PF link up/down
> >   igb: VF supports mailbox interruption for PF link up/down
> > 
> >  doc/guides/rel_notes/release_16_07.rst |   6 ++
> >  drivers/net/e1000/igb_ethdev.c         | 159 +++++++++++++++++++++++++++++++++
> >  drivers/net/ixgbe/ixgbe_ethdev.c       |  85 +++++++++++++++++-
> >  3 files changed, 247 insertions(+), 3 deletions(-)
> 
> Hi,
> 
> After backporting these patches to 16.04 or 2.2, we get a segmentation
> fault when using interface bonding and the interfaces go down. The
> scenario is:
> 
> - The host has an X540-AT2 10Gb card using the ixgbe driver; 2 VFs are
>   created and passed to the qemu/kvm guest VM via libvirt
> - The guest creates a bonded link using the 2 VFs
> - The host sets the VFs' state to down via ip link
> - The guest DPDK app segfaults
> 
> Backtrace:
> 
> #0  0x0000000000000000 in ?? ()
>         No symbol table info available.
> #1  0x00007ffff5003957 in bond_ethdev_slave_link_status_change_monitor (
>     cb_arg=0x727748 <rte_eth_devices@@DPDK_2.2+4168>)
>     at /usr/src/packages/BUILD/drivers/net/bonding/rte_eth_bond_pmd.c:1938
>         internals = 0x7fffeb8f5ec0
>         i = 0
>         polling_slave_found = 0
> #2  0x00007ffff68ea88c in eal_alarm_callback (hdl=<optimized out>,
>     arg=<optimized out>)
>     at /usr/src/packages/BUILD/lib/librte_eal/linuxapp/eal/eal_alarm.c:120
>         now = {tv_sec = 356, tv_nsec = 551082574}
>         ap = 0x7fffebc22380
> #3  0x00007ffff68e926d in eal_intr_process_interrupts (nfds=<optimized out>,
>     events=<optimized out>)
>     at /usr/src/packages/BUILD/lib/librte_eal/linuxapp/eal/eal_interrupts.c:752
>         bytes_read = <optimized out>
>         buf = {uio_intr_count = 1, vfio_intr_count = 1, timerfd_num = 1,
>           charbuf = "\001\000\000\000\000\000\000\000D\260~\363\377\177\000"}
>         n = 0
>         src = 0x7fffeb8d2640
>         cb = 0x7fffeb8d2d80
>         next = <optimized out>
>         active_cb = <optimized out>
> #4  eal_intr_handle_interrupts (totalfds=<optimized out>, pfd=12)
>     at /usr/src/packages/BUILD/lib/librte_eal/linuxapp/eal/eal_interrupts.c:800
>         events = 0x7fffefb1ba20
>         nfds = 1
> #5  eal_intr_thread_main (arg=<optimized out>)
>     at /usr/src/packages/BUILD/lib/librte_eal/linuxapp/eal/eal_interrupts.c:870
>         pipe_event = {events = 3, data = {ptr = 0x6, fd = 6, u32 = 6, u64 = 6}}
>         src = <optimized out>
>         numfds = <optimized out>
>         pfd = 12
>         ev = {events = 3, data = {ptr = 0xf7df02e500000005, fd = 5, u32 = 5,
>           u64 = 17860997829745442821}}
>         __func__ = "eal_intr_thread_main"
> #6  0x00007ffff37eb0a4 in start_thread (arg=0x7fffefb3c700)
>     at pthread_create.c:309
>         __res = <optimized out>
>         pd = 0x7fffefb3c700
>         now = <optimized out>
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737214924544,
>           2510814068564645188, 1, 140737354125408, 140737336548072,
>           140737214924544, -2510779380161489596, -2510806361332034236},
>           mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
>           data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         pagesize_m1 = <optimized out>
>         sp = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #7  0x00007ffff1b8287d in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> It dies in this bit:
> 
> 	/* Update slave link status */
> 	(*slave_ethdev->dev_ops->link_update)(slave_ethdev,
> 			internals->slaves[i].link_status_wait_to_complete);
> 
> (gdb) print rte_eth_devices[internals->slaves[i].port_id]
> $7 = {rx_pkt_burst = 0x0, tx_pkt_burst = 0x0, data = 0x0, driver = 0x0,
>   dev_ops = 0x0, {pci_dev = 0x0, vmbus_dev = 0x0},
>   link_intr_cbs = {tqh_first = 0x0, tqh_last = 0x0},
>   post_rx_burst_cbs = {0x0 <repeats 256 times>},
>   pre_tx_burst_cbs = {0x0 <repeats 256 times>},
>   attached = 0 '\000', dev_type = RTE_ETH_DEV_UNKNOWN}
> 
> I'm assuming it's not a simple matter of checking the dev_type or for
> NULLs. Do you have any suggestions/insight? I'm delving into the issue,
> but this is the first time I've looked at the bonding code, so any help
> or pointers would be greatly appreciated.
> 
> Note that I also tried backporting the additional reset patches that
> are currently under review on top of these, but it made no difference.
> I have not yet used the new reset API in our app, though.
> 
> Thanks!
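Following up on my own question: as the gdb output above shows, the
slave's entire rte_eth_devices[] entry is zeroed at that point, so the
call through dev_ops->link_update jumps to address 0. A defensive guard
along the lines of the sketch below would avoid the segfault, though as
suspected above it would only paper over the real problem (see below).
This is purely illustrative and not the upstream fix;
slave_link_update_guarded is a hypothetical helper, and it assumes
rte_eth_dev_is_valid_port() is available in the DPDK version in use:

	#include <rte_ethdev.h>

	/* Hypothetical helper, for illustration only: skip slaves whose
	 * ethdev entry has been released. In that case the whole struct,
	 * including dev_ops, is zeroed, so calling through
	 * dev_ops->link_update would jump to address 0. */
	static void
	slave_link_update_guarded(uint8_t port_id, int wait_to_complete)
	{
		struct rte_eth_dev *slave_ethdev = &rte_eth_devices[port_id];

		if (!rte_eth_dev_is_valid_port(port_id) ||
		    slave_ethdev->dev_ops == NULL ||
		    slave_ethdev->dev_ops->link_update == NULL)
			return;

		(*slave_ethdev->dev_ops->link_update)(slave_ethdev,
				wait_to_complete);
	}

Whether skipping a zeroed entry like this is sufficient, or whether the
bonding PMD needs to react properly to a slave being detached, is exactly
the open question above. In any case, it turned out not to matter here: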
I noticed that we were using 2 of the patches that had been self-NACKed,
and they were causing the crash. I'll switch to the new version that is
under review instead.

-- 
Kind regards,
Luca Boccassi