Re: Re: dumpcap coredump for 82599 NIC
Yes, I think you are right. After adding some debug output, I can confirm that it is probably an initialization issue in the ixgbe driver. Secondary processes should initialize some callback functions, but these appear to be missing. I made a small change that moves the ixgbe_init_shared_code(hw) call ahead of the secondary-process check. This changed the behavior somewhat, but a core dump still occurs. I suspect there are other issues, or that this modification is not appropriate.

[root@xc03-compute3 /]# /dpdk/app/dpdk-dumpcap -i :18:00.0
mlx5_net: Cannot attach mlx5 shared data
mlx5_net: Unable to init PMD global data: No such file or directory
mlx5_common: Failed to load driver mlx5_eth
EAL: Requested device :3b:00.0 cannot be used
mlx5_net: Cannot attach mlx5 shared data
mlx5_net: Unable to init PMD global data: No such file or directory
mlx5_common: Failed to load driver mlx5_eth
EAL: Requested device :3b:00.1 cannot be used
File: /tmp/dpdk-dumpcap_0_:18:00.0_20240314091910.pcapng
Capturing on ':18:00.0'
Packets captured: 2
Primary process is no longer active, exiting...
EAL: Fail to recv reply for request /var/run/dpdk/rte/mp_socket:mp_pdump
pdump_prepare_client_request(): client request for pdump enable/disable failed
Floating point exception (core dumped)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index d6cf00317e77b64f9822c155115f388ae62241eb..0bf885d7eaba3689fb9b98cdcaa6a928aa787985 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -1104,6 +1104,24 @@ eth_ixgbe_dev_init(struct rte_eth_dev *eth_dev, void *init_params __rte_unused)
 	eth_dev->tx_pkt_burst = &ixgbe_xmit_pkts;
 	eth_dev->tx_pkt_prepare = &ixgbe_prep_pkts;
 
+	/* Vendor and Device ID need to be set before init of shared code */
+	hw->device_id = pci_dev->id.device_id;
+	hw->vendor_id = pci_dev->id.vendor_id;
+	hw->hw_addr = (void *)pci_dev->mem_resource[0].addr;
+	hw->allow_unsupported_sfp = 1;
+
+	/* Initialize the shared code (base driver) */
+#ifdef RTE_LIBRTE_IXGBE_BYPASS
+	diag = ixgbe_bypass_init_shared_code(hw);
+#else
+	diag = ixgbe_init_shared_code(hw);
+#endif /* RTE_LIBRTE_IXGBE_BYPASS */
+
+	if (diag != IXGBE_SUCCESS) {
+		PMD_INIT_LOG(ERR, "Shared code init failed: %d", diag);
+		return -EIO;
+	}
+
 	/*
 	 * For secondary processes, we don't initialise any further as primary
 	 * has already done this work. Only check we don't need a different
@@ -1135,24 +1153,6 @@ eth_ixgbe_dev_init(struct rte_eth_dev *eth_dev, void *init_params __rte_unused)
 	rte_eth_copy_pci_info(eth_dev, pci_dev);
 	eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
 
-	/* Vendor and Device ID need to be set before init of shared code */
-	hw->device_id = pci_dev->id.device_id;
-	hw->vendor_id = pci_dev->id.vendor_id;
-	hw->hw_addr = (void *)pci_dev->mem_resource[0].addr;
-	hw->allow_unsupported_sfp = 1;
-
-	/* Initialize the shared code (base driver) */
-#ifdef RTE_LIBRTE_IXGBE_BYPASS
-	diag = ixgbe_bypass_init_shared_code(hw);
-#else
-	diag = ixgbe_init_shared_code(hw);
-#endif /* RTE_LIBRTE_IXGBE_BYPASS */
-
-	if (diag != IXGBE_SUCCESS) {
-		PMD_INIT_LOG(ERR, "Shared code init failed: %d", diag);
-		return -EIO;
-	}
-
 	if (hw->mac.ops.fw_recovery_mode && hw->mac.ops.fw_recovery_mode(hw)) {
 		PMD_INIT_LOG(ERR, "\nERROR: "
 			"Firmware recovery mode detected. Limiting functionality.\n"

Additionally, I'm using a debug build, but the printed call stack is still not clear, which is odd.

meson -Dc_args="-mno-avx512f" -Ddisable_drivers=net/ark,net/atlantic,net/avp,net/axgbe,net/pfe,net/netvsc -Dmax_numa_nodes=8 -Dmax_ethports=128 --buildtype=debug --optimization=0 build
ninja -C build install

junwan...@cestc.cn

From: Stephen Hemminger
Date: 2024-03-14 00:29
To: junwan...@cestc.cn
CC: dev
Subject: Re: dumpcap coredump for 82599 NIC

On Wed, 13 Mar 2024 10:00:17 +0800
"junwan...@cestc.cn" wrote:

> Hi, when I use dumpcap to capture packets on the 82599 network card, coredump
> appears.
> The network card bound to ovs-dpdk is 82599, but when I capture packets in
> other non-82599 network cards (mellanox CX5/C6 or Intel's E810), it is
> normal.
> the dpdk version I am using is 22.11.1, but I see that the call stack is
> strange, so I am asking you for help.
>
> I thought the new version of dpdk might solve it, so I upgraded the dpdk
> version to 23.11, but the problem is still t
Re: Re: dumpcap coredump for 82599 NIC
 					"Other link thread is running now!");
 				}
-			} else {
-				PMD_DRV_LOG(ERR,
-					"Other link thread is running now!");
 			}
+			return rte_eth_linkstatus_set(dev, &link);
 		}
-		return rte_eth_linkstatus_set(dev, &link);
 	}
 
 	link.link_status = RTE_ETH_LINK_UP;

junwan...@cestc.cn

From: junwan...@cestc.cn
Date: 2024-03-14 17:22
To: Stephen Hemminger
CC: dev
Subject: Re: Re: dumpcap coredump for 82599 NIC
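The point raised in this thread, that a secondary process needs its own callback functions, can be pictured with a small stand-alone sketch. It is only a toy model under stated assumptions: every toy_* name is hypothetical and none of this is DPDK or ixgbe code. It assumes the usual split in which plain data sits in memory shared by all processes, while function pointers are filled in separately by each process, since a pointer written by one binary is not necessarily valid in another.

```c
#include <stddef.h>

/* Hypothetical stand-in for state kept in memory shared by all processes. */
struct toy_shared_data {
	int link_up;	/* plain data: safe to share between processes */
};

/* Hypothetical callback table; each process must fill in its own copy. */
struct toy_ops {
	int (*check_link)(const struct toy_shared_data *data);
};

/* Process-local device handle: shared data plus local callbacks. */
struct toy_dev {
	struct toy_shared_data *data;
	struct toy_ops ops;
};

static int toy_check_link(const struct toy_shared_data *data)
{
	return data->link_up;
}

/* Every process, primary or secondary, runs this for itself. */
void toy_attach(struct toy_dev *dev, struct toy_shared_data *data)
{
	dev->data = data;
	dev->ops.check_link = toy_check_link;
}

/* Returns the link state, or -1 if this process never set its callbacks. */
int toy_link_get(const struct toy_dev *dev)
{
	if (dev->ops.check_link == NULL)
		return -1;
	return dev->ops.check_link(dev->data);
}
```

A handle that went through toy_attach() reports the shared link state, while one that skipped it gets -1 instead of jumping through an unset pointer. Whether the real crash follows this pattern is not confirmed by the logs above.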
Re: [PATCH] net/ixgbe: do not update link status in secondary process
I tried this modification and it works as well.

[root@compute3 /]# /dpdk/app/dpdk-dumpcap -i :18:00.0
File: /tmp/dpdk-dumpcap_0_:18:00.0_20240321043451.pcapng
Capturing on ':18:00.0'
Packets captured: 499 ^C
Packets received/dropped on interface ':18:00.0': 499/0 (100.0)

junwan...@cestc.cn

From: Stephen Hemminger
Date: 2024-03-21 01:33
To: dev
CC: junwang01; Stephen Hemminger
Subject: [PATCH] net/ixgbe: do not update link status in secondary process

The code to update link status is not safe in secondary process.
If called from secondary it will crash, example from dumpcap:
    ixgbe_dev_link_update_share()
    ixgbe_dev_link_update()
    rte_eth_link_get()

Signed-off-by: Stephen Hemminger
Reported-by: Jun Wang
---
Simpler version of earlier patch, and add explanation.

 drivers/net/ixgbe/ixgbe_ethdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index c61c52b2966b..86ccbdd78292 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -4293,6 +4293,9 @@ ixgbe_dev_link_update_share(struct rte_eth_dev *dev,
 	int wait = 1;
 	u32 esdp_reg;
 
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return -1;
+
 	memset(&link, 0, sizeof(link));
 	link.link_status = RTE_ETH_LINK_DOWN;
 	link.link_speed = RTE_ETH_SPEED_NUM_NONE;
-- 
2.43.0
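The guard added by the patch can be illustrated in isolation. The following is a minimal sketch, not the driver code: the toy_* names are hypothetical, and the process type is passed in as a parameter here, where the real function asks rte_eal_process_type(). It shows the early return happening before any hardware-facing work and before the link structure is touched.

```c
/* Hypothetical stand-ins for the process type and the link state. */
enum toy_proc_type { TOY_PROC_PRIMARY, TOY_PROC_SECONDARY };

struct toy_link {
	int status;
};

/* Counts how often the "hardware" path was taken. */
int toy_hw_reads;

/* Pretend register read; in this sketch the link is always up. */
static int toy_read_hw_link(void)
{
	toy_hw_reads++;
	return 1;
}

/*
 * Update the link state. Anything other than the primary process bails
 * out first, so driver state is neither read nor written there.
 */
int toy_link_update(enum toy_proc_type proc, struct toy_link *link)
{
	if (proc != TOY_PROC_PRIMARY)
		return -1;

	link->status = toy_read_hw_link();
	return 0;
}
```

With this shape a secondary caller gets -1 and its link structure is left exactly as it was, which matches the dumpcap run above completing without a crash once the guard is in place.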