Hello mlx5e/bridge/bonding maintainers,

Technical Context
------------------------

In our Kubernetes production environment, we have enabled SR-IOV on
ConnectX-6 DX NICs to allow multiple containers to share a single
Physical Function (PF). In other words, each container is assigned its
own Virtual Function (VF).
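
For reference, the VFs are created and the e-switches are put into switchdev mode roughly along these lines (the VF count below is illustrative; the PCI address is the one shown in the ethtool output at the end of this mail):

  # Create the VFs on the PF
  $ echo 8 > /sys/class/net/eth0/device/sriov_numvfs

  # Move the embedded switch of the PF to switchdev mode
  $ devlink dev eswitch set pci/0000:21:00.0 mode switchdev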

We have two CX6-DX NICs on a single host, configured in a bonded
interface. The VFs and the bonding device are connected via a Linux
bridge. Both CX6-DX NICs are operating in switchdev mode, and the
Linux bridge is offloaded. The topology is as follows:

          Container0
              |
        VF representor
              |
   Linux bridge (offloaded)
              |
            bond
              |
       +------+------+
       |             |
   eth0 (PF)     eth1 (PF)


Both eth0 and eth1 are in switchdev mode:

  $ cat /sys/class/net/eth0/compat/devlink/mode
   switchdev
  $ cat /sys/class/net/eth1/compat/devlink/mode
   switchdev

This setup follows the guidance provided in the official NVIDIA
documentation [0].
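
Reduced to plain iproute2 commands, the bridge/bond part of the setup looks roughly like this (a simplified sketch; interface names match the diagram above, LACP and bridge options omitted):

  $ ip link add bond0 type bond mode 802.3ad
  $ ip link set eth0 master bond0
  $ ip link set eth1 master bond0
  $ ip link add bridge0 type bridge
  $ ip link set bond0 master bridge0
  $ ip link set eth0_5 master bridge0   # VF representor used by Container0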

The bond is configured in 802.3ad (LACP) mode. We have also enabled an
ARP broadcast feature on the bonding device, which is functionally
similar to the patchset referenced here:

   https://lore.kernel.org/all/[email protected]/

Issue Description
------------------------

When pinging another host from Container0, we observe that only one
ARP request actually leaves the host, even though both eth0 and eth1
appear to generate ARP requests at the PF level and the doorbell is
triggered for both.

tcpdump output:

  $ tcpdump -i any -n arp host 10.247.209.128
  17:50:16.309758 eth1_5 B   ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309777 eth0_5 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309784 bond0 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309786 eth0  Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309788 eth1  Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309758 bridge0 B   ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38

No ARP reply is captured by tcpdump.

However, the ARP broadcast itself appears to be working: two ARP
requests (one per PF) trigger the doorbell, as traced by the following
bpftrace script:

kprobe:mlx5e_sq_xmit_wqe
{
        $skb = (struct sk_buff *)arg1;

        // ETH_P_ARP (0x0806) as seen in skb->protocol on a little-endian host
        if ($skb->protocol != 0x608) {
                return;
        }

        // The ARP payload follows the fixed-size arphdr:
        // sender MAC (6), sender IP (4), target MAC (6), target IP (4)
        $arph = (struct arphdr *)($skb->head + $skb->network_header);
        $arp_data = $skb->head + $skb->network_header + sizeof(struct arphdr);
        $smac = $arp_data;
        $sip = $arp_data + 6;
        $tip = $arp_data + 16;

        // Only trace ARP packets targeting 10.247.209.128
        if (!($tip[0] == 10 && $tip[1] == 247 && $tip[2] == 209 && $tip[3] == 128)) {
                return;
        }

        $dev = $skb->dev;
        $dev_name = $skb->dev->name;

        printf("Device:%s(%02x:%02x:%02x:%02x:%02x:%02x) [%d] Sender IP:%d.%d.%d.%d (%02x:%02x:%02x:%02x:%02x:%02x), ",
                $dev_name, $dev->dev_addr[0], $dev->dev_addr[1],
                $dev->dev_addr[2], $dev->dev_addr[3],
                $dev->dev_addr[4], $dev->dev_addr[5], $dev->ifindex,
                $sip[0], $sip[1], $sip[2], $sip[3],
                $smac[0], $smac[1], $smac[2], $smac[3], $smac[4], $smac[5]);

        // ar_op is big-endian; swap to host order for printing
        printf("Target IP:%d.%d.%d.%d, OP:%d\n",
                $tip[0], $tip[1], $tip[2], $tip[3],
                (($arph->ar_op & 0xFF00) >> 8) | (($arph->ar_op & 0x00FF) << 8));
}
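
The script is run as a plain bpftrace program (the file name below is ours). On a kernel with BTF the struct definitions resolve automatically; otherwise the usual linux/skbuff.h and linux/if_arp.h includes have to be added at the top of the script:

  $ bpftrace trace_arp_xmit.bt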

bpftrace output:

  Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
  Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
  Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

And the detailed call stacks, also captured with bpftrace:

Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        maybe_deliver+188
        br_flood+118
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        bond_dev_queue_xmit+43
        __bond_start_xmit+590
        bond_start_xmit+70
        dev_hard_start_xmit+142
        __dev_queue_xmit+1260
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        br_flood+266
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        bond_dev_queue_xmit+43
        __bond_start_xmit+590
        bond_start_xmit+70
        dev_hard_start_xmit+142
        __dev_queue_xmit+1260
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        br_flood+266
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Additionally, traffic captured on the uplink switch confirms that the
ARP request leaves the host via only one NIC.

This suggests a potential issue with the bridge offloading. However,
we have no way to trace the hardware-level behavior directly. Notably,
`ethtool -S ethX` shows no packet drops or errors.
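
The only software-side visibility into the offload state we have found so far is what iproute2 and devlink report, for example checking that the bridge FDB entries carry the offload flag and that the e-switch is still in switchdev mode (commands shown as examples, device names as above):

  $ bridge -d fdb show br bridge0 | grep offload
  $ devlink dev eswitch show pci/0000:21:00.0   # and likewise for the second PF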

Questions
--------------

1. How can we further trace hardware behavior to diagnose this issue?
2. Is this a known limitation of bridge offloading in this configuration?
3. Are there any recommended solutions or workarounds?

This issue is reliably reproducible. We are willing to recompile the
mlx5 driver if additional information is needed.

Current driver version:

  $ ethtool -i eth0
  driver: mlx5_core
  version: 24.10-1.1.4
  firmware-version: 22.43.2026 (MT_0000000359)
  expansion-rom-version:
  bus-info: 0000:21:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: yes

[0] https://docs.nvidia.com/networking/display/public/sol/technology+preview+of+kubernetes+cluster+deployment+with+accelerated+bridge+cni+and+nvidia+ethernet+networking#src-119742530_TechnologyPreviewofKubernetesClusterDeploymentwithAcceleratedBridgeCNIandNVIDIAEthernetNetworking-Application

-- 
Regards
Yafang
