Hi All,

I've been investigating a segmentation fault caused by the incorrect setup of 
TX queues for netdev-dpdk. It occurs in the following scenario.

Running OVS with DPDK on a system with 72 cores (hyper-threading enabled) and 
an Intel XL710 network card.

The default behavior in OVS when adding a DPDK physical port is to attempt to 
set up one tx queue for each core detected on the system, plus one more queue 
for non-pmd threads.

In this case 73 tx queues will be requested in total.
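
For reference, the requested count works out roughly as below. This is a 
paraphrased sketch rather than the exact dpif-netdev code, and the function 
name is made up for illustration.

    /* Paraphrased sketch, not the actual OVS code: one tx queue per
     * detected core for the pmd threads, plus one extra queue shared by
     * non-pmd threads (e.g. the main vswitchd thread). */
    static unsigned int
    requested_n_txq(unsigned int n_cores)
    {
        return n_cores + 1;    /* 72 cores -> 73 tx queues requested */
    }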

The standard behavior when initializing a DPDK port is to check the number of 
queues being requested against the max number of queues available for the 
device itself.

This is done in dpdk_eth_dev_init() with the following code segment
...
    rte_eth_dev_info_get(dev->port_id, &info);
    dev->up.n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
    dev->real_n_txq = MIN(info.max_tx_queues, dev->up.n_txq);

    diag = rte_eth_dev_configure(dev->port_id, dev->up.n_rxq, dev->real_n_txq,
                                 &port_conf);
...

The smaller of the two values is selected as the real number of tx queues that 
can be set up. This accommodates the situation where we have more cores on the 
system than there are tx queues on the network device in DPDK.

This has worked fine with the previous generation of Intel interfaces such as 
the Intel 82599. However, it will not work with the XL710.

In DPDK the XL710 reports a total of 316 tx queues. From the check above we 
would expect to be able to allocate 73 of these tx queues without issue, but 
the 316 available queues are subdivided between different queue types.

For a DPDK host application (in this case OVS), queues 1 to 64 inclusive can be 
used, while queues 65 to 96 are strictly reserved for SRIOV tx queue use.

The check for max_tx_queues above will identify the total number of queues 
available (316), compare it to the number of queues being requested (73), and 
select 73 as real_n_txq. But this is larger than the number of tx queues that 
are actually usable by OVS (64).
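
To illustrate the gap, here is a hand-written sketch of the kind of 
per-queue-type range check the i40e PMD performs at tx queue setup time. This 
is not the actual driver source; the struct and names are invented for the 
example.

    #include <errno.h>

    /* Invented types for illustration only, NOT the real i40e structures.
     * The point: max_tx_queues reported by rte_eth_dev_info_get() covers
     * all queue types on the device, but only the first block of queues
     * (64 here) belongs to the main VSI that a host application such as
     * OVS can use; the rest are reserved for SRIOV/VMDQ. */
    struct example_pf {
        unsigned int main_vsi_nb_qps;   /* e.g. 64 queues usable by the host */
        unsigned int total_nb_qps;      /* e.g. 316 queues in total */
    };

    static int
    example_tx_queue_setup(const struct example_pf *pf, unsigned int queue_idx)
    {
        if (queue_idx >= pf->main_vsi_nb_qps) {
            /* Queues beyond the main VSI belong to another queue type,
             * so setup fails here; this corresponds to the "queue_idx
             * out of range. VMDQ configured?" message in the log below. */
            return -EINVAL;
        }
        return 0;   /* queues within the main VSI are set up normally */
    }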

We can cause the switch to segfault by doing the following

Add a dpdk physical port

sudo $OVS_DIR/utilities/ovs-vsctl add-br br0 -- set Bridge br0 \
    datapath_type=netdev
sudo $OVS_DIR/utilities/ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 \
    type=dpdk

This will output the following warning
ovs-vsctl: Error detected while setting up 'dpdk0'.  See ovs-vswitchd log for 
details.

Looking at the log we see

PMD: i40e_dev_tx_queue_setup(): Using simple tx path
PMD: i40e_pf_get_vsi_by_qindex(): queue_idx out of range. VMDQ configured?
2015-07-15T01:22:48Z|00019|dpdk|ERR|eth dev tx queue setup error -5
2015-07-15T01:22:48Z|00020|dpif_netdev|ERR|dpdk0, cannot set multiq
2015-07-15T01:22:48Z|00021|dpif|WARN|netdev@ovs-netdev: failed to add dpdk0 as 
port: Resource temporarily unavailable

This is as expected. This warning will be reported in dpdk_eth_dev_init() by 
the following code segment when it attempts to initialize the 65th queue

    for (i = 0; i < dev->real_n_txq; i++) {
        diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE,
                                      dev->socket_id, NULL);
        if (diag) {
            VLOG_ERR("eth dev tx queue setup error %d",diag);
            return -diag;
        }
    }

Then add an internal port to the same bridge

sudo $OVS_DIR/utilities/ovs-vsctl add-port br0 testif1 -- set interface testif1 \
    type=internal

I was surprised to see that after adding the internal port, the DPDK port that 
failed previously is now added as well. Is this expected behavior?
Looking at the vswitchd log I can see that both the internal port and the DPDK 
port have port IDs now.

2015-07-15T01:23:19Z|00024|bridge|INFO|bridge br0: added interface testif1 on 
port 1
2015-07-15T01:23:19Z|00025|dpif_netdev|INFO|Created 1 pmd threads on numa node 0
2015-07-15T01:23:19Z|00001|dpif_netdev(pmd40)|INFO|Core 0 processing port 
'dpdk0'
2015-07-15T01:23:19Z|00002|dpif_netdev(pmd40)|INFO|Core 0 processing port 
'dpdk0'
2015-07-15T01:23:19Z|00026|bridge|INFO|bridge br0: added interface dpdk0 on 
port 2
2015-07-15T01:23:19Z|00027|bridge|INFO|bridge br0: using datapath ID 
00006805ca2d3cb8

If we assign an IP address to the internal port we will segfault the vswitch:

sudo ip addr add 192.168.1.1/24 dev testif1

This is caused by the internal interface sending an ICMPv6 neighbor 
solicitation message. This packet is copied from kernel space memory to DPDK 
memory in the netdev_dpdk_send__() function.
The issue is that the qid passed to the netdev_dpdk_send__() function is 72, 
and the packet will eventually be transmitted with rte_eth_tx_burst() using a 
tx qid of 72.
In DPDK, queue 72 on the XL710 is for SRIOV use only, so it is never 
initialized by the rte_eth_tx_queue_setup() loop above, and the switch 
segfaults when an attempt is made to access it.
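
To make the failure path concrete, the transmit boils down to a call of this 
shape (simplified, not the verbatim OVS code; rte_eth_tx_burst() is the only 
real DPDK call here):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Simplified view of the failing transmit, not the verbatim OVS code.
     * qid arrives as 72 (derived from the sending thread's core id), but
     * only queues 0..63 were set up with rte_eth_tx_queue_setup(), so the
     * PMD's per-queue data for queue 72 was never allocated and the burst
     * call dereferences an uninitialized queue, crashing the switch. */
    static uint16_t
    transmit_batch(uint8_t port_id, uint16_t qid,
                   struct rte_mbuf **pkts, uint16_t cnt)
    {
        return rte_eth_tx_burst(port_id, qid, pkts, cnt);   /* qid == 72 */
    }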


In terms of a solution to this, I would appreciate some feedback on what people 
think is the best approach.

Ideally, DPDK could extend the number of sequential queues available to host 
DPDK applications.
Previous-generation cards supported 128 tx queues usable by a host application, 
which is why this issue is not seen with them.
This, however, would not fix the immediate issue and would be more of a 
long-term solution. In the meantime it could be flagged in the documentation as 
a known issue/corner case that is not supported.

Alternatively, OVS could attempt to set up as many queues as possible on the 
DPDK device itself. If an error is detected, the appropriate fields, such as 
dev->real_n_txq, would have to be updated.
In this case we would set up 64 of the requested 73 queues and log a warning 
message to the user.
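
A rough, untested sketch of that idea in dpdk_eth_dev_init(), based on the 
existing loop shown above, could look like this (whether the device would also 
need to be reconfigured with rte_eth_dev_configure() for the smaller count is 
an open question):

    /* Untested sketch: instead of failing hard, stop at the first queue
     * the PMD rejects and shrink real_n_txq to the number of queues that
     * were actually set up. */
    for (i = 0; i < dev->real_n_txq; i++) {
        diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE,
                                      dev->socket_id, NULL);
        if (diag) {
            VLOG_WARN("eth dev tx queue %d setup error %d, "
                      "falling back to %d tx queues", i, diag, i);
            dev->real_n_txq = i;
            break;
        }
    }
    if (!dev->real_n_txq) {
        return -diag;   /* could not set up a single tx queue */
    }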
However, there may be issues with how the pmd threads map to the correct tx 
queue IDs. I've noticed that when netdev_dpdk_send__() is called the qid is 72, 
and this value comes from dp_execute_cb(), where the tx_qid is taken from the 
dp_netdev_pmd_thread.
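
If we went down this route, one way to handle that mapping might be to fold 
the thread's qid back into the range of queues that actually exist, something 
like the fragment below (sketch only; with fewer queues than threads, the 
shared queues would then also need locking around the transmit):

    /* Sketch only, inside netdev_dpdk_send__(): the caller's qid is based
     * on the core id, so it can be as high as 72 here, while only
     * real_n_txq (64) queues exist.  Folding it back means several threads
     * share a queue, so the shared queues would need a lock. */
    qid = qid % dev->real_n_txq;
    ...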

Any feedback would be appreciated.

Thanks
Ian



