Hi All,
I've been investigating a segmentation fault caused by the incorrect setup of
TX queues for netdev-dpdk. It occurs in the following scenario.
The scenario: running OVS with DPDK on a system with 72 logical cores
(hyper-threading enabled) and an Intel XL710 network card.
The default behavior in OVS when adding a DPDK physical port is to attempt to
set up one tx queue for each core detected on the system, plus one more queue
for non-pmd threads.
In this case 73 tx queues are requested in total.
The standard behavior when initializing a DPDK port is to check the number of
queues being requested against the max number of queues available for the
device itself.
This is done in dpdk_eth_dev_init() with the following code segment:
...
rte_eth_dev_info_get(dev->port_id, &info);
dev->up.n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
dev->real_n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
diag = rte_eth_dev_configure(dev->port_id, dev->up.n_rxq, dev->real_n_txq,
                             &port_conf);
...
The smaller of the two values is selected as the real number of tx queues to
set up. This accommodates a situation where a system has more cores than the
network device has tx queues available in DPDK.
This has worked fine with previous generations of Intel interfaces such as
the Intel 82599; however, it does not work with the XL710.
In DPDK the XL710 exposes a total of 316 tx queues. From the check above we
would expect to be able to allocate 73 of these tx queues without issue, but
the 316 available queues are subdivided between different queue types.
For a DPDK host application (in this case OVS) queues 1 - 64 inclusive can be
used; queues 65 to 96 are reserved strictly for SR-IOV tx use.
The check against max_tx_queues above therefore compares the total number of
queues available (316) with the number of queues requested (73) and selects
73 as real_n_txq. But this is not the correct number of tx queues usable by
OVS (64).
We can cause the switch to segfault by doing the following.
First, add a DPDK physical port:
sudo $OVS_DIR/utilities/ovs-vsctl add-br br0 -- set Bridge br0
datapath_type=netdev
sudo $OVS_DIR/utilities/ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0
type=dpdk
This outputs the following warning:
ovs-vsctl: Error detected while setting up 'dpdk0'. See ovs-vswitchd log for
details.
Looking at the log we see:
PMD: i40e_dev_tx_queue_setup(): Using simple tx path
PMD: i40e_pf_get_vsi_by_qindex(): queue_idx out of range. VMDQ configured?
2015-07-15T01:22:48Z|00019|dpdk|ERR|eth dev tx queue setup error -5
2015-07-15T01:22:48Z|00020|dpif_netdev|ERR|dpdk0, cannot set multiq
2015-07-15T01:22:48Z|00021|dpif|WARN|netdev@ovs-netdev: failed to add dpdk0 as
port: Resource temporarily unavailable
This is as expected. The warning is reported in dpdk_eth_dev_init() by the
following code segment when it attempts to initialize the 65th queue:
for (i = 0; i < dev->real_n_txq; i++) {
    diag = rte_eth_tx_queue_setup(dev->port_id, i, NIC_PORT_TX_Q_SIZE,
                                  dev->socket_id, NULL);
    if (diag) {
        VLOG_ERR("eth dev tx queue setup error %d", diag);
        return -diag;
    }
}
Then add an internal port to the same bridge:
sudo $OVS_DIR/utilities/ovs-vsctl add-port br0 testif1 -- set interface testif1
type=internal
I was surprised to see that after adding the internal port, the DPDK port
that failed previously is now added as well. Is this expected behavior?
Looking at the vswitch log I can see that both the internal port and the DPDK
port have port IDs now.
2015-07-15T01:23:19Z|00024|bridge|INFO|bridge br0: added interface testif1 on
port 1
2015-07-15T01:23:19Z|00025|dpif_netdev|INFO|Created 1 pmd threads on numa node 0
2015-07-15T01:23:19Z|00001|dpif_netdev(pmd40)|INFO|Core 0 processing port
'dpdk0'
2015-07-15T01:23:19Z|00002|dpif_netdev(pmd40)|INFO|Core 0 processing port
'dpdk0'
2015-07-15T01:23:19Z|00026|bridge|INFO|bridge br0: added interface dpdk0 on
port 2
2015-07-15T01:23:19Z|00027|bridge|INFO|bridge br0: using datapath ID
00006805ca2d3cb8
If we assign an IP address to the internal port, the vswitch segfaults:
sudo ip addr add 192.168.1.1/24 dev testif1
This is caused by the internal interface broadcasting an ICMPv6 neighbor
solicitation message. The packet is copied from kernel-space memory to DPDK
memory in the netdev_dpdk_send__() function.
The issue is that the qid passed to netdev_dpdk_send__() is 72, so the packet
is eventually transmitted with rte_eth_tx_burst() using a tx qid of 72.
In DPDK, queue 72 on the XL710 is reserved for SR-IOV use, so it was never
initialized by the rte_eth_tx_queue_setup() loop above; the switch segfaults
when an attempt is made to access it.
In terms of a solution to this I would appreciate some feedback on what people
think is the best approach.
Ideally DPDK could extend the number of sequential queues available to host
DPDK applications.
Previous-generation cards supported 128 tx queues usable by a host
application, which is why this issue is not seen with them.
This would not fix the immediate issue, however, and would be more of a
long-term solution. In the meantime it could be flagged in the documentation
as a known issue/corner case that is not supported.
Alternatively, OVS could attempt to set up as many queues as possible on the
DPDK device itself. If an error is detected, the appropriate fields such as
dev->real_n_txq would have to be updated.
In this case we would set up 64 of the requested 73 queues and log a warning
for the user. However, there may be issues with how the pmd threads map to
the correct tx queue IDs.
I've noticed that when netdev_dpdk_send__() is called the qid is 72; this
value comes from dp_execute_cb(), where the tx_qid is taken from the
dp_netdev_pmd_thread structure.
Any feedback would be appreciated.
Thanks
Ian
_______________________________________________
discuss mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/discuss