> So, "deamon" and "server" may try using the same queue sometimes, correct?
> Synchronizing all access to the single queue should work in this case.

That is correct.
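
(For concreteness, the serialization I have in mind is roughly the sketch
below. The wrapper name is mine, and the lock would have to live in memory
that both processes can see, e.g. in an rte_memzone.)

/* Rough sketch only: serialize all use of the shared TX queue between
 * the daemon and the server.  txq_lock must point into memory visible
 * to both processes, e.g. an rte_memzone set up during init. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_spinlock.h>

static rte_spinlock_t *txq_lock;    /* initialized in shared memory */

static uint16_t
locked_tx_burst(uint16_t port_id, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t sent;

    rte_spinlock_lock(txq_lock);
    sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
    rte_spinlock_unlock(txq_lock);

    return sent;
}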

> BTW, rte_eth_tx_burst() returning >0 does not mean the packets have been sent.
> It only means they have been enqueued for sending.
> At some point the NIC will complete sending,
> only then the PMD can free the mbuf (or decrement its reference count).
> For most PMDs, this happens on a subsequent call to rte_eth_tx_burst().
> Which PMD and HW is it?

Here is the output of 'dpdk-devbind.py --status':

Network devices using DPDK-compatible driver
============================================
0000:65:00.1 'Ethernet Controller 10G X550T 1563' drv=vfio-pci unused=uio_pci_generic
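
As a side note on the ownership point quoted above, here is a minimal
sketch of the usual way to deal with a partial burst (illustrative only,
not my actual code; dropping vs. retrying the leftovers is application
policy):

/* Illustrative only: packets that rte_eth_tx_burst() did not accept are
 * still owned by the application and must be retried or freed; the PMD
 * frees the accepted ones only after the NIC completes transmission. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
send_burst_or_drop(uint16_t port_id, uint16_t queue_id,
                   struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

    while (nb_tx < nb_pkts)
        rte_pktmbuf_free(pkts[nb_tx++]);   /* or re-queue for a later retry */
}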


> Have you tried to print as many stats as possible when rte_eth_tx_burst()
> can't consume all packets (rte_eth_stats_get(), rte_eth_xstats_get())?

While setting this up, I discovered that this error only occurs when the
primary process on the other host exits (due to an error) or is not
running in the first place (is the NIC "down" in this case?). It happens
consistently when I launch the processes on only one of the two
machines. ***But*** counterintuitively, the daemon appears to "send"
packets successfully until the second process begins to run. In case it
is useful, I summarize the stats for this case below.

Note that I am also seeing another error. Sometimes, rather than tx
failing, my app detects incorrect/corrupted mbuf contents and exits
immediately. It appears that mbufs are being re-allocated when they
should not be. I thought I had finally solved this (see my earlier
threads), but the problem has returned with multi-core concurrency. It
is very possible that this error lies somewhere in my own library code,
as it looks like the accompanying non-DPDK structures are also being
corrupted (probably first).

For background, I maintain a hash table of header structs to track
individual mbufs. The sequence number in each header should match the
one carried in the corresponding mbuf's payload. This check fails after
a few hundred data messages have been exchanged successfully between
the hosts. The sequence number in the mbuf shows that it is in the
wrong hash bucket, and the sequence number in the header is a large,
corrupted value that is out of range for my sequence numbers (and does
not match the bucket either).
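
To make the failure mode concrete, the check is roughly of this shape
(the names and layout below are illustrative, not my actual structs):

/* Illustrative sketch, not the real library code: each tracked mbuf has a
 * header struct whose sequence number must match the sequence number
 * carried in the mbuf payload (and the bucket it is filed under). */
#include <rte_mbuf.h>

struct seq_header {
    uint32_t seq;               /* expected sequence number */
    struct rte_mbuf *mbuf;      /* the tracked mbuf */
};

static int
seq_check_ok(const struct seq_header *hdr, uint32_t bucket_seq)
{
    const uint32_t *payload_seq =
        rte_pktmbuf_mtod(hdr->mbuf, const uint32_t *);

    /* Observed failure: *payload_seq belongs to a different bucket and
     * hdr->seq is a huge out-of-range value, i.e. the header itself
     * looks corrupted as well. */
    return hdr->seq == bucket_seq && *payload_seq == bucket_seq;
}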

Back to the issue of failed tx bursts: here are the stats I observed
after a packet failed to send from the daemon (with the
primary+secondary processes launched on only one of the machines). The
failure occurred after the daemon had successfully "sent" hundreds of
handshake packets (to nowhere, presumably?), and it happened as soon as
the second process had finished initialization:

ipackets:0, opackets:0, ibytes:0, obytes:0, ierrors:0, oerrors:0
Got 146 xstats
Port:0, tx_q0_packets:1138
Port:0, tx_q0_bytes:125180
Port:0, mac_local_errors:2
Port:0, out_pkts_untagged:5
(All other stats had a value of 0 and are omitted).
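
(For reference, here is a trimmed sketch of the kind of code that produces
such a dump; error handling is omitted and port 0 is assumed.)

/* Sketch only: dump basic stats plus all non-zero extended stats. */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <rte_ethdev.h>

static void
dump_port_stats(uint16_t port_id)
{
    struct rte_eth_stats stats;
    int n, i;

    rte_eth_stats_get(port_id, &stats);
    printf("ipackets:%"PRIu64", opackets:%"PRIu64", ibytes:%"PRIu64
           ", obytes:%"PRIu64", ierrors:%"PRIu64", oerrors:%"PRIu64"\n",
           stats.ipackets, stats.opackets, stats.ibytes,
           stats.obytes, stats.ierrors, stats.oerrors);

    n = rte_eth_xstats_get(port_id, NULL, 0);      /* number of xstats */
    printf("Got %d xstats\n", n);

    struct rte_eth_xstat *xstats = calloc(n, sizeof(*xstats));
    struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));

    rte_eth_xstats_get(port_id, xstats, n);
    rte_eth_xstats_get_names(port_id, names, n);

    for (i = 0; i < n; i++)
        if (xstats[i].value != 0)                  /* skip zero counters */
            printf("Port:%u, %s:%"PRIu64"\n", port_id,
                   names[i].name, xstats[i].value);

    free(xstats);
    free(names);
}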

I will continue investigating the corruption bug on the (likely)
assumption that it is in my library code. In the meantime, please let me
know if I am using DPDK incorrectly. Thank you again!
-Alan
