Have you tried to use "git bisect" to find which patch fixes this issue?
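In case it helps, here is a minimal sketch of such a bisect, assuming a local clone of the VPP repository from https://gerrit.fd.io/r/vpp and that the two release tags named below exist in that clone; the tag names and the manual test step are illustrative, not a prescribed procedure:

    git clone https://gerrit.fd.io/r/vpp && cd vpp
    # we are hunting the commit that *fixes* the discards, so use custom terms
    git bisect start --term-old=broken --term-new=fixed
    git bisect broken v20.01        # release that still shows the drops
    git bisect fixed  v20.05        # release where the drops are gone
    # at each step git checks out a commit: build, rerun the 2-port
    # xconnect test, then mark the result
    git bisect broken               # drops still present at this commit
    git bisect fixed                # drops gone at this commit
    # when finished, git names the first "fixed" commit; then clean up
    git bisect reset

If the two release tags sit on separate stable branches rather than on a linear history, git bisect will still cope (it tests the merge base first), but bisecting on master between the corresponding branch points may converge faster.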
— Damjan

> On 4 Jun 2020, at 22:15, Bly, Mike via lists.fd.io <mbly=ciena....@lists.fd.io> wrote:
>
> Hello,
>
> We are observing a small percentage of frames being discarded in a simple
> 2-port L2 xconnect setup when a constant, same-frame, single (full-duplex)
> traffic profile is offered to the system. The frames are being discarded due
> to a failed VLAN classification even though all offered frames carry the same
> VLAN, i.e. send two sets of 1B of the same frame in two directions (A <-> B)
> and see x% discarded due to random VLAN classification issues.
>
> We did not see this issue in v18.07.1. At the start of the year we upgraded
> to 19.08 and started seeing this issue during scale testing. We have been
> trying to root-cause it and are at a point where we need some assistance.
> Moving from our integrated VPP solution to stock VPP built in an Ubuntu
> container, we have found the issue to be present in all releases from 19.08
> through 20.01, but it appears fixed in 20.05. We are not in a position where
> we can immediately upgrade to v20.05, so we need a solution for the v19.08
> code base, based on the key changes between v20.01 and v20.05. As such, we
> are looking for guidance on potentially relevant changes made between v20.01
> and v20.05.
>
> VPP configuration used:
> create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
> create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
> set interface state TenGigabitEthernet19/0/0 up
> set interface state TenGigabitEthernet19/0/0.100 up
> set interface state TenGigabitEthernet19/0/1 up
> set interface state TenGigabitEthernet19/0/1.100 up
> set interface l2 xconnect TenGigabitEthernet19/0/0.100 TenGigabitEthernet19/0/1.100
> set interface l2 xconnect TenGigabitEthernet19/0/1.100 TenGigabitEthernet19/0/0.100
>
> Traffic/setup:
> Two traffic generator connections to 10G physical NICs, each connection
> carrying a single traffic stream in which every frame is identical
> No NIC offloading in use, no RSS, a single worker thread separate from master
> 64B frames with fixed, cross-matching unicast L2 MAC addresses, a non-IP
> Ethertype, and an incrementing payload
> 1 billion frames full duplex, offered at the maximum “lossless” throughput,
> e.g. approx. 36% of 10 Gb/s for v20.05
> “lossless” means the maximum throughput at which “show interface” reports no
> “rx-miss” statistics
>
> Resulting statistics:
>
> Working v18.07.1 with proper/expected “error” statistics:
> vpp# show version
> vpp v18.07.1
>
> vpp# show errors
>      Count         Node              Reason
> 2000000000         l2-output         L2 output packets
> 2000000000         l2-input          L2 input packets
>
> Non-working v20.01 with unexpected “error” statistics:
> vpp# show version
> vpp v20.01-release
>
> vpp# show errors
>      Count         Node              Reason
> 1999974332         l2-output         L2 output packets
> 1999974332         l2-input          L2 input packets
>      25668         ethernet-input    l3 mac mismatch   <-- we should NOT be seeing these
>
> Working v20.05 with proper/expected “error” statistics:
> vpp# show version
> vpp v20.05-release
>
> vpp# show errors
>      Count         Node              Reason
> 2000000000         l2-output         L2 output packets
> 2000000000         l2-input          L2 input packets
>
> Issue found:
>
> In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() are
> sometimes failing to properly parse/store the “etype” and/or “tag” values.
> This later results in failed VLAN classification and the resulting “l3 mac
> mismatch” discards, because the unmatched frames fall back to the parent
> interface, which is in L3 mode.
>
> Here is a sample debug profiling of the discards.
> We implemented some down-n-dirty debug statistics as shown:
> bad_l3_frm_offset[256] shows which frame in the “n_left” sequence of a given
> batch was discarded
> bad_l3_batch_size[256] shows the size of each batch of frames being processed
> when a discard occurs
>
> (gdb) p bad_l3_frm_offset
> $1 = {1078, 1078, 1078, 1078, 0 <repeats 12 times>, 383, 383, 383, 383, 0 <repeats 236 times>}
>
> (gdb) p bad_l3_batch_size
> $2 = {0 <repeats 251 times>, 1424, 0, 0, 1356, 3064}
>
> I did manage to find the following thread, which seems possibly related to
> our issue: https://lists.fd.io/g/vpp-dev/message/15488
> Sharing just in case it is in fact relevant.
>
> Finally, do the VPP performance regression tests monitor/check the “show
> errors” output? We are looking to understand how this may have gone unnoticed
> between the v18.07.1 and v20.05 release efforts, given the simplicity of the
> configuration and test stimulus.
>
> -Mike
>
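For readers trying to reproduce the profiling above, a minimal sketch of how counters like the quoted bad_l3_frm_offset[] and bad_l3_batch_size[] arrays could be maintained; the two array names come from the mail itself, while the helper record_bad_l3() and its call site are assumptions for illustration, not the actual VPP instrumentation:

    #include <stdint.h>

    /* 256 slots, matching VPP's default maximum vector/frame size, so every
       possible batch size and every position within a batch has its own bin */
    #define BATCH_MAX 256

    static uint32_t bad_l3_frm_offset[BATCH_MAX]; /* position of the bad frame */
    static uint32_t bad_l3_batch_size[BATCH_MAX]; /* size of the batch it was in */

    /* hypothetical hook: call whenever a frame fails etype/VLAN
       classification inside the batched input loop */
    static inline void
    record_bad_l3 (uint32_t frame_index, uint32_t batch_size)
    {
      if (frame_index < BATCH_MAX)
        bad_l3_frm_offset[frame_index]++; /* where in the batch it happened */
      if (batch_size < BATCH_MAX)
        bad_l3_batch_size[batch_size]++;  /* how many frames that batch held */
    }

Read this way, the two gdb dumps say the mis-parsed frames land at a handful of fixed positions within near-full batches, which fits the poster's conclusion that the batched etype/tag parsing, rather than anything in the traffic itself, is at fault.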