Never mind. I had not caught up on the newer messages in the thread when I replied.

-John
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of John Lo (loj) via lists.fd.io
Sent: Friday, June 05, 2020 4:40 PM
To: dmar...@me.com; m...@ciena.com
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards

Maybe it is this one? https://gerrit.fd.io/r/c/vpp/+/26961

-John

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Damjan Marion via lists.fd.io
Sent: Friday, June 05, 2020 11:51 AM
To: m...@ciena.com
Cc: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] #vpp #vnet apparent buffer prefetch issue - seeing "l3 mac mismatch" discards

Have you tried to use "git bisect" to find which patch fixes this issue?

— Damjan
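Since the hunt here is for the patch that fixes the issue rather than one that introduced it, the usual good/bad bisect roles are inverted. A minimal sketch of that workflow, assuming the reproduction can be scripted; the repro-test.sh name is hypothetical, and the script must exit 0 while the discards still occur and non-zero once they are gone:

    # v20.01 still shows the discards, v20.05 does not; rename the terms so
    # bisect converges on the first "fixed" commit instead of the first "bad" one.
    git bisect start --term-old=broken --term-new=fixed
    git bisect fixed v20.05
    git bisect broken v20.01
    # repro-test.sh (hypothetical): build, run the xconnect test, exit 0 if
    # "l3 mac mismatch" discards are seen, non-zero otherwise.
    git bisect run ./repro-test.sh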
On 4 Jun 2020, at 22:15, Bly, Mike via lists.fd.io <mbly=ciena....@lists.fd.io> wrote:

Hello,

We are observing a small percentage of frames being discarded in a simple two-port L2 xconnect setup when a constant, single-stream (full-duplex) traffic profile of identical frames is offered to the system. The frames are discarded due to failed VLAN classification even though every offered frame carries the same VLAN; i.e., send two sets of one billion identical frames in opposite directions (A <-> B) and see a small percentage discarded due to seemingly random VLAN classification failures.

We did not see this issue in v18.07.1. At the start of the year we upgraded to v19.08 and started seeing the issue during scale testing. We have been trying to root-cause it and are at a point where we need some assistance. Moving from our integrated VPP solution to stock VPP built in an Ubuntu container, we have found the issue present in all releases from v19.08 through v20.01, but it appears fixed in v20.05. We are not in a position to upgrade to v20.05 immediately, so we need a solution for the v19.08 code base. As such, we are looking for guidance on potentially relevant changes made between v20.01 and v20.05.

VPP configuration used:

create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
set interface state TenGigabitEthernet19/0/0 up
set interface state TenGigabitEthernet19/0/0.100 up
set interface state TenGigabitEthernet19/0/1 up
set interface state TenGigabitEthernet19/0/1.100 up
set interface l2 xconnect TenGigabitEthernet19/0/0.100 TenGigabitEthernet19/0/1.100
set interface l2 xconnect TenGigabitEthernet19/0/1.100 TenGigabitEthernet19/0/0.100

Traffic/setup:

- Two traffic generator connections to 10G physical NICs, each connection carrying a single traffic stream in which all frames are identical
- No NIC offloading in use, no RSS, a single worker thread separate from the main thread
- 64B frames with fixed/cross-matching unicast L2 MAC addresses, a non-IP Ethertype, and an incrementing payload
- 1 billion frames full duplex, offered at the maximum "lossless" throughput, e.g. approx. 36% of 10Gb/s for v20.05
  - "lossless" here means the maximum throughput at which "show interface" reports no "rx-miss" counts

Resulting statistics:

Working v18.07.1 with the expected "error" statistics:

vpp# show version
vpp v18.07.1
vpp# show errors
     Count    Node              Reason
2000000000    l2-output         L2 output packets
2000000000    l2-input          L2 input packets

Non-working v20.01 with unexpected "error" statistics:

vpp# show version
vpp v20.01-release
vpp# show errors
     Count    Node              Reason
1999974332    l2-output         L2 output packets
1999974332    l2-input          L2 input packets
     25668    ethernet-input    l3 mac mismatch   <-- we should NOT be seeing these

Working v20.05 with the expected "error" statistics:

vpp# show version
vpp v20.05-release
vpp# show errors
     Count    Node              Reason
2000000000    l2-output         L2 output packets
2000000000    l2-input          L2 input packets

Issue found: in eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes fail to correctly parse/store the "etype" and/or "tag" values. VLAN classification then fails for those frames, and because the parent interface is in L3 mode they are discarded as "l3 mac mismatch".

Here is a sample debug profiling of the discards, using some down-n-dirty debug statistics we added (a minimal sketch of the instrumentation follows at the end of this message):

- bad_l3_frm_offset[256] shows which frame in the "n_left" sequence of a given batch was discarded
- bad_l3_batch_size[256] shows the size of the batch of frames being processed when a discard occurred

(gdb) p bad_l3_frm_offset
$1 = {1078, 1078, 1078, 1078, 0 <repeats 12 times>, 383, 383, 383, 383, 0 <repeats 236 times>}
(gdb) p bad_l3_batch_size
$2 = {0 <repeats 251 times>, 1424, 0, 0, 1356, 3064}

I did manage to find the following thread, which may be related to our issue: https://lists.fd.io/g/vpp-dev/message/15488. Sharing just in case it is in fact relevant.

Finally, do the VPP performance regression tests monitor the "show errors" output? We would like to understand how this went unnoticed between the v18.07.1 and v20.05 release efforts, given the simplicity of the configuration and test stimulus.

-Mike
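For anyone wanting to reproduce the profiling above: Mike's actual patch is not shown in the thread, so the hook point and the exact bookkeeping below are assumptions, but debug counters of this kind can be wired in along these lines. With a single worker, plain static arrays are race-free:

    #include <stdint.h>

    /* Throwaway histograms, sized to match the gdb dumps above. */
    static uint32_t bad_l3_frm_offset[256];
    static uint32_t bad_l3_batch_size[256];

    /* Assumed hook: called on the ethernet-input path each time a frame
     * fails VLAN classification and is about to be counted as
     * "l3 mac mismatch".  n_left is the node's remaining-frames countdown,
     * n_packets the size of the batch currently being processed. */
    static inline void
    debug_count_bad_l3 (uint32_t n_left, uint32_t n_packets)
    {
      bad_l3_frm_offset[n_left % 256] += 1;    /* where in the batch it failed */
      bad_l3_batch_size[n_packets % 256] += 1; /* how big that batch was */
    }

Dumping the arrays from gdb as shown above then reveals whether discards cluster at particular batch positions or batch sizes; clustering at fixed offsets within a batch would be consistent with the buffer-prefetch theory in the subject line.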