Have you tried to use "git bisect" to find which patch fixes this issue?

— 
Damjan


> On 4 Jun 2020, at 22:15, Bly, Mike via lists.fd.io 
> <mbly=ciena....@lists.fd.io> wrote:
> 
> Hello,
>
> We are observing a small percentage of frames being discarded in a simple 
> 2-port L2 xconnect setup when a constant, single-frame, full-duplex traffic 
> profile is offered to the system. The frames are discarded due to a failed 
> VLAN classification, even though every frame offered carries the same VLAN, 
> i.e. we send 1 billion copies of the same frame in each direction (A <-> B) 
> and see x% discarded due to seemingly random VLAN classification failures.
>
> We did not see this issue in v18.07.1. At the start of the year we upgraded 
> to 19.08 and started seeing this issue during scale testing. We have been 
> trying to root-cause it and are at a point where we need some assistance. 
> Moving from our integrated VPP solution to stock VPP built in an Ubuntu 
> container, we have found the issue to be present in all releases from 19.08 
> through 20.01, but it appears to be fixed in 20.05. We are not in a position 
> to upgrade to v20.05 immediately, so we need a fix for the v19.08 code base 
> derived from the relevant v20.01 -> v20.05 changes. As such, we are looking 
> for guidance on which changes made between v20.01 and v20.05 are likely 
> relevant.
>
> VPP configuration used:
> create sub-interfaces TenGigabitEthernet19/0/0 100 dot1q 100
> create sub-interfaces TenGigabitEthernet19/0/1 100 dot1q 100
> set interface state TenGigabitEthernet19/0/0 up
> set interface state TenGigabitEthernet19/0/0.100 up
> set interface state TenGigabitEthernet19/0/1 up
> set interface state TenGigabitEthernet19/0/1.100 up
> set interface l2 xconnect TenGigabitEthernet19/0/0.100 
> TenGigabitEthernet19/0/1.100
> set interface l2 xconnect TenGigabitEthernet19/0/1.100 
> TenGigabitEthernet19/0/0.100
>
> Traffic/setup:
> - Two traffic generator connections to 10G physical NICs, each connection 
>   carrying a single traffic stream in which all frames are identical
> - No NIC offloading in use, no RSS, single worker thread separate from the 
>   master thread
> - 64B frames with fixed, cross-matching unicast L2 MAC addresses, a non-IP 
>   Etype, and an incrementing payload
> - 1 billion frames full duplex, offered at the maximum “lossless” throughput, 
>   e.g. approx. 36% of 10Gb/s for v20.05 (rough pps math below)
> - “lossless” here means the maximum rate at which “show interface” reports no 
>   “rx-miss” counters
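>
> The 36% figure works out roughly as follows; this is just our back-of-the-
> envelope arithmetic assuming standard Ethernet framing overhead (preamble/SFD 
> plus inter-frame gap), not something reported by VPP itself:
>
> /* rough pps math for 64B frames on a 10GbE link */
> #include <stdio.h>
>
> int main (void)
> {
>   double link_bps = 10e9;             /* 10 Gb/s */
>   double wire_bytes = 64 + 8 + 12;    /* frame + preamble/SFD + inter-frame gap */
>   double line_rate_pps = link_bps / (wire_bytes * 8); /* ~14.88 Mpps */
>   double offered_pps = line_rate_pps * 0.36;          /* ~5.36 Mpps per direction */
>
>   printf ("line rate %.2f Mpps, offered %.2f Mpps\n",
>           line_rate_pps / 1e6, offered_pps / 1e6);
>   return 0;
> }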
>
> Resulting statistics:
>
> Working v18.07.1 with proper/expected “error” statistics:
> vpp# show version
> vpp v18.07.1
>
> vpp# show errors
>    Count                    Node                  Reason
> 2000000000                l2-output               L2 output packets
> 2000000000                l2-input                L2 input packets
>
> Non-Working v20.01 with unexpected “error” statistics:
> vpp# show version   
> vpp v20.01-release
>
> vpp# show errors                                     
>    Count                    Node                  Reason
> 1999974332                l2-output               L2 output packets
> 1999974332                l2-input                L2 input packets
>      25668             ethernet-input             l3 mac mismatch         <-- 
> we should NOT be seeing these
>
> Working v20.05 with proper/expected “error” statistics:
> vpp# show version
> vpp v20.05-release
>
> vpp# show errors
>    Count                    Node                  Reason
> 2000000000                l2-output               L2 output packets
> 2000000000                l2-input                L2 input packets
>
> Issue found:
>
> In eth_input_process_frame(), calls to eth_input_get_etype_and_tags() sometimes 
> fail to properly parse/store the “etype” and/or “tag” values. This later causes 
> VLAN classification to fail, and the frames are then discarded as “l3 mac 
> mismatch” because the parent interface is in L3 mode.
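>
> For context, here is a much-simplified scalar sketch of the parse involved. 
> This is illustrative only; the real eth_input_get_etype_and_tags() processes 
> frames in batches and is considerably more involved, and the parsed_hdr_t 
> struct and parse_etype_and_tag() helper below are made-up names, not VPP code:
>
> #include <stdint.h>
> #include <string.h>
> #include <arpa/inet.h>
>
> typedef struct
> {
>   uint16_t etype; /* first ethertype; 0x8100 when an outer dot1q tag is present */
>   uint16_t tag;   /* TCI of the outer dot1q tag, 0 if untagged */
> } parsed_hdr_t;
>
> static void
> parse_etype_and_tag (const uint8_t *frame, parsed_hdr_t *p)
> {
>   uint16_t net;
>
>   /* the ethertype sits right after the two 6-byte MAC addresses */
>   memcpy (&net, frame + 12, sizeof (net));
>   p->etype = ntohs (net);
>
>   if (p->etype == 0x8100)
>     {
>       memcpy (&net, frame + 14, sizeof (net));
>       p->tag = ntohs (net);
>     }
>   else
>     p->tag = 0;
> }
>
> If either value comes back wrong for a frame that really does carry dot1q 100, 
> the sub-interface lookup fails and ethernet-input falls back to the parent 
> interface in L3 mode, which is exactly the “l3 mac mismatch” drop we see.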
>
> Here is a sample debug profile of the discards. We implemented some 
> down-n-dirty debug statistics (a reconstruction is sketched below, after the 
> gdb output):
> - bad_l3_frm_offset[256] records at which point in the “n_left” sequence of a 
>   given batch a frame was discarded
> - bad_l3_batch_size[256] records the size of the batch of frames being 
>   processed when a discard occurred
>
> (gdb) p bad_l3_frm_offset
> $1 = {1078, 1078, 1078, 1078, 0 <repeats 12 times>, 383, 383, 383, 383, 0 
> <repeats 236 times>}
>
> (gdb) p bad_l3_batch_size
> $2 = {0 <repeats 251 times>, 1424, 0, 0, 1356, 3064}
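>
> Roughly, the instrumentation looks like the following. This is a reconstruction 
> for illustration only (the actual counters live in our local tree); the 
> record_bad_l3() helper and the exact indexing shown here are approximations, 
> but the array names match what is dumped above:
>
> #include <stdint.h>
>
> #define DBG_SLOTS 256
> static uint32_t bad_l3_frm_offset[DBG_SLOTS]; /* discard counts keyed by position (n_left) within a batch */
> static uint32_t bad_l3_batch_size[DBG_SLOTS]; /* batch sizes observed when a discard happened */
>
> static inline void
> record_bad_l3 (uint32_t n_left, uint32_t batch_size)
> {
>   bad_l3_frm_offset[n_left % DBG_SLOTS] += 1;             /* where in the batch the frame died */
>   bad_l3_batch_size[batch_size % DBG_SLOTS] = batch_size; /* how large the batch was at the time */
> }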
>
> I did manage to find the following thread, which seems possibly related to 
> our issue: https://lists.fd.io/g/vpp-dev/message/15488 
> Sharing it just in case it is in fact relevant.
>
> Finally, do the VPP performance regression tests monitor/check the “show 
> errors” output? We are trying to understand how this could have gone 
> unnoticed between the v18.07.1 and v20.05 release efforts, given the 
> simplicity of the configuration and test stimulus.
>
> -Mike
> 
