Hello VPP experts.

I have encountered an unexpected pattern in performance results.
So I wonder, is there a bug somewhere (VPP or CSIT),
or is there a subtle reason why the performance pattern should be expected?

Usually, the more processing VPP has to do in a particular test,
the lower the forwarding rate it achieves.
That is why l2patch tests are usually the fastest.
(I am talking about MRR here: maximal offered load,
and we measure the rate of packets making it back to the Traffic Generator.)
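
For readers unfamiliar with MRR, the measurement can be sketched as below.
This is only an illustration of the idea, not the actual CSIT code;
the function and variable names are hypothetical.

```python
# Illustrative sketch of an MRR (Maximum Receive Rate) measurement:
# offer traffic at the maximal rate, and for each trial report the rate
# of packets that made it back to the traffic generator.
# NOT the actual CSIT implementation, just the concept.

def mrr_trials(offered_rate_pps, trial_duration_s, run_trial, num_trials=10):
    """Run trials at maximal offered load; return forwarding rate per trial.

    `run_trial` stands in for the traffic generator: given an offered rate
    and a duration, it returns the number of packets received back.
    """
    results = []
    for _ in range(num_trials):
        received = run_trial(offered_rate_pps, trial_duration_s)
        results.append(received / trial_duration_s)  # forwarding rate [pps]
    return results

# Example with a fake traffic generator that loses 10% of packets:
fake_tg = lambda rate, dur: int(rate * dur * 0.9)
rates = mrr_trials(37_000_000.0, 1.0, fake_tg)
```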

But I noticed this stops being true in some cases.
Specifically on Cascadelake testbeds, when VPP uses 4 physical cores
(HT on, so 8 VPP workers), l2patch no longer has the best MRR.
(Also seen on Skylake.)

After some examination, I selected two tests to compare.
Bidirectional traffic, single Intel-xxv710 NIC, two ports (one per direction),
4 receive queues per port (9 transmit queues as automatically selected by VPP),
AVF driver, and both tests use l2 cross-connect.
The only difference is that one test additionally handles dot1q tagging on top of that.
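
For reference, the configuration difference is roughly as follows.
This is a hedged sketch in VPP CLI form with hypothetical interface names;
the actual tests drive the configuration through CSIT keywords, not these
exact commands.

```
comment { plain l2xc: cross-connect the two physical ports }
set interface l2 xconnect avf-0/3b/2/0 avf-0/3b/6/0
set interface l2 xconnect avf-0/3b/6/0 avf-0/3b/2/0

comment { dot1q l2xc: cross-connect dot1q sub-interfaces, pop the tag }
create sub-interfaces avf-0/3b/2/0 100
create sub-interfaces avf-0/3b/6/0 200
set interface l2 xconnect avf-0/3b/2/0.100 avf-0/3b/6/0.200
set interface l2 xconnect avf-0/3b/6/0.200 avf-0/3b/2/0.100
set interface l2 tag-rewrite avf-0/3b/2/0.100 pop 1
set interface l2 tag-rewrite avf-0/3b/6/0.200 pop 1
```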

Even though the dot1q test has a larger vectors/call value [1] than the other test [2]
(expected, as the loop with dot1q has more work to do),
and the dot1q test shows a small number of rx discards [3] (the other shows none [4]),
while neither shows anything bad in "show errors",
the dot1q test still forwards above 35 Mpps (see the Message section here [5]
for the 10 trial results),
compared to the other test's under 31 Mpps [6].

The same pattern can be seen in other tests, although there are usually
details differing between the dot1q and plain tests.
For example, with the DPDK driver there is a not-that-small amount of "rx missed",
with the dot1q test already showing a smaller number [7] than the other [8].
(I have even seen tx errors in l2patch tests, but that could be a separate issue.)

So, have you seen this behavior? Can you explain it?
The only guess I have is that the faster test is polling rx queues more frequently,
and that somehow slows down the NIC's ability to actually receive packets fast enough.
I give that less than 1% probability of explaining the difference.
The workers are loaded in a fairly uniform way, so that is not an issue.
Perhaps something in dot1q handling makes it inherently cheaper for the NIC to process?
Even then, I do not think that explains why dot1q-l2xc becomes faster than l2patch.
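
To make the polling guess concrete: the non-empty poll frequency per rx queue
is roughly the per-queue packet rate divided by vectors/call, so a test with
smaller vectors/call at similar throughput touches each queue more often.
A back-of-envelope sketch (the numbers below are hypothetical, not taken from
the linked logs):

```python
# Back-of-envelope: approximate non-empty polls per second on one rx queue
# as (per-queue packet rate) / (average vectors per call).
# All numbers are hypothetical illustrations, not measured values.

def polls_per_second(total_rate_pps, num_queues, vectors_per_call):
    """Approximate non-empty polls per second on one rx queue."""
    per_queue_rate = total_rate_pps / num_queues
    return per_queue_rate / vectors_per_call

plain = polls_per_second(31e6, 8, 40.0)  # smaller vectors/call
dot1q = polls_per_second(35e6, 8, 90.0)  # larger vectors/call
```

Despite the higher throughput, the dot1q test would touch each queue less
often, which is the only mechanism I can imagine behind the guess above.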

Vratko.

[1] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k10-k1-k1-k4-k1
[2] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1-k2-k9-k1-k1-k4-k1
[3] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k10-k1-k8-k1
[4] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1-k2-k9-k1-k8-k1
[5] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s1-t1
[6] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s2-t1
[7] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k10-k1-k8-k1
[8] 
https://logs.fd.io/production/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-clx/153/archives/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k9-k1-k8-k1
View/Reply Online (#16041): https://lists.fd.io/g/vpp-dev/message/16041