Hi Elias,

Since the problem only arises with the VPP rdma driver and not with the DPDK
driver, it is fair to say it is a VPP rdma driver issue.
I'll try to reproduce the issue on my setup and keep you posted.
In the meantime, I do not see a big issue with increasing the rx-queue-size
to mitigate it.

ben

> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Elias Rudberg
> Sent: Friday, February 14, 2020 16:56
> To: vpp-dev@lists.fd.io
> Subject: [vpp-dev] VPP ip4-input drops packets due to "ip4 length > l2
> length" errors when using rdma with Mellanox mlx5 cards
> 
> Hello VPP developers,
> 
> We have a problem with VPP used for NAT on Ubuntu 18.04 servers
> equipped with Mellanox ConnectX-5 network cards (ConnectX-5 EN network
> interface card; 100GbE dual-port QSFP28; PCIe3.0 x16; tall bracket;
> ROHS R6).
> 
> VPP is dropping packets in the ip4-input node due to "ip4 length > l2
> length" errors, when we use the RDMA plugin.
> 
> The interfaces are configured like this:
> 
> create int rdma host-if enp101s0f1 name Interface101 num-rx-queues 1
> create int rdma host-if enp179s0f1 name Interface179 num-rx-queues 1
> 
> (we have set num-rx-queues 1 for now to simplify troubleshooting; in
> production we use num-rx-queues 4)
> 
> We see some packets dropped due to "ip4 length > l2 length", for example
> in TCP tests at around 100 Mbit/s; running such a test for just a few
> seconds already gives some errors. More traffic gives more errors. The
> drops seem unrelated to the contents of the packets and happen quite
> randomly, already at moderate traffic levels very far below what should
> be the capacity of the hardware.
> 
> Only a small fraction of packets is dropped: in tests at 100 Mbit/s
> with a packet size of 500 bytes, about 3 or 4 packets per million get
> the "ip4 length > l2 length" drop. However, the effect appears stronger
> at higher traffic levels, and it has impacted some of our end users,
> who observe decreased TCP speed as a result of these drops.
> 
> The "ip4 length > l2 length" errors can be seen using vppctl "show
> errors":
> 
>     142                ip4-input               ip4 length > l2 length
> 
> To get more info about the "ip4 length > l2 length" error, we printed
> the sizes involved when the error happens (ip_len0 and cur_len0 in
> src/vnet/ip/ip4_input.h). This shows that the actual packet size
> (cur_len0) is often much smaller than ip_len0, which is the IP packet
> size according to the IP header. For example, when ip_len0=500, as is
> the case for many of our packets in the test runs, the cur_len0 value is
> sometimes much smaller. The smallest case we have seen was
> cur_len0 = 59 with ip_len0 = 500: the IP header said the IP packet size
> was 500 bytes, but the actual size was only 59 bytes. So it seems some
> data is lost and packets have been truncated, sometimes with large
> parts of the packet missing.
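For reference, the check behind this error compares the Total Length field of the IPv4 header (ip_len0) with the number of bytes actually present in the buffer (cur_len0) and drops the packet when the header claims more than was received. Below is a minimal standalone sketch of that comparison, assuming a flat byte buffer; the names ip4_total_length, check_ip4_length and len_check_t are hypothetical, and this is not the code in src/vnet/ip/ip4_input.h, which operates on vlib buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch (not VPP's error enum). */
typedef enum { LEN_OK = 0, LEN_IP_GT_L2 = 1 } len_check_t;

/* Total Length is bytes 2-3 of the IPv4 header, in network byte order.
   The caller must provide at least 4 readable bytes. */
uint16_t ip4_total_length(const uint8_t *ip_hdr)
{
  assert(ip_hdr != NULL);
  return (uint16_t)((ip_hdr[2] << 8) | ip_hdr[3]);
}

/* ip_len: size claimed by the IP header; cur_len: bytes actually in the
   buffer from the start of the IP header. A negative difference means the
   packet was truncated somewhere before this check saw it. */
len_check_t check_ip4_length(const uint8_t *ip_hdr, uint32_t cur_len)
{
  int64_t ip_len = ip4_total_length(ip_hdr);
  int64_t len_diff = (int64_t)cur_len - ip_len;
  return len_diff < 0 ? LEN_IP_GT_L2 : LEN_OK;
}
```

With a header whose Total Length is 500 (bytes 0x01 0xf4), a 500-byte buffer passes and a 59-byte buffer is flagged, which is the cur_len0 = 59 / ip_len0 = 500 case above. The check only detects the truncation; it cannot tell where in the rx path the missing bytes were lost.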
> 
> The problems disappear if we skip the RDMA plugin and use the (old?)
> dpdk way of handling the interfaces; then there are no "ip4 length > l2
> length" drops at all. That makes us think there is something wrong with
> the rdma plugin, perhaps a bug or a problem with how it is configured.
> 
> We have tested this with both the current master branch and the
> stable/1908 branch, and we see the same problem on both.
> 
> We tried updating the Mellanox driver from v4.6 to v4.7 (latest
> version) but that did not help.
> 
> After trying some different values of the rx-queue-size parameter to
> the "create int rdma" command, it seems that the number of "ip4 length >
> l2 length" errors decreases as rx-queue-size is increased, perhaps
> indicating that the problem has to do with what happens when the end of
> the queue is reached.
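To make "the end of the queue" concrete: an RX queue is commonly a circular ring of descriptors whose index wraps back to slot 0 when the end is reached. The sketch below is a generic model, not the rdma plugin's data structure; rx_ring_t, ring_slot and consume_and_count_wraps are hypothetical names, and a power-of-two ring size is assumed so the wrap can be done with a mask. It counts how often the ring wraps for a given number of received packets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an RX descriptor ring (not the rdma plugin's). */
typedef struct {
  uint32_t size; /* number of descriptors, assumed to be a power of two */
  uint32_t head; /* free-running consumer counter */
} rx_ring_t;

/* Slot for the current head: the mask wraps it to 0 at the end of the ring. */
uint32_t ring_slot(const rx_ring_t *r)
{
  assert(r->size != 0 && (r->size & (r->size - 1)) == 0);
  return r->head & (r->size - 1);
}

/* Consume n packets and return how many times the ring wrapped to slot 0. */
uint64_t consume_and_count_wraps(rx_ring_t *r, uint64_t n)
{
  uint64_t wraps = 0;
  for (uint64_t i = 0; i < n; i++) {
    r->head++;
    if (ring_slot(r) == 0)
      wraps++;
  }
  return wraps;
}
```

For 1,000,000 packets starting from an empty counter, a 1024-entry ring wraps 976 times and a 4096-entry ring 244 times. If truncation happened only around the wrap point (an unconfirmed hypothesis), the error count would shrink roughly in proportion to 1/rx-queue-size, which is consistent with, but does not prove, the trend described here.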
> 
> Do you agree that the above points to a problem with the RDMA plugin in
> VPP?
> 
> Are there known bugs or other issues that could explain the
> "ip4 length > l2 length" drops?
> 
> Does it seem like a good idea to set a very large value of the
> rx-queue-size parameter if that alleviates the "ip4 length > l2 length"
> problem, or are there big downsides to using a large rx-queue-size
> value?
> 
> What else could we do to troubleshoot this further? Are there
> configuration options for the RDMA plugin that could be used to solve
> this and/or get more information about what is happening?
> 
> Best regards,
> Elias

View/Reply Online (#15431): https://lists.fd.io/g/vpp-dev/message/15431