Hello VPP experts,

There seems to be a problem with the RDMA driver in VPP when using Mellanox ConnectX-5 network interfaces. The problem appears on the master branch and on the stable/2005 branch, while stable/2001 is not affected.
The problem is that when a frame with 2 packets is to be sent, only the first packet is sent immediately while the second gets delayed. The second packet appears to go out only later, when some other frame with other packets is sent; then the delayed packet goes out along with it. This can presumably go undetected when there is lots of traffic all the time, since new traffic keeps flushing out any delayed packets. So to reproduce it, it seems best to use a test setup with very little traffic, with gaps of several seconds between packets; then packets can be delayed for several seconds.

Note that the delay is not visible inside VPP: packet traces look as if the packets were sent immediately, so VPP believes they were sent, but some packets are apparently held in the NIC and only transmitted later. Monitoring the traffic arriving at the other end shows that there was a delay. The behavior seems reproducible, except when other traffic is sent soon afterwards, since that flushes the delayed packets.

The specific case where this came up for us was using VPP for NAT with IPFIX logging enabled, while running ping tests. When a single ICMP echo request is NATed, that usually works fine, but sometimes an IPFIX logging packet is also due to be sent and ends up in the same frame, so the frame holds 2 packets. The IPFIX logging packet is then sent immediately while the ICMP packet is delayed, sometimes so long that the ping times out. I don't think the problem has anything to do with NAT or IPFIX logging; it looks like a more general problem with the rdma plugin.

Testing earlier commits indicates that the problem started with this commit: dc812d9a7 (rdma: introduce direct verb for Cx4/5 tx, 2019-12-16). That commit exists in master and in stable/2005 but not in stable/2001, which fits with the problem being seen on master and stable/2005 but not on stable/2001.
I tried updating to the latest Mellanox driver (v5.0-2.1.8), but that did not help.

In the code in src/plugins/rdma/output.c, the function rdma_device_output_tx_mlx5() seems to handle the packets, but I was not able to fully understand how it works. There is a concept of a "doorbell" function call there: apparently the idea is that when packets are to be sent, information about the packets is prepared and then the doorbell is rung to alert the NIC that there is something to send. From my limited understanding, the doorbell currently results in only the first packet actually being transmitted by the NIC right away, while the remaining packets are somehow held back and sent later. So far I don't understand exactly why that happens or how to fix it.

As a workaround, simply reverting the entire rdma plugin to its state on the stable/2001 branch seems to make the problem disappear, but that probably means we lose the performance gains and other improvements in the newer code.

Can someone with insight into the rdma plugin please help try to fix this?

Best regards,
Elias
View/Reply Online (#16822): https://lists.fd.io/g/vpp-dev/message/16822